# Detecting outliers with the OutlierDetector

The Halerium objective class `OutlierDetector` can be used to detect outliers, i.e. events that are uncommon or unexpected for a given model.

By default, events with a probability lower than that of 5% of a large sample of random events are considered outliers. Optionally, that outlier threshold can be set by the user to any value between 0 (no events are considered outliers) and 1 (all but the most likely events are considered outliers).

## Ingredients

To detect outliers, you only need a graph and data, and optionally an outlier threshold.

## Imports

```
# for handling data:
import numpy as np
# for plotting:
import matplotlib.pyplot as plt
# for some Halerium specific functions:
import halerium.core as hal
# for building graphs:
from halerium.core import Graph, Entity, Variable, show
# for detecting outliers:
from halerium import OutlierDetector
```

## The graph and input data

Let us define a simple graph:

```
g = Graph("g")
with g:
    Variable("z", mean=0, variance=1)
    Variable("x", mean=z, variance=0.1)
    Variable("y", mean=z, variance=0.1)
show(g)
```

This graph just contains the three variables `x`, `y`, and `z`. The means of `x` and `y` are set to `z`, and the variances of `x` and `y` are much smaller than that of `z`. Thus, both `x` and `y` depend on the variable `z` and follow it fairly closely.
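How closely `x` and `y` follow `z` can be checked with a small simulation. The sketch below reproduces the same three distributions in plain NumPy instead of using Halerium, so it only mirrors the structure of the graph under that assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# z ~ N(0, 1); x and y are centered on z with variance 0.1.
z = rng.normal(0.0, 1.0, size=n)
x = rng.normal(z, np.sqrt(0.1))
y = rng.normal(z, np.sqrt(0.1))

corr_xz = np.corrcoef(x, z)[0, 1]
corr_xy = np.corrcoef(x, y)[0, 1]
print(f"corr(x, z) = {corr_xz:.2f}, corr(x, y) = {corr_xy:.2f}")
```

Both correlations come out close to 1, which is why an event with `x` and `y` far apart is unusual for this model even when each value on its own is not.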

Now let’s look at a certain set of data.

```
data = {g.x: [0, 1, 1, 5, 1],
        g.y: [0, 1, -1, 5, np.nan]}
```

Missing values for specific events are indicated by `np.nan`.
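When preparing such data, it can help to check which events are incomplete. A minimal NumPy sketch, using plain arrays with the same values as above rather than the Halerium data dict:

```python
import numpy as np

x = np.array([0, 1, 1, 5, 1], dtype=float)
y = np.array([0, 1, -1, 5, np.nan])

# np.nan never compares equal to itself, so use np.isnan instead of ==.
missing_y = np.isnan(y)
complete = ~np.isnan(x) & ~missing_y

print(missing_y)
print(complete)
```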

## The OutlierDetector

We instantiate an `OutlierDetector`.

```
outlier_detector = OutlierDetector(graph=g,
                                   data=data)
```

We can now call the `outlier_detector` with the graph `g` as argument to estimate which of the events in the data are outliers.

```
outlier_detector(g)
```

```
array([False, False, True, True, False])
```

We can see that the events where `x` and `y` have the same value, 0 or 1, are not considered outliers, and neither is the event with just `x = 1` and a missing `y`. In contrast, the event with `x = 1` and `y = -1` is considered an outlier, as is the event with `x = 5` and `y = 5`.
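The returned boolean array has one entry per event, so it can be used directly as a mask. The following sketch copies the result shown above into a literal array (rather than calling Halerium) and lists the flagged events:

```python
import numpy as np

# Result copied from the output above; one flag per event.
flags = np.array([False, False, True, True, False])
x = np.array([0, 1, 1, 5, 1], dtype=float)
y = np.array([0, 1, -1, 5, np.nan])

outlier_idx = np.flatnonzero(flags)
for i in outlier_idx:
    print(f"event {i}: x = {x[i]}, y = {y[i]}")
```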

One can call the `outlier_detector` with any scopetor of the graph.

```
outlier_detector(g.x)
```

```
array([False, False, True, True, False])
```

The outlier detector then only looks at the values of the variables in that scopetor and judges their likelihood in the context of the provided data. In this case, `g.x` is flagged as an outlier in the same events in which the whole graph is flagged.

Results for multiple graph elements can be obtained using nested structures, e.g. dicts.

```
outlier_detector({'x': g.x, 'y': g.y, 'z': g.z})
```

```
{'x': array([False, False, True, True, False]),
'y': array([False, False, True, True, False]),
'z': array([False, False, False, True, False])}
```

Here we can see that in the event with `x = 1` and `y = -1`, `z` is not flagged as an outlier while `x` and `y` are. This is because `z` is estimated to be `~0`, which is a common value.
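The estimate `z ~ 0` can be reproduced by hand, assuming all three variables are Gaussian with the means and variances given in the graph definition. The helper `posterior_z` below is a hypothetical name for the standard conjugate Gaussian update; it is not part of Halerium.

```python
def posterior_z(x, y, prior_var=1.0, noise_var=0.1):
    """Posterior mean and variance of z given observed x and y.

    Assumes z ~ N(0, prior_var) and x, y ~ N(z, noise_var) independently.
    """
    # Precisions (inverse variances) of the prior and both observations add up.
    precision = 1.0 / prior_var + 2.0 / noise_var
    mean = (x + y) / noise_var / precision
    return mean, 1.0 / precision


mean_conflict, var_conflict = posterior_z(1, -1)
mean_far, var_far = posterior_z(5, 5)
print(mean_conflict, mean_far)
```

For `x = 1` and `y = -1` the two observations pull `z` in opposite directions and the estimate lands at 0, the most common value of `z`. For `x = 5` and `y = 5` the estimate is close to 4.8, far out in the tail of `z`, which is consistent with `z` being flagged only in that event.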

Calling without an argument yields results for the graph, its subgraphs, entities, and variables as a dict.

```
outlier_detector()
```

```
{'g': array([False, False, True, True, False]),
'g/z': array([False, False, False, True, False]),
'g/x': array([False, False, True, True, False]),
'g/y': array([False, False, True, True, False])}
```

## Options

### Calculation method

By default, missing data will be estimated by sampling (`method="upsampled"`). Alternatively, their values can be marginalized over (`method="marginalized"`).

*Warning: the marginalized method can be very memory consuming.*

```
outlier_detector_default = OutlierDetector(graph=g,
                                           data=data,
                                           method="upsampled")
outlier_detector_marginalized = OutlierDetector(graph=g,
                                                data=data,
                                                method="marginalized")
```

In our example, where there are no data for `g.z`, the difference lies in the outlier detection for the individual variables `g.x` and `g.y`. With the default method `"upsampled"`, their values are coupled, so `g.x` was flagged as an outlier in the event where `g.x` was `+1` and `g.y` was `-1`. This is because `g.z` was estimated to be around `0` for that event.

With the `"marginalized"` method, the individual values are not coupled. After marginalizing over `z`, a value of `+1` is not at all uncommon. Therefore, that entry is not flagged as an outlier.

```
outlier_detector_default(g.x)
```

```
array([False, False, True, True, False])
```

```
outlier_detector_marginalized(g.x)
```

```
array([False, False, False, True, False])
```
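The different verdicts for `x = +1` in the event with `y = -1` can be illustrated with a back-of-the-envelope calculation. This is a deliberate simplification and not how Halerium computes its result: the coupled case is approximated by fixing `z` at exactly 0, whereas the `"upsampled"` method samples the missing values. The helper `two_sided_tail` is a hypothetical name.

```python
import math


def two_sided_tail(value, mean, variance):
    """Probability that a Gaussian draw lies at least as far from the mean as `value`."""
    z_score = abs(value - mean) / math.sqrt(variance)
    return math.erfc(z_score / math.sqrt(2.0))


# Marginalized over z: x ~ N(0, 1 + 0.1), so +1 is less than one standard deviation out.
marginal = two_sided_tail(1.0, mean=0.0, variance=1.1)

# Coupled to z, approximated here as z = 0: x ~ N(0, 0.1), so +1 is about 3.2 standard deviations out.
conditional = two_sided_tail(1.0, mean=0.0, variance=0.1)

print(f"marginal: {marginal:.3f}, conditional: {conditional:.4f}")
```

Under these assumptions the marginal tail probability is well above 5%, while the conditional one is well below it, matching the pattern of the two results above.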

### Outlier threshold

By default, events with a probability less than that of 5% of a large sample of random events are considered outliers. That 5% threshold can be adjusted to other values between 0 and 1. The higher the threshold, the more events are classified as outliers.

```
outlier_detector = OutlierDetector(graph=g,
                                   data=data,
                                   outlier_threshold=0.7)
outlier_detector()
```

```
{'g': array([False, True, True, True, True]),
'g/z': array([False, True, False, True, True]),
'g/x': array([ True, True, True, True, True]),
'g/y': array([ True, True, True, True, True])}
```
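One way to read the threshold is as the share of ordinary, in-model events that would be flagged. The sketch below checks this for a quantile-based rule on a one-dimensional standard normal model; it is an assumption-laden illustration in NumPy, and `flagged_fraction` is a hypothetical helper, not a Halerium function.

```python
import numpy as np


def flagged_fraction(threshold, n_reference=20_000, n_test=20_000, seed=0):
    """Share of fresh in-model events flagged by a quantile-based outlier rule."""
    rng = np.random.default_rng(seed)

    def log_prob(values):
        # Log-density of a standard normal, up to an additive constant.
        return -0.5 * values**2

    # The cutoff is the `threshold` quantile of the reference log-probabilities.
    cutoff = np.quantile(log_prob(rng.standard_normal(n_reference)), threshold)
    test_events = rng.standard_normal(n_test)
    return float(np.mean(log_prob(test_events) < cutoff))


for threshold in (0.05, 0.7):
    print(threshold, flagged_fraction(threshold))
```

With a threshold of 0.7, roughly 70% of perfectly typical events are flagged, which is consistent with most entries in the output above being `True`.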

### Accuracy

The accuracy of the `OutlierDetector` is affected by the number of examples used for estimating probabilities and ranks.

Small numbers of examples yield quick results at the expense of accuracy.

```
outlier_detector = OutlierDetector(graph=g,
                                   data=data,
                                   n_samples=100)
outlier_detector()
```

```
{'g': array([False, False, True, True, False]),
'g/z': array([False, False, False, True, False]),
'g/x': array([False, False, True, True, False]),
'g/y': array([False, False, True, True, False])}
```

Many examples yield higher accuracy at the expense of speed (in the current implementation, the computing time increases quadratically with the number of examples).

```
outlier_detector = OutlierDetector(graph=g,
                                   data=data,
                                   n_samples=1000)
outlier_detector()
```

```
{'g': array([False, False, True, True, False]),
'g/z': array([False, False, False, True, False]),
'g/x': array([False, False, True, True, False]),
'g/y': array([False, False, True, True, False])}
```
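Why the number of examples matters can be illustrated outside of Halerium: a cutoff estimated as a quantile of sampled log-probabilities fluctuates more when fewer examples are used. The sketch below measures that fluctuation for a one-dimensional standard normal model; `cutoff_spread` is a hypothetical helper and the model is an assumption made for brevity.

```python
import numpy as np


def cutoff_spread(n_samples, n_repeats=200, threshold=0.05, seed=0):
    """Standard deviation of the estimated cutoff over repeated sampling runs."""
    rng = np.random.default_rng(seed)
    draws = rng.standard_normal((n_repeats, n_samples))
    log_prob = -0.5 * draws**2  # standard normal log-density up to a constant
    # One cutoff estimate per run: the `threshold` quantile of that run's log-probabilities.
    cutoffs = np.quantile(log_prob, threshold, axis=1)
    return float(np.std(cutoffs))


spread_small = cutoff_spread(100)
spread_large = cutoff_spread(1000)
print(spread_small, spread_large)
```

Events whose probability lies close to the cutoff are the ones most likely to change their verdict between runs when only a few examples are used. In the example data all events are far from the cutoff, so both settings above yield the same flags.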
