Detecting outliers with the OutlierDetector

The Halerium objective class OutlierDetector can be used to detect outliers, i.e. events that are uncommon or unexpected for a given model.

By default, an event is considered an outlier if it ranks among the least probable 5% of a large sample of random events. Optionally, the user can set that outlier threshold to any value between 0 (no events are considered outliers) and 1 (all but the most likely events are considered outliers).

Ingredients

To detect outliers, you only need a graph and data, and optionally an outlier threshold.

Imports

[1]:
# for handling data:
import numpy as np

# for plotting:
import matplotlib.pyplot as plt

# for some Halerium specific functions:
import halerium.core as hal

# for building graphs:
from halerium.core import Graph, Entity, Variable, show

# for detecting outliers:
from halerium import OutlierDetector

The graph and input data

Let us define a simple graph:

[2]:
g = Graph("g")
with g:
    Variable("z", mean=0, variance=1)
    Variable("x", mean=z, variance=0.1)
    Variable("y", mean=z, variance=0.1)

show(g)

This graph just contains the three variables x, y, and z. The means of x and y are set to z, and the variances of x and y are much smaller than that of z. Thus, both x and y depend on the variable z and follow it fairly closely.
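To see how closely x and y follow z, the same generative structure can be simulated directly. The following is a minimal sketch in plain NumPy, not Halerium code; it assumes that the variables are normally distributed with the stated means and variances, and all names in it are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_events = 10_000

# z has mean 0 and variance 1; x and y scatter around z with variance 0.1.
z = rng.normal(loc=0.0, scale=1.0, size=n_events)
x = rng.normal(loc=z, scale=np.sqrt(0.1))
y = rng.normal(loc=z, scale=np.sqrt(0.1))

corr_xy = np.corrcoef(x, y)[0, 1]
corr_xz = np.corrcoef(x, z)[0, 1]
print(f"corr(x, y) = {corr_xy:.2f}, corr(x, z) = {corr_xz:.2f}")
```

Under these assumptions the expected correlations are 1/1.1 ≈ 0.91 between x and y and 1/sqrt(1.1) ≈ 0.95 between x and z, which is why an event with x = 1 and y = -1 is unusual even though each value on its own is not.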

Now let’s look at a certain set of data.

[3]:
data = {g.x: [0,  1,  1, 5,      1],
        g.y: [0,  1, -1, 5, np.nan]}

Missing values for specific events are indicated by np.nan.
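As a quick check of which entries are present, the missing values can be located with np.isnan. This sketch uses plain string keys instead of the graph variables so that it runs without Halerium; the names columns, observed, and n_observed are illustrative only.

```python
import numpy as np

# Same values as in the data above, keyed by plain names for illustration.
columns = {"x": np.array([0, 1, 1, 5, 1], dtype=float),
           "y": np.array([0, 1, -1, 5, np.nan])}

# True where a value is present, False where it is missing.
observed = {name: ~np.isnan(values) for name, values in columns.items()}

# Number of observed variables per event.
n_observed = sum(mask.astype(int) for mask in observed.values())
print(n_observed)
```

Only the last event has a missing entry (y), so it is the only event for which a value has to be estimated or marginalized over.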

The OutlierDetector

We instantiate an OutlierDetector.

[4]:
outlier_detector = OutlierDetector(graph=g,
                                   data=data)

We can now call the outlier_detector with the graph g as argument to estimate which of the events in the data are outliers.

[5]:
outlier_detector(g)
[5]:
array([False, False,  True,  True, False])

We can see that the events where x and y have the same value 0 or 1, and the event with just x = 1 (y missing), are not considered outliers. In contrast, the event with x = 1 and y = -1 is considered an outlier, as is the event with x = 5 and y = 5.
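One way to picture the rank test behind these flags is the following sketch. It is an illustration under stated assumptions, not Halerium's implementation: the variables are taken to be Gaussian, z is integrated out in closed form, an event with a missing y is ranked on x alone, and the helper names (sample_events, log_density, flag_outliers) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Covariance of (x, y) once z ~ N(0, 1) is integrated out:
# var(x) = var(y) = 1 + 0.1 and cov(x, y) = var(z) = 1.
COV = np.array([[1.1, 1.0],
                [1.0, 1.1]])


def sample_events(n):
    """Draw n random (x, y) events from the example graph."""
    z = rng.normal(0.0, 1.0, size=n)
    x = rng.normal(z, np.sqrt(0.1))
    y = rng.normal(z, np.sqrt(0.1))
    return np.column_stack([x, y])


def log_density(points, observed):
    """Gaussian log-density of the observed columns of each row."""
    cov = COV[np.ix_(observed, observed)]
    values = points[:, observed]
    quad = np.einsum("ni,ij,nj->n", values, np.linalg.inv(cov), values)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (quad + logdet + len(observed) * np.log(2.0 * np.pi))


def flag_outliers(events, threshold=0.05, n_samples=10_000):
    """Flag events that rank among the least likely `threshold` fraction."""
    events = np.asarray(events, dtype=float)
    reference = sample_events(n_samples)
    flags = np.zeros(len(events), dtype=bool)
    for i, event in enumerate(events):
        observed = np.flatnonzero(~np.isnan(event))
        reference_logp = log_density(reference, observed)
        event_logp = log_density(event[None, :], observed)[0]
        # Fraction of random events that are less probable than this event.
        rank = np.mean(reference_logp < event_logp)
        flags[i] = rank < threshold
    return flags


events = [[0, 0], [1, 1], [1, -1], [5, 5], [1, np.nan]]
flags = flag_outliers(events)
print(flags)
```

With the default threshold of 0.05 this sketch flags the third and fourth events only, the same pattern as the detector output above.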

One can call the outlier_detector with any scopetor of the graph, e.g. an individual variable.

[6]:
outlier_detector(g.x)
[6]:
array([False, False,  True,  True, False])

The outlier detector then only looks at the values of the variables in that scopetor and judges their likelihood in the context of the provided data. In this case g.x is flagged as an outlier in exactly those events in which the whole graph is flagged as well.

Results for multiple graph elements can be obtained using nested structures, e.g. dicts.

[7]:
outlier_detector({'x': g.x, 'y': g.y, 'z': g.z})
[7]:
{'x': array([False, False,  True,  True, False]),
 'y': array([False, False,  True,  True, False]),
 'z': array([False, False, False,  True, False])}

Here we can see that in the event with x = 1 and y = -1, z is not flagged as an outlier while x and y are. This is because z is estimated to be approximately 0, which is a common value.
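The estimate of z for each event can be checked by hand, because for a Gaussian version of this graph the posterior of z is available in closed form. The sketch below is plain NumPy rather than Halerium code. The helper posterior_z is hypothetical, and the 1.96 cut is a two-sided 5% Gaussian rule used only as a rough stand-in for the detector's rank test.

```python
import numpy as np

PRIOR_VAR = 1.0   # variance of z
NOISE_VAR = 0.1   # variance of x and y around z


def posterior_z(x, y):
    """Posterior mean and variance of z given x and y (NaN = missing)."""
    obs = np.column_stack([x, y]).astype(float)
    n_observed = (~np.isnan(obs)).sum(axis=1)
    # Precisions add up: one term for the prior, one per observed value.
    precision = 1.0 / PRIOR_VAR + n_observed / NOISE_VAR
    mean = np.nansum(obs, axis=1) / NOISE_VAR / precision
    return mean, 1.0 / precision


x = [0, 1, 1, 5, 1]
y = [0, 1, -1, 5, np.nan]
z_mean, z_var = posterior_z(x, y)
print(np.round(z_mean, 2))        # [0.   0.95 0.   4.76 0.91]

# Is the estimated z unusual under its prior N(0, 1)?
z_unusual = np.abs(z_mean) > 1.96 * np.sqrt(PRIOR_VAR)
print(z_unusual)                  # [False False False  True False]
```

For x = 1 and y = -1 the two observations cancel and the estimate of z is 0, whereas for x = 5 and y = 5 the estimate is about 4.76, far out in the tail of the prior. This agrees with the 'z' flags shown above.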

Calling without an argument yields results for the graph, its subgraphs, entities, and variables as a dict.

[8]:
outlier_detector()
[8]:
{'g': array([False, False,  True,  True, False]),
 'g/z': array([False, False, False,  True, False]),
 'g/x': array([False, False,  True,  True, False]),
 'g/y': array([False, False,  True,  True, False])}

Options

Calculation method

By default, missing data are estimated by sampling (method="upsampled"). Alternatively, the missing values can be marginalized over (method="marginalized").

Warning: the marginalized method can consume a lot of memory.

[9]:
outlier_detector_default = OutlierDetector(graph=g,
                                           data=data,
                                           method="upsampled")

outlier_detector_marginalized = OutlierDetector(graph=g,
                                                data=data,
                                                method="marginalized")

In the case of our example, where there are no data for g.z, the difference lies in the outlier detection for the individual variables g.x and g.y. With the default method “upsampled” their values are coupled so that g.x was flagged as an outlier when g.y was -1 and g.x was +1. This is because g.z was estimated to be around 0 for that case.

With the “marginalized” method the individual values are not coupled. After marginalizing over z, a value of +1 is not at all uncommon. Therefore that entry is not flagged as an outlier.

[10]:
outlier_detector_default(g.x)
[10]:
array([False, False,  True,  True, False])
[11]:
outlier_detector_marginalized(g.x)
[11]:
array([False, False, False,  True, False])
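The difference for the third event (x = 1, y = -1) can be illustrated with a short piece of arithmetic. This is a sketch assuming Gaussian variables, and it does not reproduce Halerium's actual computation: it simply expresses x = 1 in standard deviations, once with z marginalized out and once with z fixed at its estimate of 0.

```python
import math

x_value = 1.0
z_estimate = 0.0          # assumed estimate of z for x = 1, y = -1
var_z, var_noise = 1.0, 0.1

# Marginalized over z: x has mean 0 and variance var_z + var_noise.
marginal_sigma = abs(x_value) / math.sqrt(var_z + var_noise)

# Coupled to the estimate of z: x has mean z_estimate and variance var_noise.
conditional_sigma = abs(x_value - z_estimate) / math.sqrt(var_noise)

print(f"marginal:    {marginal_sigma:.2f} standard deviations")
print(f"conditional: {conditional_sigma:.2f} standard deviations")
```

Marginally, x = 1 lies less than one standard deviation from the mean, while relative to z ≈ 0 it lies more than three standard deviations away. This matches the qualitative behavior of the two methods for that event.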

Outlier threshold

By default, events with a probability less than that of 5% of a large sample of random events are considered outliers. That 5% threshold can be adjusted to other values between 0 and 1. The higher the threshold, the more events are classified as outliers.

[12]:
outlier_detector = OutlierDetector(graph=g,
                                   data=data,
                                   outlier_threshold=0.7)
outlier_detector()
[12]:
{'g': array([False,  True,  True,  True,  True]),
 'g/z': array([False,  True, False,  True,  True]),
 'g/x': array([ True,  True,  True,  True,  True]),
 'g/y': array([ True,  True,  True,  True,  True])}
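The effect of the threshold on the whole-graph flags can be sketched in closed form. The assumptions are the same Gaussian reading of the graph with z integrated out, and a ranking on x alone when y is missing. The function tail_rank is a hypothetical helper, not part of Halerium; it returns the expected fraction of random events that are less probable than a given event.

```python
import math
import numpy as np

# Covariance of (x, y) after integrating out z.
COV = np.array([[1.1, 1.0],
                [1.0, 1.1]])


def tail_rank(x, y):
    """Expected fraction of random events less probable than (x, y)."""
    if math.isnan(y):
        # Only x observed: one-dimensional Gaussian tail on both sides.
        q = x ** 2 / COV[0, 0]
        return math.erfc(math.sqrt(q / 2.0))
    v = np.array([x, y], dtype=float)
    q = float(v @ np.linalg.solve(COV, v))
    # Squared Mahalanobis distance in 2D is chi-squared with 2 dof.
    return math.exp(-q / 2.0)


events = [(0, 0), (1, 1), (1, -1), (5, 5), (1, float("nan"))]
ranks = [tail_rank(x, y) for x, y in events]

flags_by_threshold = {
    threshold: [rank < threshold for rank in ranks]
    for threshold in (0.0, 0.05, 0.7, 1.0)
}
for threshold, flags in flags_by_threshold.items():
    print(threshold, flags)
```

In this sketch a threshold of 0 flags nothing, 0.05 flags the third and fourth events, and 0.7 additionally flags the second and fifth events, as in the 'g' entry above. A threshold of 1 flags everything except the single most likely event (0, 0).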

Accuracy

The accuracy of the OutlierDetector is affected by the number of examples used for estimating probabilities and ranks.

Small numbers of examples yield quick results at the expense of accuracy.

[13]:
outlier_detector = OutlierDetector(graph=g,
                                   data=data,
                                   n_samples=100)
outlier_detector()
[13]:
{'g': array([False, False,  True,  True, False]),
 'g/z': array([False, False, False,  True, False]),
 'g/x': array([False, False,  True,  True, False]),
 'g/y': array([False, False,  True,  True, False])}

Many examples yield higher accuracy at the expense of speed (in the current implementation, the computing time increases quadratically with the number of examples).

[14]:
outlier_detector = OutlierDetector(graph=g,
                                   data=data,
                                   n_samples=1000)
outlier_detector()
[14]:
{'g': array([False, False,  True,  True, False]),
 'g/z': array([False, False, False,  True, False]),
 'g/x': array([False, False,  True,  True, False]),
 'g/y': array([False, False,  True,  True, False])}
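Why more examples give more reliable flags can be made concrete with a generic Monte Carlo argument. This is not a measurement of Halerium itself, and all names here are illustrative. If an event's true rank is exactly at the 5% threshold, each random example is less probable than the event with probability 0.05, so the rank estimated from n examples fluctuates like a binomial proportion.

```python
import numpy as np

rng = np.random.default_rng(0)


def rank_estimates(n_samples, true_rank=0.05, n_repeats=2000):
    """Simulate repeated rank estimates, each based on n_samples examples."""
    hits = rng.binomial(n_samples, true_rank, size=n_repeats)
    return hits / n_samples


spread = {}
for n in (100, 1000):
    spread[n] = rank_estimates(n).std()
    expected = np.sqrt(0.05 * 0.95 / n)
    print(f"n_samples={n}: simulated spread {spread[n]:.4f}, "
          f"binomial formula {expected:.4f}")
```

The spread of the estimated rank shrinks roughly with the square root of the number of examples. In this simplified picture, events whose rank is close to the threshold may therefore be classified differently from run to run when few examples are used, while clear-cut events such as those in the example data are unaffected.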