Outlier Detection#

[1]:
%%capture
# execute the creation & training notebook first
%run "02-01-creation_and_training.ipynb"

Outlier detection is about detecting unusual data points in a test data set. Let’s create such a test data set.

[2]:
np.random.seed(42)
n_data = 100
# same structure as the training data: a is a root node, b depends on a,
# and c depends on both a and b
parameter_a = 5 + np.random.randn(n_data) * 0.1
parameter_b = parameter_a * (-35) + 150 + np.random.randn(n_data) * 1.
parameter_c = parameter_a * 10.5 + parameter_b * (.5) + np.random.randn(n_data) * 0.01

test_data = pd.DataFrame(data={"(a)": parameter_a,
                               "(b|a)": parameter_b,
                               "(c|a,b)": parameter_c})

Now let us hide outliers in the test_data by modifying some of the values.

[3]:
mod_test_data = test_data.copy()
# individually plausible values whose combinations break the relations above
mod_test_data.iloc[50] = [4.9, -27, 41]
mod_test_data.iloc[60] = [5.1, -28, 40.5]

Let’s see how the modified values stand out in scatter plots.

[4]:
pl.figure(figsize=(10, 10))
# (a) vs (b|a): all points in black, the modified points in red
ax = pl.subplot(2, 2, 1)
ax.scatter(mod_test_data["(a)"], mod_test_data["(b|a)"], marker="x", color="k")
ax.scatter(mod_test_data["(a)"].loc[[50, 60]],
           mod_test_data["(b|a)"].loc[[50, 60]],
           marker="x", color="r")
ax.set_xlabel("(a)")
ax.set_ylabel("(b|a)")
# (a) vs (c|a,b)
ax = pl.subplot(2, 2, 3)
ax.scatter(mod_test_data["(a)"], mod_test_data["(c|a,b)"], marker="x", color="k")
ax.scatter(mod_test_data["(a)"].loc[[50, 60]],
           mod_test_data["(c|a,b)"].loc[[50, 60]],
           marker="x", color="r")
ax.set_xlabel("(a)")
ax.set_ylabel("(c|a,b)")
# (c|a,b) vs (b|a)
ax = pl.subplot(2, 2, 2)
ax.scatter(mod_test_data["(c|a,b)"], mod_test_data["(b|a)"], marker="x", color="k")
ax.scatter(mod_test_data["(c|a,b)"].loc[[50, 60]],
           mod_test_data["(b|a)"].loc[[50, 60]],
           marker="x", color="r")
ax.set_xlabel("(c|a,b)")
ax.set_ylabel("(b|a)")
[4]:
Text(0, 0.5, '(b|a)')
[Figure: three scatter plots of (a) vs (b|a), (a) vs (c|a,b), and (c|a,b) vs (b|a); the modified points 50 and 60 are marked in red.]

Visually, the outliers stand out in at least one of the scatter plots, even though each value on its own looks fairly normal for its parameter. Let’s see what the .detect_outliers method says.

[5]:
outliers = causal_structure.detect_outliers(data=mod_test_data)
outliers
[5]:
(a) (b|a) (c|a,b) graph
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
... ... ... ... ...
95 False False False False
96 False False False False
97 False False False False
98 False False False False
99 False False False False

100 rows × 4 columns

The result is again a pandas.DataFrame, this time with boolean entries: False means no outlier, True means outlier. Within a single row, any of the parameters can be classified as an outlier. These per-parameter columns estimate which value(s) were actually unusual in their specific combination, i.e. given the values they are conditioned on.
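
Since the entries are plain booleans, ordinary pandas operations apply. For instance, counting the flagged values per column is a one-liner (a quick sketch using standard pandas, not a dedicated API call):

# count the flagged values per column; True counts as 1 in the sum
outliers.sum()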

There is also a column for the whole graph. This column judges whether the data point as a whole is an outlier. For most purposes this is the decisive column.
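
To pull out the full data points that were judged unusual overall, we can index the test data with that column (again ordinary pandas, shown here as a sketch):

# select the rows whose value combination is flagged as an outlier overall
mod_test_data[outliers["graph"]]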

Let’s have a look at the rows that contain outliers.

[6]:
# keep the rows in which at least one column is flagged
detected_outliers = outliers[outliers.sum(axis=1) > 0]
detected_outliers
[6]:
(a) (b|a) (c|a,b) graph
25 False True False False
31 True False False False
50 False True True True
60 False False True True
74 True False False False

The outlier detector found the two data points we were looking for (50 and 60), but it also flagged a few others.

Let’s mark them in our plot.

[7]:
pl.figure(figsize=(10, 10))
# (a) vs (b|a): all points in black, detected outliers in orange, modified points in red
ax = pl.subplot(2, 2, 1)
ax.scatter(mod_test_data["(a)"], mod_test_data["(b|a)"], marker="x", color="k")
ax.scatter(mod_test_data["(a)"].loc[detected_outliers.index],
           mod_test_data["(b|a)"].loc[detected_outliers.index],
           marker="x", color="orange")
ax.scatter(mod_test_data["(a)"].loc[[50, 60]],
           mod_test_data["(b|a)"].loc[[50, 60]],
           marker="x", color="r")
ax.set_xlabel("(a)")
ax.set_ylabel("(b|a)")
# (a) vs (c|a,b)
ax = pl.subplot(2, 2, 3)
ax.scatter(mod_test_data["(a)"], mod_test_data["(c|a,b)"], marker="x", color="k")
ax.scatter(mod_test_data["(a)"].loc[detected_outliers.index],
           mod_test_data["(c|a,b)"].loc[detected_outliers.index],
           marker="x", color="orange")
ax.scatter(mod_test_data["(a)"].loc[[50, 60]],
           mod_test_data["(c|a,b)"].loc[[50, 60]],
           marker="x", color="r")
ax.set_xlabel("(a)")
ax.set_ylabel("(c|a,b)")
# (c|a,b) vs (b|a)
ax = pl.subplot(2, 2, 2)
ax.scatter(mod_test_data["(c|a,b)"], mod_test_data["(b|a)"], marker="x", color="k")
ax.scatter(mod_test_data["(c|a,b)"].loc[detected_outliers.index],
           mod_test_data["(b|a)"].loc[detected_outliers.index],
           marker="x", color="orange")
ax.scatter(mod_test_data["(c|a,b)"].loc[[50, 60]],
           mod_test_data["(b|a)"].loc[[50, 60]],
           marker="x", color="r")
ax.set_xlabel("(c|a,b)")
ax.set_ylabel("(b|a)")
[7]:
Text(0, 0.5, '(b|a)')
[Figure: the same three scatter plots; detected outliers are marked in orange, the modified points 50 and 60 in red.]

The red points are the values we modified (and that the outlier detector identified); the orange points are the other detected outliers.

The outlier threshold#

Why did the outlier detector classify these points as outliers if they actually came from the correct data generation process? The reason is the outlier threshold. Its default value is \(0.05\). This means that data points whose values fall into the least likely \(5\%\) of the data generation process are classified as outliers. So we expect around \(5\%\) outliers even in data that come from the correct data generation process; with our \(100\) test points, that is roughly \(100 \times 0.05 = 5\) flagged rows.

Setting this threshold is a tradeoff between the sensitivity of outlier detection and the rate of false positives.

Let’s try a stricter value, \(0.001\) instead of \(0.05\).

[8]:
outliers = causal_structure.detect_outliers(data=mod_test_data,
                                            outlier_threshold=0.001)
outliers[outliers.sum(axis=1) > 0]
[8]:
(a) (b|a) (c|a,b) graph
50 False True True True
60 False False True True

We see that with the stricter threshold the false positives are gone.

However, if we make the threshold too strict, we might no longer detect even the modified data points.
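
As an illustrative, unexecuted sketch (the exact threshold at which detection breaks down depends on the data and the trained models), an extremely strict value might return an empty result even for rows 50 and 60:

# with a very strict threshold, even the modified rows may no longer be flagged
very_strict = causal_structure.detect_outliers(data=mod_test_data,
                                               outlier_threshold=1e-6)
very_strict[very_strict.sum(axis=1) > 0]  # may well be empty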

For further details about the OutlierDetector see the corresponding section in the core documentation.

In the next section we will have a look at influence estimation.