{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Outlier Detection" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "# execute the creation & training notebook first\n", "%run \"02-01-creation_and_training.ipynb\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Outlier detection is about detecting unusual data points in a test data set. Let's create such a test data set." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)\n", "n_data = 100\n", "parameter_a = 5 + np.random.randn(n_data) * 0.1\n", "parameter_b = parameter_a * (-35) + 150 + np.random.randn(n_data) * 1.\n", "parameter_c = parameter_a * 10.5 + parameter_b * (.5) + np.random.randn(n_data) * 0.01\n", "\n", "test_data = pd.DataFrame(data={\"(a)\": parameter_a,\n", " \"(b|a)\": parameter_b,\n", " \"(c|a,b)\": parameter_c})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let us hide outliers in the test_data by modifying some of the values." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "mod_test_data = test_data.copy()\n", "mod_test_data.iloc[50] = [4.9, -27, 41]\n", "mod_test_data.iloc[60] = [5.1, -28, 40.5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see how the modified values stand out in scatter plots." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, '(b|a)')" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pl.figure(figsize=(10,10))\n", "fig = pl.subplot(2,2,1)\n", "fig.scatter(mod_test_data[\"(a)\"], mod_test_data[\"(b|a)\"], marker=\"x\", color=\"k\")\n", "fig.scatter(mod_test_data[\"(a)\"].loc[[50, 60]],\n", " mod_test_data[\"(b|a)\"].loc[[50, 60]],\n", " marker=\"x\", color=\"r\")\n", "fig.set_xlabel(\"(a)\")\n", "fig.set_ylabel(\"(b|a)\")\n", "fig = pl.subplot(2,2,3)\n", "fig.scatter(mod_test_data[\"(a)\"], mod_test_data[\"(c|a,b)\"], marker=\"x\", color=\"k\")\n", "fig.scatter(mod_test_data[\"(a)\"].loc[[50, 60]],\n", " mod_test_data[\"(c|a,b)\"].loc[[50, 60]],\n", " marker=\"x\", color=\"r\")\n", "fig.set_xlabel(\"(a)\")\n", "fig.set_ylabel(\"(c|a,b)\")\n", "fig = pl.subplot(2,2,2)\n", "fig.scatter(mod_test_data[\"(c|a,b)\"], mod_test_data[\"(b|a)\"], marker=\"x\", color=\"k\")\n", "fig.scatter(mod_test_data[\"(c|a,b)\"].loc[[50, 60]],\n", " mod_test_data[\"(b|a)\"].loc[[50, 60]],\n", " marker=\"x\", color=\"r\")\n", "fig.set_xlabel(\"(c|a,b)\")\n", "fig.set_ylabel(\"(b|a)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visually the outliers are fairly detectable in at least one of the scatter plots, but individually the values for each parameter are fairly normal. Let's see what the ``.detect_outliers`` method says." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " (a) (b|a) (c|a,b) graph\n", "0 False False False False\n", "1 False False False False\n", "2 False False False False\n", "3 False False False False\n", "4 False False False False\n", ".. ... ... ... ...\n", "95 False False False False\n", "96 False False False False\n", "97 False False False False\n", "98 False False False False\n", "99 False False False False\n", "\n", "[100 rows x 4 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outliers = causal_structure.detect_outliers(data=mod_test_data)\n", "outliers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is again a ``pandas.DataFrame``, this time with boolean entries: ``False`` means no outlier, ``True`` means outlier.\n", "In a particular row any of the parameters can be classified as an outlier. This estimates which parameter(s) were actually unusual in their specific combination.\n", "\n", "There is also a column for the whole graph. This column judges whether the data point as a whole is an outlier.\n", "For most purposes this is the decisive column.\n", "\n", "Let's have a look at the rows that contain outliers." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " (a) (b|a) (c|a,b) graph\n", "25 False True False False\n", "31 True False False False\n", "50 False True True True\n", "60 False False True True\n", "74 True False False False" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "detected_outliers = outliers[outliers.sum(axis=1)>0]\n", "detected_outliers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The outlier detector found the two data points we were looking for (50 and 60), but also others.\n", "\n", "Let's mark them in our plot." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, '(b|a)')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pl.figure(figsize=(10,10))\n", "fig = pl.subplot(2,2,1)\n", "fig.scatter(mod_test_data[\"(a)\"], mod_test_data[\"(b|a)\"], marker=\"x\", color=\"k\")\n", "fig.scatter(mod_test_data[\"(a)\"].loc[detected_outliers.index],\n", " mod_test_data[\"(b|a)\"].loc[detected_outliers.index],\n", " marker=\"x\", color=\"orange\")\n", "fig.scatter(mod_test_data[\"(a)\"].loc[[50, 60]],\n", " mod_test_data[\"(b|a)\"].loc[[50, 60]],\n", " marker=\"x\", color=\"r\")\n", "fig.set_xlabel(\"(a)\")\n", "fig.set_ylabel(\"(b|a)\")\n", "fig = pl.subplot(2,2,3)\n", "fig.scatter(mod_test_data[\"(a)\"], mod_test_data[\"(c|a,b)\"], marker=\"x\", color=\"k\")\n", "fig.scatter(mod_test_data[\"(a)\"].loc[detected_outliers.index],\n", " mod_test_data[\"(c|a,b)\"].loc[detected_outliers.index],\n", " marker=\"x\", color=\"orange\")\n", "fig.scatter(mod_test_data[\"(a)\"].loc[[50, 60]],\n", " mod_test_data[\"(c|a,b)\"].loc[[50, 60]],\n", " marker=\"x\", color=\"r\")\n", "fig.set_xlabel(\"(a)\")\n", "fig.set_ylabel(\"(c|a,b)\")\n", "fig = pl.subplot(2,2,2)\n", "fig.scatter(mod_test_data[\"(c|a,b)\"], mod_test_data[\"(b|a)\"], marker=\"x\", color=\"k\")\n", "fig.scatter(mod_test_data[\"(c|a,b)\"].loc[detected_outliers.index],\n", " mod_test_data[\"(b|a)\"].loc[detected_outliers.index],\n", " marker=\"x\", color=\"orange\")\n", "fig.scatter(mod_test_data[\"(c|a,b)\"].loc[[50, 60]],\n", " mod_test_data[\"(b|a)\"].loc[[50, 60]],\n", " marker=\"x\", color=\"r\")\n", "fig.set_xlabel(\"(c|a,b)\")\n", "fig.set_ylabel(\"(b|a)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The red points are the values we modified (and that the outlier detector identified), the orange points are the other detected outliers. \n", "\n", "#### The outlier threshold\n", "\n", "Why did the outlier detector classify these as outliers if they actually came from the correct data generation process? 
The reason for this is the outlier threshold. Its default value is $0.05$. This means that values that fall among the least likely $5\\%$ under the data generation process are classified as outliers. So we expect around $5\\%$ outliers even in data that come from the correct data generation process.\n", "\n", "Setting this threshold is a tradeoff between the sensitivity of outlier detection and the rate of false positives.\n", "\n", "Let's try a stricter value, $0.001$ instead of $0.05$." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " (a) (b|a) (c|a,b) graph\n", "50 False True True True\n", "60 False False True True" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outliers = causal_structure.detect_outliers(data=mod_test_data,\n", " outlier_threshold=0.001)\n", "outliers[outliers.sum(axis=1)>0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that with the stricter threshold the false positives are gone.\n", "\n", "However, if we reduce the detection threshold too far, we might no longer detect even the modified data points." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For further details about the ``OutlierDetector`` see the [corresponding section](../02_objectives/06_outlier_detector.ipynb) in the core documentation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the [next section](./02-06-influence_estimation.ipynb) we will have a look at influence estimation." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }