{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Detecting outliers with the OutlierDetector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Halerium objective class `OutlierDetector` can be used to detect outliers, i.e. events that are uncommon or unexpected for a given model.\n", "\n", "By default, an event is considered an outlier if its probability ranks among the lowest 5% of a large sample of random events. Optionally, the user can set that outlier threshold to any value between 0 (no events are considered outliers) and 1 (all but the most likely events are considered outliers)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ingredients" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To detect outliers, you only need a graph and data, and optionally an outlier threshold." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# for handling data:\n", "import numpy as np\n", "\n", "# for plotting:\n", "import matplotlib.pyplot as plt\n", "\n", "# for some Halerium specific functions:\n", "import halerium.core as hal\n", "\n", "# for building graphs:\n", "from halerium.core import Graph, Entity, Variable, show\n", "\n", "# for detecting outliers:\n", "from halerium import OutlierDetector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The graph and input data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us define a simple graph:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "g = Graph(\"g\")\n", "with g:\n", "    Variable(\"z\", mean=0, variance=1)\n", "    Variable(\"x\", mean=z, variance=0.1)\n", "    Variable(\"y\", mean=z, variance=0.1)\n", "\n", "show(g)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This graph just contains the three variables `x`, `y`, and `z`.\n", "The means of `x` 
and `y` are set to `z`, and the variances of `x` and `y` are much smaller than that of `z`.\n", "Thus, both `x` and `y` depend on the variable `z` and follow it fairly closely." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's look at a small data set with five events." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "data = {g.x: [0, 1, 1, 5, 1],\n", "        g.y: [0, 1, -1, 5, np.nan]}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing values for specific events are indicated by `np.nan`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The OutlierDetector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We instantiate an `OutlierDetector`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "outlier_detector = OutlierDetector(graph=g,\n", "                                   data=data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now call the `outlier_detector` with the graph `g` as argument to estimate which of the events in the data are outliers." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([False, False, True, True, False])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outlier_detector(g)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the events in which `x` and `y` share the same value 0 or 1, and the event with just `x = 1` (where `y` is missing), are not considered outliers. In contrast, the event with `x = 1` and `y = -1` is considered an outlier, as is the event with `x = 5` and `y = 5`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One can call the `outlier_detector` with any scopetor of the graph, such as the variable `g.x`." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([False, False, True, True, False])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outlier_detector(g.x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The outlier detector then only looks at the values of the variables in that scopetor and judges their likelihood in the context of the provided data. In this case, `g.x` is flagged as an outlier in exactly those events in which the whole graph is flagged as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Results for multiple graph elements can be obtained using nested structures, e.g. dicts." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'x': array([False, False, True, True, False]),\n", " 'y': array([False, False, True, True, False]),\n", " 'z': array([False, False, False, True, False])}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outlier_detector({'x': g.x, 'y': g.y, 'z': g.z})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we can see that in the event with `x = 1` and `y = -1`, `z` is not flagged as an outlier while `x` and `y` are. This is because `z` is estimated to be `~0`, which is a common value." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling without an argument yields results for the graph, its subgraphs, entities, and variables as a dict." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'g': array([False, False, True, True, False]),\n", " 'g/z': array([False, False, False, True, False]),\n", " 'g/x': array([False, False, True, True, False]),\n", " 'g/y': array([False, False, True, True, False])}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outlier_detector()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Options" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calculation method" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, missing data will be estimated by sampling (`method=\"upsampled\"`). Alternatively, the missing values can be marginalized over (`method=\"marginalized\"`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Warning: the marginalized method can be very memory-consuming.*" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "outlier_detector_default = OutlierDetector(graph=g,\n", "                                           data=data,\n", "                                           method=\"upsampled\")\n", "\n", "outlier_detector_marginalized = OutlierDetector(graph=g,\n", "                                                data=data,\n", "                                                method=\"marginalized\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the case of our example, where there are no data for `g.z`, the difference lies in the outlier detection for the individual variables `g.x` and `g.y`. With the default method \"upsampled\", their values are coupled, so `g.x` was flagged as an outlier in the event where `g.y` was `-1` and `g.x` was `+1`. This is because `g.z` was estimated to be around `0` for that event." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the \"marginalized\" method, the individual values are not coupled. After marginalizing over `z`, a value of `+1` for `g.x` is not at all uncommon. Therefore, that entry is not flagged as an outlier." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([False, False, True, True, False])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outlier_detector_default(g.x)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([False, False, False, True, False])" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outlier_detector_marginalized(g.x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Outlier threshold" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, an event is considered an outlier if its probability ranks among the lowest 5% of a large sample of random events. That 5% threshold can be adjusted to other values between 0 and 1. The higher the threshold, the more events are classified as outliers." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'g': array([False, True, True, True, True]),\n", " 'g/z': array([False, True, False, True, True]),\n", " 'g/x': array([ True, True, True, True, True]),\n", " 'g/y': array([ True, True, True, True, True])}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outlier_detector = OutlierDetector(graph=g,\n", "                                   data=data,\n", "                                   outlier_threshold=0.7)\n", "outlier_detector()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accuracy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The accuracy of the `OutlierDetector` is affected by the number of examples (`n_samples`) used for estimating probabilities and ranks.\n", "\n", "Small numbers of examples yield quick results at the expense of accuracy." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'g': array([False, False, True, True, False]),\n", " 'g/z': array([False, False, False, True, False]),\n", " 'g/x': array([False, False, True, True, False]),\n", " 'g/y': array([False, False, True, True, False])}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outlier_detector = OutlierDetector(graph=g,\n", "                                   data=data,\n", "                                   n_samples=100)\n", "outlier_detector()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Large numbers of examples yield higher accuracy at the expense of speed (in the current implementation, the computing time increases quadratically with the number of examples)." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'g': array([False, False, True, True, False]),\n", " 'g/z': array([False, False, False, True, False]),\n", " 'g/x': array([False, False, True, True, False]),\n", " 'g/y': array([False, False, True, True, False])}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "outlier_detector = OutlierDetector(graph=g,\n", "                                   data=data,\n", "                                   n_samples=1000)\n", "outlier_detector()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" } }, "nbformat": 4, "nbformat_minor": 4 }