{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Detecting outliers with the OutlierDetector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Halerium objective class `OutlierDetector` can be used to detect outliers, i.e. events that are uncommon or unexpected for a given model.\n",
    "\n",
    "By default, an event is considered an outlier if its probability is among the lowest 5% of the probabilities found in a large sample of random events. Optionally, the user can set that outlier threshold to any value between 0 (no events are considered outliers) and 1 (all but the most likely events are considered outliers)."
   ]
  },
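  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the threshold concrete, here is a minimal sketch of such a rank-based rule in plain NumPy. It is not Halerium's implementation: the standard normal stand-in model and the names `reference`, `log_density`, and `ranks` are assumptions made for illustration only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Assumption: a standard normal stands in for the model (not Halerium code).\n",
    "rng = np.random.default_rng(0)\n",
    "reference = rng.normal(size=100_000)  # large sample of random events\n",
    "\n",
    "def log_density(values):\n",
    "    # Log of the standard normal probability density.\n",
    "    return -0.5 * values**2 - 0.5 * np.log(2 * np.pi)\n",
    "\n",
    "events = np.array([0.0, 1.0, 5.0])\n",
    "\n",
    "# Rank of an event: fraction of reference events that are less likely than it.\n",
    "ranks = (log_density(reference)[None, :] < log_density(events)[:, None]).mean(axis=1)\n",
    "\n",
    "# An event is flagged if its rank falls below the threshold (5% here).\n",
    "is_outlier = ranks < 0.05\n",
    "print(ranks, is_outlier)"
   ]
  },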
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Ingredients"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To detect outliers, you only need a graph and data, and optionally an outlier threshold."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# for handling data:\n",
    "import numpy as np\n",
    "\n",
    "# for plotting:\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# for some Halerium specific functions:\n",
    "import halerium.core as hal\n",
    "\n",
    "# for building graphs:\n",
    "from halerium.core import Graph, Entity, Variable, show\n",
    "\n",
    "# for detecting outliers:\n",
    "from halerium import OutlierDetector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The graph and input data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us define a simple graph:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "g = Graph(\"g\")\n",
    "with g:\n",
    "    Variable(\"z\", mean=0, variance=1)\n",
    "    Variable(\"x\", mean=z, variance=0.1)\n",
    "    Variable(\"y\", mean=z, variance=0.1)\n",
    "\n",
    "show(g)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This graph just contains the three variables `x`, `y`, and `z`.\n",
    "The means of `x` and `y` are set to `z`, and the variances of `x` and `y` are much smaller than that of `z`.\n",
    "Thus, both `x` and `y` depend on the variable `z` and follow it fairly closely."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's look at a certain set of data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = {g.x: [0,  1,  1, 5,      1],\n",
    "        g.y: [0,  1, -1, 5, np.nan]}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Missing values for specific events are indicated by `np.nan`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The OutlierDetector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We instantiate an `OutlierDetector`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "outlier_detector = OutlierDetector(graph=g,\n",
    "                                   data=data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now call the `outlier_detector` with the graph `g` as argument to estimate which of the events in the data are outliers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([False, False,  True,  True, False])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "outlier_detector(g)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the events where `x` and `y` both have the same value 0 or 1, and the event with just `x = 1` (where `y` is missing), are not considered outliers. In contrast, the event with `x = 1` and `y = -1` is considered an outlier, as is the event with `x = 5` and `y = 5`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One can call the `outlier_detector` with any scopetor of the graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([False, False,  True,  True, False])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "outlier_detector(g.x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The outlier detector then only looks at the values of the variables in that scopetor and judges their likelihood in the context of the provided data. In this case, `g.x` is flagged as an outlier in exactly those events in which the whole graph is flagged as well."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Results for multiple graph elements can be obtained using nested structures, e.g. dicts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'x': array([False, False,  True,  True, False]),\n",
       " 'y': array([False, False,  True,  True, False]),\n",
       " 'z': array([False, False, False,  True, False])}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "outlier_detector({'x': g.x, 'y': g.y, 'z': g.z})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we can see that in the event with `x = 1` and `y = -1`, `z` is not flagged as an outlier while `x` and `y` are. This is because `z` is estimated to be `~0`, which is a common value."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calling without an argument yields results for the graph, its subgraphs, entities, and variables as a dict."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'g': array([False, False,  True,  True, False]),\n",
       " 'g/z': array([False, False, False,  True, False]),\n",
       " 'g/x': array([False, False,  True,  True, False]),\n",
       " 'g/y': array([False, False,  True,  True, False])}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "outlier_detector()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Options"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Calculation method"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, missing values are estimated by sampling (`method=\"upsampled\"`). Alternatively, they can be marginalized over (`method=\"marginalized\"`)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Warning: the marginalized method can be very memory-intensive.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "outlier_detector_default = OutlierDetector(graph=g,\n",
    "                                           data=data,\n",
    "                                           method=\"upsampled\")\n",
    "\n",
    "outlier_detector_marginalized = OutlierDetector(graph=g,\n",
    "                                                data=data,\n",
    "                                                method=\"marginalized\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In our example, where there are no data for `g.z`, the difference lies in the outlier detection for the individual variables `g.x` and `g.y`. With the default method \"upsampled\", their values are coupled, so `g.x` is flagged as an outlier in the event with `g.x = +1` and `g.y = -1`. This is because `g.z` is estimated to be around `0` in that case, which makes `+1` an unlikely value for `g.x`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the \"marginalized\" method, the individual values are not coupled. After marginalizing over `z`, a value of `+1` is not at all uncommon. Therefore, that entry is not flagged as an outlier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([False, False,  True,  True, False])"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "outlier_detector_default(g.x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([False, False, False,  True, False])"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "outlier_detector_marginalized(g.x)"
   ]
  },
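  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough cross-check of the marginalized result, the marginal distribution of `x` can be worked out by hand for this graph: marginalizing over `z` gives a normal distribution with mean 0 and variance 1 + 0.1 = 1.1. The sketch below is not Halerium code; it uses only the Python standard library, and the helper name `two_sided_tail` is hypothetical. It computes how much probability mass lies further from the mean than a given value and compares that with the default 5% threshold. Only `x = 5` should end up below the threshold, consistent with the marginalized output above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "# Marginal of x after integrating out z: variance(z) + variance(x given z).\n",
    "marginal_std = math.sqrt(1.0 + 0.1)\n",
    "\n",
    "def two_sided_tail(value):\n",
    "    # Probability that a draw from the marginal lies further from the mean 0\n",
    "    # than value does, i.e. that it is less likely than value.\n",
    "    return math.erfc(abs(value) / (marginal_std * math.sqrt(2.0)))\n",
    "\n",
    "for value in (0.0, 1.0, 5.0):\n",
    "    tail = two_sided_tail(value)\n",
    "    # Flag the value if less than 5% of the probability mass is less likely.\n",
    "    print(value, tail, tail < 0.05)"
   ]
  },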
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Outlier threshold"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, an event is considered an outlier if its probability is among the lowest 5% of the probabilities found in a large sample of random events. That 5% threshold can be adjusted to other values between 0 and 1 via `outlier_threshold`. The higher the threshold, the more events are classified as outliers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'g': array([False,  True,  True,  True,  True]),\n",
       " 'g/z': array([False,  True, False,  True,  True]),\n",
       " 'g/x': array([ True,  True,  True,  True,  True]),\n",
       " 'g/y': array([ True,  True,  True,  True,  True])}"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "outlier_detector = OutlierDetector(graph=g,\n",
    "                                   data=data,\n",
    "                                   outlier_threshold=0.7)\n",
    "outlier_detector()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Accuracy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The accuracy of the `OutlierDetector` depends on the number of examples (`n_samples`) used for estimating probabilities and ranks.\n",
    "\n",
    "Small numbers of examples yield quick results at the expense of accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'g': array([False, False,  True,  True, False]),\n",
       " 'g/z': array([False, False, False,  True, False]),\n",
       " 'g/x': array([False, False,  True,  True, False]),\n",
       " 'g/y': array([False, False,  True,  True, False])}"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "outlier_detector = OutlierDetector(graph=g,\n",
    "                                   data=data,\n",
    "                                   n_samples=100)\n",
    "outlier_detector()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many examples yield higher accuracy at the expense of speed (in the current implementation, the computing time increases quadratically with the number of examples)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'g': array([False, False,  True,  True, False]),\n",
       " 'g/z': array([False, False, False,  True, False]),\n",
       " 'g/x': array([False, False,  True,  True, False]),\n",
       " 'g/y': array([False, False,  True,  True, False])}"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "outlier_detector = OutlierDetector(graph=g,\n",
    "                                   data=data,\n",
    "                                   n_samples=1000)\n",
    "outlier_detector()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}