{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Rank Estimation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "# execute the creation & training notebook first\n", "%run \"02-01-creation_and_training.ipynb\"\n", "# execute the outlier detection notebook\n", "%run \"02-05-outlier_detection.ipynb\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the [outlier detection section](./02-05-outlier_detection.ipynb) we saw how to detect outliers in a test data set and how the outlier threshold influenced the detection results.\n", "\n", "In this section we take a look at the ``.estimate_ranks`` method. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The corresponding class ``RankEstimator`` is actually used by the ``OutlierDetector`` under the hood to estimate the typicality of individual data points. Let us apply the rank estimator to the modified test data set from the outlier detection example." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>(a)</th><th>(b|a)</th><th>(c|a,b)</th><th>graph</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0.547</td><td>0.267</td><td>0.742</td><td>0.660</td></tr>\n",
"    <tr><th>1</th><td>0.974</td><td>0.839</td><td>0.876</td><td>0.987</td></tr>\n",
"    <tr><th>2</th><td>0.452</td><td>0.873</td><td>0.757</td><td>0.877</td></tr>\n",
"    <tr><th>3</th><td>0.089</td><td>0.575</td><td>0.555</td><td>0.313</td></tr>\n",
"    <tr><th>4</th><td>0.880</td><td>0.891</td><td>0.869</td><td>0.991</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>95</th><td>0.171</td><td>0.840</td><td>0.540</td><td>0.506</td></tr>\n",
"    <tr><th>96</th><td>0.660</td><td>0.576</td><td>0.800</td><td>0.897</td></tr>\n",
"    <tr><th>97</th><td>0.684</td><td>0.761</td><td>0.850</td><td>0.946</td></tr>\n",
"    <tr><th>98</th><td>0.903</td><td>0.830</td><td>0.863</td><td>0.985</td></tr>\n",
"    <tr><th>99</th><td>0.879</td><td>0.369</td><td>0.879</td><td>0.853</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>100 rows × 4 columns</p>\n",
"</div>
" ], "text/plain": [ " (a) (b|a) (c|a,b) graph\n", "0 0.547 0.267 0.742 0.660\n", "1 0.974 0.839 0.876 0.987\n", "2 0.452 0.873 0.757 0.877\n", "3 0.089 0.575 0.555 0.313\n", "4 0.880 0.891 0.869 0.991\n", ".. ... ... ... ...\n", "95 0.171 0.840 0.540 0.506\n", "96 0.660 0.576 0.800 0.897\n", "97 0.684 0.761 0.850 0.946\n", "98 0.903 0.830 0.863 0.985\n", "99 0.879 0.369 0.879 0.853\n", "\n", "[100 rows x 4 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ranks = causal_structure.estimate_ranks(data=mod_test_data)\n", "ranks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the ranks are numbers between zero and one. These numbers describe the fraction of data points in a comparison data set with a lower probability that the data point in question. The comparison data set is generated from the trained graph under the hood.\n", "\n", "For example, a value of 0.8 means that 80% of the generated comparison data had a probability that was lower than the data point in question.\n", "\n", "This is the continuous analogon to the outlier detection. Consequently, we can apply the outlier treshold of 0.05 to the ranks." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>(a)</th><th>(b|a)</th><th>(c|a,b)</th><th>graph</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>25</th><td>0.814</td><td>0.030</td><td>0.794</td><td>0.173</td></tr>\n",
"    <tr><th>31</th><td>0.045</td><td>0.769</td><td>0.507</td><td>0.193</td></tr>\n",
"    <tr><th>50</th><td>0.366</td><td>0.000</td><td>0.000</td><td>0.000</td></tr>\n",
"    <tr><th>60</th><td>0.262</td><td>0.712</td><td>0.000</td><td>0.000</td></tr>\n",
"    <tr><th>74</th><td>0.009</td><td>0.523</td><td>0.403</td><td>0.045</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " (a) (b|a) (c|a,b) graph\n", "25 0.814 0.030 0.794 0.173\n", "31 0.045 0.769 0.507 0.193\n", "50 0.366 0.000 0.000 0.000\n", "60 0.262 0.712 0.000 0.000\n", "74 0.009 0.523 0.403 0.045" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ranks.loc[(ranks<=0.05).any(axis=1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We find pretty much the same entries as in the [outlier detection section](./02-05-outlier_detection.ipynb). Some ranks are close to the threshold and might sometimes lie above or below the threshold depending on sampling effects." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rank estimation tells us how unusual certain data points are. We see that the modified data points with indices 50 and 60 are judged to be very unlikely. None of the 1000 comparison data points had a lower probability than those.\n", "\n", "Let's examine the ranks of the data points 50 and 60 more closely." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ``.estimate_ranks`` method tells us that the values of '(a)' are not particularly unlikely in both of them.\n", "\n", "The value of '(b|a)' is judged to be common for data point 60, and next to impossible for data point 50.\n", "There is nothing particularly unusual about the value of '(b|a)' in data point 50 if we look at a 1-d histogram." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pl.hist(mod_test_data[\"(b|a)\"], density=True)\n", "pl.vlines(mod_test_data[\"(b|a)\"][[50,60]], [0,0], [0.15, 0.15],\n", " colors=\"r\", label=\"data points 50 and 60\", ls=\"--\")\n", "pl.xlabel(\"(b|a)\")\n", "pl.ylabel(\"frequency\")\n", "pl.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, given the values of '(a)' they certainly are unlikely as we can see from a 2-d scatter plot." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pl.scatter(mod_test_data[\"(a)\"], mod_test_data[\"(b|a)\"],\n", " marker=\"x\", color=\"k\", label=\"data points\")\n", "pl.scatter(mod_test_data[\"(a)\"].loc[[50]],\n", " mod_test_data[\"(b|a)\"].loc[[50]],\n", " marker=\"x\", color=\"r\", label=\"data point 50\")\n", "pl.xlabel(\"(a)\")\n", "pl.ylabel(\"(b|a)\")\n", "pl.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The unlikelyness of the combination of the values of '(a)' and '(b|a)' is attributed to '(b|a)' along the causal directions. If '(a)' causes '(b|a)' and the value of '(a)' is common by itself, then the unusual behavior must have occured in the generation of '(b|a)'." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For further details about the ``RankEstimator`` see the [corresponding section](../02_objectives/05_rank_estimator.ipynb) in the core-documentation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Internally the ``RankEstimator`` uses the ``ProbabilityEstimator`` which is explained in the [next section](./02-08-probability_estimation.ipynb)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }