{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Causal Structures and Dependencies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In many cases the only model information available is the causal structure of the data generating process.\n",
"\n",
"This information can be as simple as \"a, b, and c are the inputs and d, e, and f are the outputs\", but it can also be more hierarchical, such as \"a influences b; b and c influence d\".\n",
"\n",
"This information can be conveniently stored in a `CausalStructure`, which represents a collection of dependencies."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dependencies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dependencies are the building blocks of a causal structure.\n",
"\n",
"A single `Dependency` expresses that an output or a group of outputs depends on an input or a group of inputs."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"from halerium.causal_structure import Dependency, Dependencies\n",
"from halerium import CausalStructure"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dependency(features={'c', 'a', 'b'}, targets={'e', 'd', 'f'})\n"
]
}
],
"source": [
"dep = Dependency(inputs={\"a\", \"b\", \"c\"}, outputs={\"d\", \"e\", \"f\"})\n",
"print(dep)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can instantiate a dependency in several ways: with positional arguments, with keyword arguments, or from a single list or dict. The following calls are all equivalent:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dependency(features={'a'}, targets={'b'})\n",
"Dependency(features={'a'}, targets={'b'})\n",
"Dependency(features={'a'}, targets={'b'})\n",
"Dependency(features={'a'}, targets={'b'})\n",
"Dependency(features={'a'}, targets={'b'})\n"
]
}
],
"source": [
"print(Dependency(\"a\", \"b\"))\n",
"print(Dependency([\"a\", \"b\"]))\n",
"print(Dependency([[\"a\"], [\"b\"]]))\n",
"print(Dependency(features=\"a\", targets=\"b\"))\n",
"print(Dependency({\"features\": \"a\", \"targets\": \"b\"}))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A dependency must be acyclic in the sense that the same string must not be in both features and targets."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CyclicDependencyError(\"Cyclic dependency detected for 'a'.\")\n"
]
}
],
"source": [
"try:\n",
" Dependency(\"a\", \"a\")\n",
"except Exception as exc:\n",
" print(repr(exc))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Multiple dependencies can be grouped in the `Dependencies` object."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Dependencies([[{'a'}, 'b'],\n",
" [{'b'}, 'c'],\n",
" [{'c', 'b'}, 'd']])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Dependencies([\n",
" Dependency(\"a\", \"b\"),\n",
" Dependency(\"b\", \"c\"),\n",
" Dependency([\"b\", \"c\"], \"d\"),\n",
"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When a `Dependencies` object is created, the collection as a whole is checked for cycles, including cycles that span several dependencies."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CyclicDependencyError(\"Cyclic dependency detected for {'b'}.\")\n"
]
}
],
"source": [
"dependency_list = [\n",
" Dependency(\"a\", \"b\"),\n",
" Dependency(\"b\", \"c\"),\n",
" Dependency(\"c\", \"a\"),\n",
"]\n",
"\n",
"try:\n",
" Dependencies(dependency_list)\n",
"except Exception as exc:\n",
" print(repr(exc))"
]
},
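{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of what such a check involves, the sketch below searches a list of `(features, targets)` pairs for a cycle with a depth-first search in plain Python. It is not Halerium's implementation; the function name `find_cycle` and the pair format are assumptions made for this example.\n",
"\n",
"```python\n",
"def find_cycle(pairs):\n",
"    # Return a list of names that form a cycle, or None if the pairs are acyclic.\n",
"    # Build an adjacency map: feature -> set of targets it influences.\n",
"    graph = {}\n",
"    for features, targets in pairs:\n",
"        for feature in features:\n",
"            graph.setdefault(feature, set()).update(targets)\n",
"        for target in targets:\n",
"            graph.setdefault(target, set())\n",
"\n",
"    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished\n",
"    color = dict.fromkeys(graph, WHITE)\n",
"\n",
"    def visit(node, path):\n",
"        color[node] = GRAY\n",
"        path.append(node)\n",
"        for child in sorted(graph[node]):\n",
"            if color[child] == GRAY:\n",
"                # The child is already on the current path, so we closed a loop.\n",
"                return path[path.index(child):] + [child]\n",
"            if color[child] == WHITE:\n",
"                cycle = visit(child, path)\n",
"                if cycle:\n",
"                    return cycle\n",
"        path.pop()\n",
"        color[node] = BLACK\n",
"        return None\n",
"\n",
"    for start in sorted(graph):\n",
"        if color[start] == WHITE:\n",
"            cycle = visit(start, [])\n",
"            if cycle:\n",
"                return cycle\n",
"    return None\n",
"\n",
"\n",
"acyclic = [({'a'}, {'b'}), ({'b'}, {'c'}), ({'b', 'c'}, {'d'})]\n",
"cyclic = [({'a'}, {'b'}), ({'b'}, {'c'}), ({'c'}, {'a'})]\n",
"print(find_cycle(acyclic))\n",
"print(find_cycle(cyclic))\n",
"```\n",
"\n",
"The first call finds no cycle, while the second reports the loop that leads from `a` through `b` and `c` back to `a`."
]
},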
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `Dependencies` object can also be instantiated directly from nested lists:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Dependencies([[{'c', 'a', 'b'}, 'e'],\n",
" [{'c', 'a', 'b'}, 'd'],\n",
" [{'c'}, 'f']])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Dependencies([[{'a', 'b', 'c'}, {'d', 'e'}],\n",
" ['c', 'f']])"
]
},
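{
"cell_type": "markdown",
"metadata": {},
"source": [
"Judging from the representation above, an entry with several targets is stored as one entry per target. The following plain-Python sketch mimics that expansion. The helper names `expand` and `as_set` are hypothetical and only serve this illustration; they are not part of Halerium.\n",
"\n",
"```python\n",
"def as_set(item):\n",
"    # Treat a single name as a one-element set; copy any other iterable into a set.\n",
"    return {item} if isinstance(item, str) else set(item)\n",
"\n",
"\n",
"def expand(entries):\n",
"    # Turn [features, targets] entries into one (features, target) pair per target.\n",
"    pairs = []\n",
"    for features, targets in entries:\n",
"        for target in sorted(as_set(targets)):\n",
"            pairs.append((as_set(features), target))\n",
"    return pairs\n",
"\n",
"\n",
"for features, target in expand([[{'a', 'b', 'c'}, {'d', 'e'}], ['c', 'f']]):\n",
"    print(sorted(features), '->', target)\n",
"```\n",
"\n",
"The two nested entries expand into three single-target pairs, matching the three rows shown in the output above (up to ordering)."
]
},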
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### CausalStructure"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `CausalStructure` class provides the interface between dependencies, pandas data frames, and low-level Halerium objects such as `Graph` and `Variable`.\n",
"\n",
"The causal structure builds a graph that respects the dependencies and expresses them mathematically via regression. For every element in the dependencies, a `Variable` with an internal name is created. The `CausalStructure` instance lets the user train and evaluate the underlying `Graph` without having to deal with these internal details."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Basic Usage"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The most important methods of the `CausalStructure` class are `train`, `predict` and `evaluate_objective`. Let's go through a minimal example."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ideally, the data are provided as a pandas `DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
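"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>a</th>\n",
"      <th>b</th>\n",
"      <th>c</th>\n",
"      <th>d</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>0.0416837</td>\n",
"      <td>1.07189</td>\n",
"      <td>0.577627</td>\n",
"      <td>0.00173753</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>-0.0564439</td>\n",
"      <td>-0.379637</td>\n",
"      <td>-0.246262</td>\n",
"      <td>0.00318592</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>-1.06954</td>\n",
"      <td>0.0600041</td>\n",
"      <td>-1.03953</td>\n",
"      <td>1.14391</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>0.526029</td>\n",
"      <td>0.253401</td>\n",
"      <td>0.65273</td>\n",
"      <td>0.276706</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>0.153498</td>\n",
"      <td>0.285647</td>\n",
"      <td>0.296322</td>\n",
"      <td>0.0235616</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],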
"text/plain": [
" a b c d\n",
"0 0.0416837 1.07189 0.577627 0.00173753\n",
"1 -0.0564439 -0.379637 -0.246262 0.00318592\n",
"2 -1.06954 0.0600041 -1.03953 1.14391\n",
"3 0.526029 0.253401 0.65273 0.276706\n",
"4 0.153498 0.285647 0.296322 0.0235616"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
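"# Toy data: two random inputs (a, b) and two outputs (c, d) derived from them.\n",
"data = pd.DataFrame(columns=[\"a\", \"b\", \"c\", \"d\"], index=range(5))\n",
"data[[\"a\", \"b\"]] = np.random.randn(5, 2)\n",
"data[\"c\"] = data[\"a\"] + 0.5 * data[\"b\"]  # linear in a and b\n",
"data[\"d\"] = data[\"a\"]**2  # nonlinear in a, independent of b\n",
"data"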
]
},
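{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above draws fresh random numbers on every run, so the values change whenever the notebook is re-executed. A possible reproducible variant is sketched below. It assumes NumPy 1.17 or newer for `np.random.default_rng`; the seed value `0` and the name `data_seeded` are arbitrary choices made for this example.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"rng = np.random.default_rng(0)  # fixed seed (arbitrary) for repeatable draws\n",
"\n",
"# Draw the two inputs, then derive the outputs with the same formulas as above.\n",
"data_seeded = pd.DataFrame(rng.standard_normal((5, 2)), columns=['a', 'b'])\n",
"data_seeded['c'] = data_seeded['a'] + 0.5 * data_seeded['b']\n",
"data_seeded['d'] = data_seeded['a'] ** 2\n",
"print(data_seeded.dtypes)\n",
"```\n",
"\n",
"Re-running the snippet with the same seed reproduces the same five rows. Building the frame directly from the array also gives float columns."
]
},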
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We instantiate the `CausalStructure` by providing the assumed dependency structure of the columns of our data frame.\n",
"In this case we say that columns \"a\" and \"b\" influence columns \"c\" and \"d\"."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"CausalStructure([[{'a', 'b'}, 'c'],\n",
" [{'a', 'b'}, 'd']])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cs = CausalStructure([[{\"a\", \"b\"}, {\"c\", \"d\"}]])\n",
"cs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We train our causal structure by simply executing:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"cs.train(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now get predictions from the underlying trained graph."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
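"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>a</th>\n",
"      <th>b</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>0.0416837</td>\n",
"      <td>1.07189</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>-0.0564439</td>\n",
"      <td>-0.379637</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>-1.06954</td>\n",
"      <td>0.0600041</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>0.526029</td>\n",
"      <td>0.253401</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>0.153498</td>\n",
"      <td>0.285647</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],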
"text/plain": [
" a b\n",
"0 0.0416837 1.07189\n",
"1 -0.0564439 -0.379637\n",
"2 -1.06954 0.0600041\n",
"3 0.526029 0.253401\n",
"4 0.153498 0.285647"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"input_data = data[[\"a\",\"b\"]]\n",
"input_data"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
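"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>c</th>\n",
"      <th>a</th>\n",
"      <th>b</th>\n",
"      <th>d</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>0.578952</td>\n",
"      <td>0.041684</td>\n",
"      <td>1.071887</td>\n",
"      <td>-0.011787</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>-0.234416</td>\n",
"      <td>-0.056444</td>\n",
"      <td>-0.379637</td>\n",
"      <td>0.018212</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>-1.018916</td>\n",
"      <td>-1.069536</td>\n",
"      <td>0.060004</td>\n",
"      <td>1.136911</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>0.616478</td>\n",
"      <td>0.526029</td>\n",
"      <td>0.253401</td>\n",
"      <td>0.257356</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>0.301972</td>\n",
"      <td>0.153498</td>\n",
"      <td>0.285647</td>\n",
"      <td>0.072325</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],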
"text/plain": [
" c a b d\n",
"0 0.578952 0.041684 1.071887 -0.011787\n",
"1 -0.234416 -0.056444 -0.379637 0.018212\n",
"2 -1.018916 -1.069536 0.060004 1.136911\n",
"3 0.616478 0.526029 0.253401 0.257356\n",
"4 0.301972 0.153498 0.285647 0.072325"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prediction = cs.predict(input_data)\n",
"prediction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The prediction always returns the values for all internal variables, not only the outputs."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}