Causal Structures and Dependencies
In many cases the only model information available is the causal structure of the data generating process.
This information can be as simple as “a, b, and c are the inputs and d, e, and f are the outputs”, but it can also be hierarchical, such as “a influences b; b and c influence d”.
This information can be conveniently stored in a CausalStructure, which represents a collection of dependencies.
Dependencies
Dependencies are the building blocks of a causal structure.
A single Dependency expresses that an output or a group of outputs depends on an input or a group of inputs.
[1]:
import pandas as pd
import numpy as np
from halerium.causal_structure import Dependency, Dependencies
from halerium import CausalStructure
[2]:
dep = Dependency(inputs={"a", "b", "c"}, outputs={"d", "e", "f"})
print(dep)
Dependency(features={'c', 'a', 'b'}, targets={'e', 'd', 'f'})
You can instantiate a dependency in several ways: with positional arguments, with keyword arguments, or with a single list or dict. The following calls are all equivalent:
[3]:
print(Dependency("a", "b"))
print(Dependency(["a", "b"]))
print(Dependency([["a"], ["b"]]))
print(Dependency(features="a", targets="b"))
print(Dependency({"features": "a", "targets": "b"}))
Dependency(features={'a'}, targets={'b'})
Dependency(features={'a'}, targets={'b'})
Dependency(features={'a'}, targets={'b'})
Dependency(features={'a'}, targets={'b'})
Dependency(features={'a'}, targets={'b'})
A dependency must be acyclic in the sense that the same string must not be in both features and targets.
[4]:
try:
    Dependency("a", "a")
except Exception as exc:
    print(repr(exc))
CyclicDependencyError("Cyclic dependency detected for 'a'.")
Multiple dependencies can be grouped in the Dependencies object.
[5]:
Dependencies([
    Dependency("a", "b"),
    Dependency("b", "c"),
    Dependency(["b", "c"], "d"),
])
[5]:
Dependencies([[{'a'}, 'b'],
[{'b'}, 'c'],
[{'c', 'b'}, 'd']])
When dependencies are grouped, they are also checked for cycles that span several of them.
[6]:
dependency_list = [
    Dependency("a", "b"),
    Dependency("b", "c"),
    Dependency("c", "a"),
]
try:
    Dependencies(dependency_list)
except Exception as exc:
    print(repr(exc))
CyclicDependencyError("Cyclic dependency detected for {'b'}.")
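How Halerium performs this check internally is not shown here. As a minimal sketch of the idea, assuming each dependency is reduced to a plain `(features, targets)` pair of sets, the standard library's `graphlib` can detect such a cycle. The helper name `find_cycle` is hypothetical and not part of Halerium.

```python
from graphlib import TopologicalSorter, CycleError


def find_cycle(dependencies):
    """Return the nodes of one cycle, or None if the dependencies are acyclic.

    `dependencies` is a list of (features, targets) pairs of sets of names.
    This is an illustrative stand-in, not Halerium's implementation.
    """
    sorter = TopologicalSorter()
    for features, targets in dependencies:
        for target in targets:
            # Every target can only be computed after all of its features.
            sorter.add(target, *features)
    try:
        sorter.prepare()  # raises CycleError if no valid ordering exists
    except CycleError as exc:
        return exc.args[1]  # list of nodes forming the detected cycle
    return None


cyclic = [({"a"}, {"b"}), ({"b"}, {"c"}), ({"c"}, {"a"})]
acyclic = [({"a"}, {"b"}), ({"b"}, {"c"}), ({"b", "c"}, {"d"})]

print(find_cycle(cyclic))
print(find_cycle(acyclic))
```

For the cyclic list the helper returns the names involved in the loop, with the first name repeated at the end; for the acyclic list it returns `None`.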
Dependencies can be instantiated directly from lists:
[7]:
Dependencies([[{'a', 'b', 'c'}, {'d', 'e'}],
['c', 'f']])
[7]:
Dependencies([[{'c', 'a', 'b'}, 'e'],
[{'c', 'a', 'b'}, 'd'],
[{'c'}, 'f']])
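The output above suggests that a grouped entry is split into one entry per target. A minimal sketch of that expansion in plain Python might look as follows; the function name `expand` and the sorted target order are assumptions made for illustration, not Halerium's actual code.

```python
def expand(dependencies):
    """Split grouped dependencies into one (features, target) pair per target."""
    expanded = []
    for features, targets in dependencies:
        # A bare string is one name, not a collection of characters.
        features = {features} if isinstance(features, str) else set(features)
        targets = {targets} if isinstance(targets, str) else set(targets)
        # Sort only to make the illustration deterministic.
        for target in sorted(targets):
            expanded.append((features, target))
    return expanded


grouped = [[{"a", "b", "c"}, {"d", "e"}], ["c", "f"]]
for features, target in expand(grouped):
    print(sorted(features), "->", target)
```

Each target ends up with its own feature set, which matches the one-row-per-target layout of the printed Dependencies.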
CausalStructure
The CausalStructure class provides the interface between dependencies, pandas data frames, and low-level Halerium objects such as Graph and Variable.
The causal structure builds a graph that respects the dependencies and expresses them mathematically via regression. For every element in the Dependencies, a Variable with an internal name is created. The CausalStructure instance lets the user train and evaluate the underlying Graph without caring about the internal details.
Basic Usage
The most important methods of the CausalStructure class are train, predict, and evaluate_objective. Let’s go through a minimal example.
Ideally, the data are provided as a pandas DataFrame.
[8]:
data = pd.DataFrame(columns=["a", "b", "c", "d"], index=range(5))
data[["a", "b"]] = np.random.randn(5,2)
data["c"] = data["a"] + 0.5 * data["b"]
data["d"] = data["a"]**2
data
[8]:
| | a | b | c | d |
|---|---|---|---|---|
| 0 | 0.0416837 | 1.07189 | 0.577627 | 0.00173753 |
| 1 | -0.0564439 | -0.379637 | -0.246262 | 0.00318592 |
| 2 | -1.06954 | 0.0600041 | -1.03953 | 1.14391 |
| 3 | 0.526029 | 0.253401 | 0.65273 | 0.276706 |
| 4 | 0.153498 | 0.285647 | 0.296322 | 0.0235616 |
We instantiate the CausalStructure by providing the assumed dependency structure of the columns of the data frame. In this case, we state that columns “a” and “b” influence columns “c” and “d”.
[9]:
cs = CausalStructure([[{"a", "b"}, {"c", "d"}]])
cs
[9]:
CausalStructure([[{'a', 'b'}, 'c'],
[{'a', 'b'}, 'd']])
We train the causal structure by simply executing:
[10]:
cs.train(data)
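What training does internally is not spelled out in this section. As a loose illustration of what it can mean to express a dependency via regression, the sketch below fits “c” from “a” and “b” with ordinary least squares in plain Python, using the rows of the example data frame. The linear model and the helper names `solve` and `fit_linear` are assumptions for this sketch; Halerium's actual model is evidently more flexible, since it also captures the quadratic dependence of “d” on “a”.

```python
# Rows (a, b, c) copied from the example data frame above.
rows = [
    (0.0416837, 1.07189, 0.577627),
    (-0.0564439, -0.379637, -0.246262),
    (-1.06954, 0.0600041, -1.03953),
    (0.526029, 0.253401, 0.65273),
    (0.153498, 0.285647, 0.296322),
]


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    solution = [0.0] * n
    for r in reversed(range(n)):
        known = sum(aug[r][k] * solution[k] for k in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution


def fit_linear(rows):
    """Least-squares fit of c = w_a * a + w_b * b + bias via the normal equations."""
    design = [(a, b, 1.0) for a, b, _ in rows]
    targets = [c for _, _, c in rows]
    gram = [[sum(x[i] * x[j] for x in design) for j in range(3)] for i in range(3)]
    moment = [sum(x[i] * t for x, t in zip(design, targets)) for i in range(3)]
    return solve(gram, moment)


w_a, w_b, bias = fit_linear(rows)
print(w_a, w_b, bias)
```

Because “c” was constructed as a + 0.5 * b, the fitted weights should come out close to 1 and 0.5, with a bias near zero.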
We can now get predictions from the underlying trained graph.
[11]:
input_data = data[["a","b"]]
input_data
[11]:
| | a | b |
|---|---|---|
| 0 | 0.0416837 | 1.07189 |
| 1 | -0.0564439 | -0.379637 |
| 2 | -1.06954 | 0.0600041 |
| 3 | 0.526029 | 0.253401 |
| 4 | 0.153498 | 0.285647 |
[12]:
prediction = cs.predict(input_data)
prediction
[12]:
| | c | a | b | d |
|---|---|---|---|---|
| 0 | 0.578952 | 0.041684 | 1.071887 | -0.011787 |
| 1 | -0.234416 | -0.056444 | -0.379637 | 0.018212 |
| 2 | -1.018916 | -1.069536 | 0.060004 | 1.136911 |
| 3 | 0.616478 | 0.526029 | 0.253401 | 0.257356 |
| 4 | 0.301972 | 0.153498 | 0.285647 | 0.072325 |
The prediction always returns values for all variables in the causal structure, here including the inputs “a” and “b”, not only the outputs.
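If only the predicted outputs are of interest, the relevant column names can be derived from the dependency structure itself. The following is a minimal sketch in plain Python, again assuming dependencies written as `(features, targets)` pairs of sets; `split_columns` is a hypothetical helper, not a Halerium function.

```python
def split_columns(dependencies):
    """Return (inputs, outputs): names that are never a target, and names that are."""
    features, targets = set(), set()
    for dep_features, dep_targets in dependencies:
        features |= set(dep_features)
        targets |= set(dep_targets)
    # Pure inputs appear as features but are never predicted themselves.
    return features - targets, targets


dependencies = [({"a", "b"}, {"c", "d"})]
inputs, outputs = split_columns(dependencies)
print(sorted(inputs), sorted(outputs))  # ['a', 'b'] ['c', 'd']
```

With the data frame returned by predict, standard pandas column selection such as `prediction[sorted(outputs)]` would then keep only the “c” and “d” columns.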