Causal Structures and Dependencies

In many cases the only model information available is the causal structure of the data generating process.

This information can be as simple as “a, b, and c are the inputs and d, e, and f are the outputs”, or it can be more hierarchical, such as “a influences b; b and c influence d”.

This information can be conveniently stored in a CausalStructure, which represents a collection of dependencies.

Dependencies

Dependencies are the building blocks of a causal structure.

A single Dependency expresses that an output, or a group of outputs, depends on an input or a group of inputs.

[1]:

import pandas as pd
import numpy as np

from halerium.causal_structure import Dependency, Dependencies
from halerium import CausalStructure

[2]:

dep = Dependency(inputs={"a", "b", "c"}, outputs={"d", "e", "f"})
print(dep)

Dependency(features={'c', 'a', 'b'}, targets={'e', 'd', 'f'})

You can instantiate a dependency in several ways: with positional arguments, with keyword arguments, or from a single list or dict. The following calls are all equivalent:

[3]:

print(Dependency("a", "b"))
print(Dependency(["a", "b"]))
print(Dependency([["a"], ["b"]]))
print(Dependency(features="a", targets="b"))
print(Dependency({"features": "a", "targets": "b"}))

Dependency(features={'a'}, targets={'b'})
Dependency(features={'a'}, targets={'b'})
Dependency(features={'a'}, targets={'b'})
Dependency(features={'a'}, targets={'b'})
Dependency(features={'a'}, targets={'b'})

A dependency must be acyclic in the sense that the same string must not be in both features and targets.

[4]:

try:
    Dependency("a", "a")
except Exception as exc:
    print(repr(exc))

CyclicDependencyError("Cyclic dependency detected for 'a'.")

Multiple dependencies can be grouped in the Dependencies object.

[5]:

Dependencies([
    Dependency("a", "b"),
    Dependency("b", "c"),
    Dependency(["b", "c"], "d"),
])

[5]:

Dependencies([[{'a'}, 'b'],
              [{'b'}, 'c'],
              [{'c', 'b'}, 'd']])

A Dependencies object is also checked for cycles that span several of its members, not only for cycles within a single dependency.

[6]:

dependency_list = [
    Dependency("a", "b"),
    Dependency("b", "c"),
    Dependency("c", "a"),
]

try:
    Dependencies(dependency_list)
except Exception as exc:
    print(repr(exc))

CyclicDependencyError("Cyclic dependency detected for {'b'}.")
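The kind of check performed here can be illustrated without Halerium. The following is a minimal sketch, assuming each dependency is given as a plain (features, targets) pair of sets; the function name find_cycle is hypothetical, and this is not necessarily how Halerium implements its own check. It treats every feature-to-target link as a directed edge and runs a depth-first search for a path that returns to a node it has already passed through.

```python
from collections import defaultdict


def find_cycle(dependencies):
    """Return one cycle as a list of names, or None if there is no cycle.

    `dependencies` is an iterable of (features, targets) pairs of sets of
    strings, a plain-Python stand-in for a list of Dependency objects.
    """
    # Build a directed graph with one edge feature -> target per link.
    edges = defaultdict(set)
    for features, targets in dependencies:
        for feature in features:
            edges[feature].update(targets)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = defaultdict(int)
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for child in sorted(edges.get(node, ())):
            if color[child] == GREY:
                # The child is already on the current path: a cycle.
                return path[path.index(child):] + [child]
            if color[child] == WHITE:
                cycle = visit(child)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for start in sorted(edges):
        if color[start] == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None


# The cyclic example from above: a -> b -> c -> a.
print(find_cycle([({"a"}, {"b"}), ({"b"}, {"c"}), ({"c"}, {"a"})]))
# prints ['a', 'b', 'c', 'a']

# The acyclic example from the Dependencies cell further up.
print(find_cycle([({"a"}, {"b"}), ({"b"}, {"c"}), ({"b", "c"}, {"d"})]))
# prints None
```

The sketch reports the whole cycle, whereas the error message above names a single element of it.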

Dependencies can be instantiated directly from lists:

[7]:

Dependencies([[{'a', 'b', 'c'}, {'d', 'e'}],
              ['c', 'f']])

[7]:

Dependencies([[{'c', 'a', 'b'}, 'e'],
              [{'c', 'a', 'b'}, 'd'],
              [{'c'}, 'f']])
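The output above suggests that a dependency with several targets is stored as one entry per target, each with the full set of features. A minimal sketch of that expansion in plain Python follows; the helper name expand is hypothetical and the code only mirrors the printed result, not Halerium's internals.

```python
def expand(dependencies):
    """Split each (features, targets) pair into one pair per target."""
    expanded = []
    for features, targets in dependencies:
        # A bare string is a single name, not a sequence of characters.
        features = {features} if isinstance(features, str) else set(features)
        targets = [targets] if isinstance(targets, str) else sorted(targets)
        for target in targets:
            # Copy the feature set so that entries do not share state.
            expanded.append((set(features), target))
    return expanded


for features, target in expand([[{"a", "b", "c"}, {"d", "e"}], ["c", "f"]]):
    print(sorted(features), "->", target)
# prints:
# ['a', 'b', 'c'] -> d
# ['a', 'b', 'c'] -> e
# ['c'] -> f
```

The sketch sorts the targets to get a reproducible order. The order in the Halerium output above follows set iteration order and may differ.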

CausalStructure

The CausalStructure class provides the interface between dependencies, pandas data frames, and low-level Halerium objects such as Graph and Variable.

The causal structure will build a graph that respects the dependencies and expresses them mathematically via regression. For every element in the Dependencies a Variable with an internal name will be created. The CausalStructure instance allows the user to train and evaluate the underlying Graph without caring about the internal details.
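To give a feeling for what "expressing a dependency via regression" means, here is a deliberately simplified sketch in plain Python. It assumes a linear model without intercept and a single target with two features, and it solves the least-squares problem through the 2x2 normal equations. The function name fit_two_inputs and the toy data are hypothetical; the regression model that Halerium actually builds is not described here and may well be more flexible than this.

```python
def fit_two_inputs(a, b, y):
    """Least-squares fit of y ~ w_a * a + w_b * b (no intercept)."""
    # Sums that make up the 2x2 normal equations.
    s_aa = sum(x * x for x in a)
    s_bb = sum(x * x for x in b)
    s_ab = sum(x * z for x, z in zip(a, b))
    s_ay = sum(x * t for x, t in zip(a, y))
    s_by = sum(x * t for x, t in zip(b, y))

    det = s_aa * s_bb - s_ab * s_ab
    if det == 0:
        raise ValueError("inputs are collinear; the fit is not unique")

    # Cramer's rule for the two weights.
    w_a = (s_ay * s_bb - s_by * s_ab) / det
    w_b = (s_by * s_aa - s_ay * s_ab) / det
    return w_a, w_b


# Toy data for the dependency {a, b} -> c with c = a + 0.5 * b.
a = [0.0, 1.0, 2.0, -1.0]
b = [1.0, 0.0, -1.0, 2.0]
c = [x + 0.5 * z for x, z in zip(a, b)]

w_a, w_b = fit_two_inputs(a, b, c)
print(round(w_a, 6), round(w_b, 6))  # prints 1.0 0.5
```

Once the weights are known, a prediction for new inputs is simply w_a * a + w_b * b. A purely linear model like this one could not capture a relation such as d = a**2.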

Basic Usage

The most important methods of the CausalStructure class are train, predict and evaluate_objective. Let’s go through a minimal example.

Ideally, the data is provided as a pandas DataFrame.

[8]:

data = pd.DataFrame(columns=["a", "b", "c", "d"], index=range(5))
data[["a", "b"]] = np.random.randn(5,2)
data["c"] = data["a"] + 0.5 * data["b"]
data["d"] = data["a"]**2
data

[8]:

	a	b	c	d
0	0.0416837	1.07189	0.577627	0.00173753
1	-0.0564439	-0.379637	-0.246262	0.00318592
2	-1.06954	0.0600041	-1.03953	1.14391
3	0.526029	0.253401	0.65273	0.276706
4	0.153498	0.285647	0.296322	0.0235616

We instantiate the CausalStructure by providing the assumed dependency structure of the columns of the data frame. In this case, we say that columns “a” and “b” influence columns “c” and “d”.

[9]:

cs = CausalStructure([[{"a", "b"}, {"c", "d"}]])
cs

[9]:

CausalStructure([[{'a', 'b'}, 'c'],
                 [{'a', 'b'}, 'd']])

We train the causal structure by calling train with the data:

[10]:

cs.train(data)

We can now get predictions from the underlying trained graph.

[11]:

input_data = data[["a","b"]]
input_data

[11]:

	a	b
0	0.0416837	1.07189
1	-0.0564439	-0.379637
2	-1.06954	0.0600041
3	0.526029	0.253401
4	0.153498	0.285647

[12]:

prediction = cs.predict(input_data)
prediction

[12]:

	c	a	b	d
0	0.578952	0.041684	1.071887	-0.011787
1	-0.234416	-0.056444	-0.379637	0.018212
2	-1.018916	-1.069536	0.060004	1.136911
3	0.616478	0.526029	0.253401	0.257356
4	0.301972	0.153498	0.285647	0.072325

Note that predict always returns values for all variables in the causal structure, including the inputs a and b, not only the outputs.