The CausalStructure class#

Aliases#

halerium.CausalStructure
halerium.core.CausalStructure
halerium.core.causal_structure.CausalStructure
class CausalStructure(dependencies, graph_name='graph', internal_name_prefix='var_')#

The causal structure class.

A causal structure is created from a set of dependencies and allows convenient creation, training, and evaluation of Graphs.

The dependencies are provided to the class as a Dependencies instance, a single Dependency instance, or an iterable containing Dependency instances or any of the following:

  • list-like elements -

Lists or tuples of length 2 in which the first entry contains the feature(s) and the second entry the target(s).

  • dict-like elements -

Dicts of length 2 with the features as the value of a key named “feature”, “features”, “input”, “inputs”, “cause”, or “causes” and the target as the value of a key named “target”, “targets”, “output”, “outputs”, “effect”, or “effects”.
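
The two element forms above can be illustrated with a small normalizer. This is only a sketch of the described convention in plain Python, not halerium's implementation; the helper names `as_list` and `normalize_dependency` are hypothetical.

```python
# Accepted key names, as listed in the description above.
FEATURE_KEYS = ("feature", "features", "input", "inputs", "cause", "causes")
TARGET_KEYS = ("target", "targets", "output", "outputs", "effect", "effects")


def as_list(value):
    """Wrap a single name in a list; copy any other iterable into a list."""
    return [value] if isinstance(value, str) else list(value)


def normalize_dependency(element):
    """Return (features, targets) for one list-like or dict-like element."""
    if isinstance(element, dict):
        if len(element) != 2:
            raise ValueError("a dict-like dependency must have exactly 2 keys")
        feature_keys = [k for k in element if k in FEATURE_KEYS]
        target_keys = [k for k in element if k in TARGET_KEYS]
        if len(feature_keys) != 1 or len(target_keys) != 1:
            raise ValueError("expected one feature key and one target key")
        return as_list(element[feature_keys[0]]), as_list(element[target_keys[0]])
    # List-like form: unpacking fails unless the element has length 2.
    features, targets = element
    return as_list(features), as_list(targets)


deps = [
    ["a", ["b", "c"]],
    {"features": ["b", "c"], "target": "d"},
]
pairs = [normalize_dependency(d) for d in deps]
```

Both elements end up in the same (features, targets) shape, which is why the two forms can be mixed freely in one iterable.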

Parameters:
  • dependencies (Dependencies, Iterable) – The dependencies that define the causal structure. Either a Dependencies instance, a single Dependency instance, or a list-like iterable containing Dependency instances or list-like or dict-like elements that are valid __init__ arguments for the Dependency class.

  • graph_name (str, optional) – The name to be given to the constructed graph. Must be a valid python variable name. The default is “graph”.

  • internal_name_prefix (str, optional) – The prefix which is prepended to the variable names constructed from the strings contained in dependencies. Must be a valid python variable name. The default is “var_”.
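
Both graph_name and internal_name_prefix must be valid Python variable names. A quick way to check a candidate value before constructing the class is sketched below. The helper `is_valid_variable_name` is hypothetical, and rejecting reserved keywords is an assumption; halerium's own validation may differ.

```python
import keyword


def is_valid_variable_name(name):
    """Check that a string can be used as a Python variable name.

    Assumption: reserved keywords such as "class" are treated as invalid.
    """
    return (
        isinstance(name, str)
        and name.isidentifier()
        and not keyword.iskeyword(name)
    )


checks = {
    name: is_valid_variable_name(name)
    for name in ("graph", "var_", "my graph", "1graph", "class")
}
```

Note that the dependency names themselves are not subject to this restriction: the examples on this page use names such as "%XZRB 12", which are not valid identifiers.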

Raises:

CyclicDependencyError – If the dependencies are not acyclic an exception is raised.
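
To see what "acyclic" means for a set of dependencies, the check can be sketched with Kahn's algorithm on (features, targets) pairs. This is an illustration in plain Python, not the check halerium performs internally; `is_acyclic` is a hypothetical helper.

```python
from collections import deque


def is_acyclic(pairs):
    """Return True if the feature -> target edges form a directed acyclic graph.

    `pairs` is an iterable of (features, targets), each a list of names.
    """
    children = {}
    nodes = set()
    for features, targets in pairs:
        nodes.update(features)
        nodes.update(targets)
        for feature in features:
            # A set avoids counting a repeated edge twice.
            children.setdefault(feature, set()).update(targets)

    indegree = {node: 0 for node in nodes}
    for targets in children.values():
        for target in targets:
            indegree[target] += 1

    # Kahn's algorithm: repeatedly remove nodes that have no incoming edges.
    queue = deque(node for node in nodes if indegree[node] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for target in children.get(node, ()):
            indegree[target] -= 1
            if indegree[target] == 0:
                queue.append(target)

    # Nodes that could never be removed lie on (or behind) a cycle.
    return removed == len(nodes)


chain_ok = is_acyclic([(["a"], ["b", "c"]), (["b", "c"], ["d"])])
loop_ok = is_acyclic([(["a"], ["b"]), (["b"], ["a"])])
```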

Examples

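>>> from halerium import CausalStructure
>>> # Dependencies as list-like [features, targets] pairs.
>>> deps = [
...     ["a", ["b", "c"]],
...     [["b", "c"], "d"],
... ]
>>> c = CausalStructure(deps)
>>> # List-like and dict-like elements can be mixed.
>>> deps = [
...     ["a", ["b", "c"]],
...     {"features": ["b", "c"], "target": "d"},
... ]
>>> c = CausalStructure(deps)
>>> # Names need not be valid Python identifiers.
>>> deps = [
...     ["Max Mustermann", ["Erika Musterfrau", "%XZRB 12"]],
...     [["Erika Musterfrau", "%XZRB 12"], "Odysseus von T||a"],
... ]
>>> c = CausalStructure(deps)
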
assert_is_external_name(name)#

Assert is external name.

Check that a name is a known external name.

Parameters:

name (str) – The name to check.

Returns:

name – The name to check.

Return type:

str

Raises:

ValueError – If not a known external name.

build_graph(method='regression', scaling_data=None, variances='learn', **kwargs)#

Build a halerium Graph instance that follows the causal structure.

The causal dependencies are mathematically connected as specified by the arguments.

Parameters:
  • method ("regression" or "gaussian_process", optional) – The method to build the mathematical dependencies. The default is “regression”.

  • scaling_data (pandas.DataFrame or dict, optional) – Data which are used to estimate the location and scale of each parameter contained in the dependencies. If no scaling data are provided the locations are assumed to be 0 and the scales to be 1. The default is None.

  • variances ("learn" or float, optional) – If “learn”, the variances are learned. If a float is given, it is taken as the value of the variances relative to the parameter scales. The default is “learn”.

  • **kwargs – Other keyword arguments are passed to the regression.connect_via_regression or to the regression.connect_via_gaussian_process function respectively.
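
The role of scaling_data can be pictured as follows. The sketch assumes that "location" and "scale" correspond to the mean and standard deviation of each column, which this page does not state explicitly; the helpers `location_and_scale` and `standardize` are hypothetical and not part of halerium.

```python
from statistics import fmean, pstdev


def location_and_scale(scaling_data, name):
    """Estimate (location, scale) for one parameter.

    Assumption: location is the mean and scale is the standard deviation.
    Without scaling data the documented fallback of 0 and 1 is used.
    """
    if scaling_data is None:
        return 0.0, 1.0
    values = list(scaling_data[name])
    return fmean(values), pstdev(values)


def standardize(values, location, scale):
    """Shift and rescale values so they are roughly centred with unit scale."""
    return [(value - location) / scale for value in values]


scaling_data = {"a": [2.0, 4.0, 6.0]}
loc, scale = location_and_scale(scaling_data, "a")
standardized = standardize(scaling_data["a"], loc, scale)
```

Providing scaling data matters most when parameters live on very different numerical ranges, since the fallback treats every parameter as centred at 0 with scale 1.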

copy()#

Returns a copy of self. Scaling data are not copied; the copy shares a reference to the scaling data instead.

Returns:

copy_of_self – The copied instance.

Return type:

CausalStructure

detect_outliers(data, compiler=None, n_samples=1000, outlier_threshold=0.05, method='upsampled', fetches=None)#

Detect outliers.

The rank is estimated for each data point individually and compared against a threshold. If the rank falls below the threshold, the event is classified as an outlier. The rank of a data point is defined as the fraction of a random sample expected to have the same or lower probability than the data point in question. So rank values range between 0 and 1.
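
The rank definition above can be made concrete with a one-dimensional toy model. The sketch below uses a standard normal distribution in place of a trained graph and plain Monte Carlo sampling; it only illustrates the definition and the thresholding, and does not reproduce how halerium estimates ranks. The names `log_density` and `estimate_rank` are hypothetical.

```python
import math
import random


def log_density(x):
    """Log-density of a standard normal distribution (the toy model)."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def estimate_rank(x, n_samples=1000, rng=None):
    """Fraction of random samples with the same or lower probability than x."""
    if rng is None:
        rng = random.Random(0)
    point_logp = log_density(x)
    sample_logps = [log_density(rng.gauss(0.0, 1.0)) for _ in range(n_samples)]
    return sum(logp <= point_logp for logp in sample_logps) / n_samples


rng = random.Random(0)
outlier_threshold = 0.05
ranks = {x: estimate_rank(x, rng=rng) for x in (0.0, 1.0, 4.0)}
# A point is classified as an outlier if its rank falls below the threshold.
outliers = {x: rank < outlier_threshold for x, rank in ranks.items()}
```

The point at the mode of the distribution gets the highest possible rank, while a point four standard deviations away gets a rank near 0 and is flagged.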

Parameters:
  • data (dict, halerium.core.DataLinker) – The data in which to identify outliers. Either a dictionary with variables as keys and data arrays as values, or a DataLinker holding links to the variables in the graph.

  • compiler (halerium.core.compiler.compiler_base.CompilerBase) – The backend compiler to be used. The default is the Tensorflow compiler.

  • n_samples (int, optional) – The amount of samples to be used to estimate the probabilities. The default is 1000.

  • outlier_threshold (float, optional) – The rank value below which an event is classified as an outlier. The default is 0.05.

  • method (str, optional) – The method with which the probability is estimated. Either “marginalized” or “upsampled”. “marginalized” marginalizes the missing values, so that the result represents a probability density only over the variables that have data. “upsampled” samples the missing values, so that the result represents a probability density over all variables. The “marginalized” method is slower and more memory intensive. The default is “upsampled”.

  • fetches (str, tuple, list, or dict of str, optional) – The names of the variables for which to detect outliers.

Returns:

outliers – Whether the events are outliers.

Return type:

pandas.DataFrame

dump_dict()#

Convert the causal structure to a specification dict. A specification dict is a dictionary that contains only JSON-compatible standard types (list, dict, string, float) and fully specifies all properties of the causal structure. This includes the causal structure's graph and trained_graph if they are available.

Returns:

specification – A dictionary containing the specifications for self.

Return type:

dict

dump_file(fp)#

Serialize the causal_structure into a JSON formatted stream to fp.

Parameters:

fp (filename or file-like object) – If a string is provided a file with that name is created. Otherwise a file-like object (supporting .write()) is expected to which the serialized specification is written.

Return type:

None.
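
The "filename or file-like object" convention for fp can be sketched with the standard json module. The function below is a stand-alone illustration operating on a plain dict, not the method itself, and the contents of `spec` are hypothetical rather than halerium's actual specification format.

```python
import io
import json


def dump_file(specification, fp):
    """Write a JSON-compatible dict to a file name or a file-like object."""
    if isinstance(fp, str):
        # A string is interpreted as the name of the file to create.
        with open(fp, "w") as handle:
            json.dump(specification, handle)
    else:
        # Anything else is expected to support .write().
        json.dump(specification, fp)


# Hypothetical specification contents, for illustration only.
spec = {"dependencies": [["a", ["b", "c"]]], "graph_name": "graph"}

buffer = io.StringIO()
dump_file(spec, buffer)
restored = json.loads(buffer.getvalue())
```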

dump_string()#

Serialize the causal_structure into a JSON formatted str.

Returns:

serialized_spec – The serialized JSON formatted str containing the specifications for self.

Return type:

str

ensure_graph()#

Ensure graph is built.

If the graph is not built yet, build it with the default arguments of the build_graph method.

estimate_influences(target, compiler=None, n_samples=1000, fetches=None)#

Estimate influences.

This method calculates the amount by which variables in the causal structure influence the value of a target.

Parameters:
  • target (str) – The target variable name for which the influences are to be assessed.

  • compiler (halerium.core.compiler.compiler_base.CompilerBase) – The compiler compiling the model. The default is the TFCompiler.

  • n_samples (int, optional) – The amount of samples that are generated to estimate influence strengths. The default is 1000.

  • fetches (str, dict, list or tuple of str, optional) – The variable names for which to estimate their influences on the target.

Returns:

influences – The influence strengths.

Return type:

pandas.Series

estimate_probabilities(data, compiler=None, n_samples=1000, method='marginalized', fetches=None)#

Estimate probabilities.

Estimate the log-probability of a set of data points for the causal structure. The probability is estimated for each data point individually.

Parameters:
  • data (dict, halerium.core.DataLinker) – The data for which to estimate probabilities. Either a dictionary with variables as keys and data arrays as values, or a DataLinker holding links to the variables in the graph.

  • compiler (halerium.core.compiler.compiler_base.CompilerBase, optional) – The backend compiler to be used. The default is the Tensorflow compiler.

  • n_samples (int, optional) – The amount of samples to be used to estimate the probabilities. The default is 1000.

  • method (str, optional) – The method with which the probability is estimated. Either “upsampled” or “marginalized”. “upsampled” samples the missing values, so that the result represents a probability density over all variables. “marginalized” marginalizes the missing values, so that the result represents a probability density only over the variables that have data. The default is “marginalized”.

  • fetches (str, dict, list or tuple of str, optional) – The names of the variables to estimate probabilities for. If no fetches are provided, estimates are returned for all variables in the causal structure.

Returns:

log_probabilities – The log-probabilities for the fetches.

Return type:

pandas.DataFrame

estimate_ranks(data, compiler=None, n_samples=1000, method='upsampled', fetches=None)#

Estimate ranks.

Estimate the ranks of a set of data points for the causal structure. The rank is estimated for each data point individually. The rank of a data point here is defined as the fraction of a random sample expected to have the same or lower probability than the data point in question. So rank values range between 0 and 1. Very unlikely events have a rank close to 0. The most likely events have a rank close to 1.

Parameters:
  • data (dict, halerium.core.DataLinker) – The data for which to estimate ranks. Either a dictionary with variables as keys and data arrays as values, or a DataLinker holding links to the variables in the graph.

  • compiler (halerium.core.compiler.compiler_base.CompilerBase) – The backend compiler to be used. The default is the Tensorflow compiler.

  • n_samples (int, optional) – The amount of samples to be used to estimate the probabilities. The default is 1000.

  • method (str, optional) – The method with which the probability is estimated. Either “marginalized” or “upsampled”. “marginalized” marginalizes the missing values, so that the result represents a probability density only over the variables that have data. “upsampled” samples the missing values, so that the result represents a probability density over all variables. The “marginalized” method is slower and more memory intensive. The default is “upsampled”.

  • fetches (str, dict, list or tuple of str, optional) – The names of the variables to estimate ranks for. If no fetches are provided, all variables in the causal structure are used.

Returns:

ranks – The ranks computed.

Return type:

pandas.DataFrame

evaluate(data, inputs, outputs=None, method='MAPFisher', compiler=None, model_args=None, solver_args=None, metric='r2', n_samples=100, fetches=None)#

Evaluate the prediction accuracy of the causal structure.

Parameters:
  • data (dict, pandas.DataFrame) – The data for the evaluation. The evaluator uses the data either as input for computing predictions or as true data against which the prediction outputs are compared to compute the scores. Either a pandas.DataFrame or a dictionary with variable names as keys and data arrays as values.

  • inputs (Iterable) – The names of variables in the causal structure to be considered as input (i.e. known features with data) for the predictions. Any variable in the inputs must also be in the data (though non-finite values in the data are permitted), and cannot be in the outputs.

  • outputs (Iterable, optional) – The names of variables in the causal structure to be considered as output (i.e. targets) for the predictions. The prediction outputs are then compared to the data to compute the scores. Any variable in the outputs must also be in the data (though non-finite values in the data are permitted), and cannot be in the inputs. The default is None, in which case the complement of inputs within the data is used as outputs.

  • method ({'Forward', 'MAPFisher', 'MAP', 'ADVI', or 'MGVI'}, optional) – The solving method. According to the chosen method, either a ForwardModel, MAPFisherModel, MAPModel, ADVIModel, or MGVIModel instance is used for predictions. See the corresponding model class for further information. The default is ‘MAPFisher’.

  • compiler (halerium.core.compiler.compiler_base.CompilerBase, optional) – Which compiler to use to create the numerical arrays of the model. If None, a TFCompiler instance is created. The default is None.

  • model_args (dict, optional) – Model arguments that depend on the specified method. See the corresponding model classes for further information.

  • solver_args (dict, optional) – The arguments to pass to the model’s solver function. The default is None.

  • metric (str, callable, tuple, list, dict of str or callable, optional) – The evaluation metric to use. For a str, recognized values are: “mae” : mean absolute error, “mse” : mean square error, “rmse” : root mean square error, “r_mae” : relative mean absolute error, “r_mse” : relative mean square error, “r_rmse” : relative root mean square error, “r2” : coefficient of determination (r2 score). If a callable, it should take two numpy.ndarray and a callable as arguments, first the true values, second the predicted values, third the reduction function to apply to the component-wise result. If a tuple, list, or dictionary, its values should be a recognized str or callable.

  • n_samples (int, optional) – The number of examples to compute predictions. The default is 100.

  • fetches (str, tuple, list, or dict of str, optional) – The names of variables to compute the metrics for. Note that metrics are only computed for variables for which true data has been provided that is not also input data.
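
A few of the metrics named above, together with a custom callable following the (true values, predicted values, reduction) signature, can be sketched in plain Python. The formulas are the textbook definitions and may differ in detail from halerium's implementations; with halerium the first two arguments would be numpy arrays rather than lists, and the names `absolute_error` and `r2_score` are hypothetical.

```python
import math
from statistics import fmean


def absolute_error(y_true, y_pred, reduction):
    """Custom metric: reduce the component-wise absolute errors."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return reduction(errors)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

mae = absolute_error(y_true, y_pred, fmean)  # mean absolute error
mse = fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])  # mean square error
rmse = math.sqrt(mse)  # root mean square error
r2 = r2_score(y_true, y_pred)
```

Passing the reduction as an argument lets the same callable produce a mean, a median, or a maximum error without being rewritten.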

Returns:

scores – The scores for the graph elements.

evaluate_objective(objective_class, data=None, fetches=None, **kwargs)#

Evaluate objective.

Evaluate an Objective class using the trained graph of the causal structure.

Parameters:
  • objective_class (halerium.Objective) – The objective class to be evaluated

  • data (pandas.DataFrame or dict, optional) – The data used to evaluate the objective. Either None, a pandas.DataFrame with variable names as columns, or a dictionary with variable names as keys and data arrays as values. The default is None.

  • fetches – The fetches the objective is applied to. Either None (in which case all variables will be used), or a name or names of variables (in a list, dict, or similar) in the causal structure. The default is None.

  • **kwargs – The keyword arguments passed to the objective class.

Returns:

result – The results of the objective evaluation.

classmethod from_specification(specification=None, file=None)#

Create a causal_structure instance from a (serialized) specification or from a file containing a serialized specification. Optionally, the top-level name can be replaced.

Parameters:
  • specification (dict or str, optional) – Can be either a specification dict or a serialized specification dict in a JSON formatted string or file-like object supporting .read().

  • file (file-handle or file name, optional) – Can be either a file-like object supporting .read() or a file path to a file containing a JSON formatted string of a specification. Will be ignored if specification is provided.

Returns:

A causal structure instance.

Return type:

CausalStructure
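
The accepted argument forms and the precedence of specification over file can be sketched as a small resolver that always ends up with a specification dict. It is an illustration built on the standard json module, not the classmethod itself; `resolve_specification` is a hypothetical helper and the example spec contents are made up.

```python
import io
import json


def resolve_specification(specification=None, file=None):
    """Return a specification dict from any of the documented input forms."""
    if specification is not None:
        # `file` is ignored whenever a specification is provided.
        if isinstance(specification, dict):
            return specification
        if isinstance(specification, str):
            return json.loads(specification)
        return json.load(specification)  # file-like object supporting .read()
    if file is None:
        raise ValueError("either specification or file must be provided")
    if isinstance(file, str):
        with open(file) as handle:
            return json.load(handle)
    return json.load(file)


spec = {"graph_name": "graph"}  # hypothetical contents
from_dict = resolve_specification(spec)
from_string = resolve_specification(json.dumps(spec))
from_stream = resolve_specification(file=io.StringIO(json.dumps(spec)))
both_given = resolve_specification({"graph_name": "a"}, file=io.StringIO("{}"))
```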

get_data_linker(data)#

Get data linker.

Get a DataLinker instance linking the provided data to the corresponding variables in the Graph.

get_graph()#

Get graph.

Get the built graph. If the graph was not built yet, it is built with the default arguments of the build_graph method.

get_trained_graph()#

Get trained graph.

Get the trained Graph, if it exists.

Raises:

RuntimeError – If the Graph has not been trained yet an Error is raised.

predict(data=None, method='MAPFisher', compiler=None, model_args=None, solver_args=None, n_samples=100, return_std=False, fetches=None)#

Predict.

Make a prediction using the trained graph of the causal structure.

Parameters:
  • data (pandas.DataFrame or dict) – The input data for the predictions.

  • method ({'Forward', 'MAPFisher', 'MAP', 'ADVI', or 'MGVI'}, optional) – The solving method. According to the chosen method, either a ForwardModel, MAPFisherModel, MAPModel, ADVIModel, or MGVIModel instance is created. See the corresponding model class for further information. The default is ‘MAPFisher’.

  • compiler (halerium.core.compiler.compiler_base.CompilerBase, optional) – Which compiler to use to create the numerical arrays of the model. If None, a TFCompiler instance is created. The default is None.

  • model_args (dict, optional) – Model arguments that depend on the specified method. See the corresponding model classes for further information.

  • solver_args (dict, optional) – The arguments to pass to the model’s solver function. The default is None.

  • n_samples (int, optional) – The number of examples to compute. The default is 100.

  • return_std (bool, optional) – Whether to return the standard deviation in addition to the mean. The default is False.

  • fetches – The names of variables to compute predictions for. The default is None.

Returns:

  • predicted_mean – The predicted mean values.

  • results_std – The predicted standard deviations. Only returned if return_std is True.

predict_interventions(data=None, interventions=None, method='MAPFisher', compiler=None, model_args=None, solver_args=None, n_samples=100, return_std=False, fetches=None)#

Predict interventions.

Make predictions for interventions using the trained graph of the causal structure.

Parameters:
  • data (pandas.DataFrame or dict, optional) – The input data for the predictions. The default is None.

  • interventions (pandas.DataFrame or dict, optional) – The intervention data for the predictions. Note that values in interventions override values in data. The default is None.

  • method ({'Forward', 'MAPFisher', 'MAP', 'ADVI', or 'MGVI'}, optional) – The solving method. According to the chosen method, either a ForwardModel, MAPFisherModel, MAPModel, ADVIModel, or MGVIModel instance is created. See the corresponding model class for further information. The default is ‘MAPFisher’.

  • compiler (halerium.core.compiler.compiler_base.CompilerBase, optional) – Which compiler to use to create the numerical arrays of the model. If None, a TFCompiler instance is created. The default is None.

  • model_args (dict, optional) – Model arguments that depend on the specified method. See the corresponding model classes for further information.

  • solver_args (dict, optional) – The arguments to pass to the model’s solver function. The default is None.

  • n_samples (int, optional) – The number of examples to compute. The default is 100.

  • return_std (bool, optional) – Whether to return the standard deviation in addition to the mean. The default is False.

  • fetches – The names of variables to compute intervention predictions for. The default is None.

Returns:

  • predicted_mean – The predicted mean values.

  • results_std – The predicted standard deviations. Only returned if return_std is True.
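
The statement that values in interventions override values in data can be pictured as a dictionary merge. The sketch below illustrates only this precedence of values; how halerium treats intervened variables inside the graph is not covered by it, and `apply_interventions` is a hypothetical helper.

```python
def apply_interventions(data=None, interventions=None):
    """Combine data and interventions; intervention values take precedence."""
    merged = dict(data or {})
    merged.update(interventions or {})
    return merged


data = {"a": [1.0, 2.0], "b": [0.5, 0.7]}
interventions = {"b": [0.0, 0.0]}
merged = apply_interventions(data, interventions)
```

Here "b" takes its values from interventions while "a" is kept from data, and the original data dictionary is left unchanged.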

reset_graph(reset=True)#

Reset graph.

Resets the built graph. This is important if e.g. the scaling data of distribution classes have been changed.

If reset=False, an already built graph is not discarded and an exception is raised.

Parameters:

reset (bool, optional) – Whether to actually delete an already built graph. The default is True.

property scaling_data#

Data which are used to estimate the location and scale of each parameter contained in the dependencies. Without scaling data the locations are assumed to be 0 and the scales to be 1. Scaling data are not saved if the CausalStructure is saved as a dict, string, or file.

set_distribution_class(name, distribution_class)#

Set distribution class.

Set the distribution class for a variable in the causal structure.

Parameters:
  • name (str) – The variable name for which the distribution class is changed. Must be a string from the dependencies.

  • distribution_class (type, str) – A subclass of DistributionBase.

show_distribution_classes()#

Show distribution classes.

Show the distribution classes of all variables in the causal structure.

Returns:

The keys are strings from the dependencies. The values are distribution classes.

Return type:

dict

to_internal(names)#

Returns the internal graph variables corresponding to the given parameter names.

Parameters:

names (str, dict or list) – The parameter names for which the variables are to be returned.

Returns:

variables – The variable instances corresponding to the strings in names.

Return type:

Variable, dict or list

train(data, update_trained_graph=True, **kwargs)#

Train the causal model’s Graph with the training data.

Parameters:
  • data (pandas.DataFrame or dict) – The training data.

  • update_trained_graph (bool) – Whether to update the existing trained graph. If False, training starts from a fresh graph. The default is True.

  • **kwargs – The keyword arguments to be passed to the Trainer.