For many processes, the exact functional relation between input variables, i.e. the “features”, and output variables, i.e. the “targets”, is not known. In such cases, assuming a linear or low-order polynomial relation between the features and the targets may be a viable approach. The coefficients of the polynomial may then be learned from data. This approach is known as linear or polynomial regression.


We need to import the following packages, classes, and functions.

# for handling data:
import numpy as np
import pandas as pd

# for plotting:
import matplotlib.pyplot as plt

import halerium.core as hal

# for graphs:
from halerium.core import Graph, Entity, Variable, StaticVariable
from halerium.core.regression import linear_regression, polynomial_regression, connect_via_regression

# for models:
from halerium.core import DataLinker, get_data_linker
from halerium.core.model import MAPModel, ForwardModel, Trainer
from halerium.core.model import get_posterior_model

# for predictions:
from halerium import Predictor

# for analysing graphs:
from halerium.core.utilities.print import print_child_tree

Example data

Consider a process with one feature “x” and one target “y”.

We first generate some data for this process:

n_data = 100

x_data = np.random.normal(loc=1, scale=2, size=(n_data,)) + 5
y_data = np.random.normal(loc=0, scale=9, size=(n_data,)) - 12 + 4 * x_data + 3 * x_data**2

data = pd.DataFrame()
data["x"] = x_data
data["y"] = y_data

data.plot.scatter("x", "y");
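
Because the data are drawn at random, the numbers printed in the rest of this tutorial change from run to run. The following is a minimal sketch of a reproducible variant. It assumes NumPy's `default_rng` generator and an arbitrarily chosen seed; the helper `make_data` is hypothetical and not part of the original notebook.

```python
import numpy as np

def make_data(n_data=100, seed=0):
    """Draw (x, y) from the same process as above, using a fixed seed."""
    rng = np.random.default_rng(seed)  # the seed value is an arbitrary choice
    x = rng.normal(loc=1, scale=2, size=n_data) + 5
    noise = rng.normal(loc=0, scale=9, size=n_data)
    y = noise - 12 + 4 * x + 3 * x**2
    return x, y

# Two calls with the same seed yield identical samples.
x_a, y_a = make_data()
x_b, y_b = make_data()
print(np.array_equal(x_a, x_b) and np.array_equal(y_a, y_b))
```

With a fixed seed, the summary statistics and the inferred parameters stay the same between runs, which makes it easier to compare results.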


We can extract simple statistical properties such as the mean and standard deviation:


x_data_mean = data['x'].mean()
x_data_std = data['x'].std()

y_data_mean = data['y'].mean()
y_data_std = data['y'].std()

data.describe()

                x           y
count  100.000000  100.000000
mean     6.078618  134.753586
std      1.825458   77.458972
min      1.470327  -18.618306
25%      4.843912   78.938936
50%      6.003939  119.147073
75%      7.227908  184.538179
max     10.549351  367.353488

In the following, we assume we do not know the exact functional relation that generated that data.

Linear regression model by hand

Before we discuss how to quickly build a regression model using convenience functions, we create a linear regression model ‘by hand’.

To build the linear model, we create a Graph containing the variable “x” representing the feature and the variable “y” representing the target. We then add model parameters “slope” and “intercept” to our graph, and use them to connect the feature and target variable:

graph = Graph("graph")
with graph:
    x = Variable("x", shape=(), mean=x_data_mean, variance=x_data_std**2)
    y = Variable("y", shape=(), variance=y_data_std**2)

    slope = StaticVariable("slope", mean=0, variance=1e4)
    intercept = StaticVariable("intercept", mean=0, variance=1e4)

    y.mean = slope * x + intercept

We now train a MAPModel with the data.

model = MAPModel(graph=graph,
                 data={graph.x: data["x"], graph.y: data["y"]})
inferred_slope = model.get_means(graph.slope)
inferred_intercept = model.get_means(graph.intercept)

print("inferred slope =",inferred_slope)
print("inferred intercept =",inferred_intercept)
inferred slope = 40.32779298524231
inferred intercept = -109.72532491801661
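
As an independent cross-check, the same kind of estimate can be sketched with plain NumPy. For Gaussian noise with fixed variance and zero-mean Gaussian priors on the parameters, the MAP estimate has a closed form (ridge regression). The sketch below is not how Halerium computes its result; it regenerates the data with an arbitrary seed, so the numbers will differ somewhat from those printed above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
x = rng.normal(loc=1, scale=2, size=100) + 5
y = rng.normal(loc=0, scale=9, size=100) - 12 + 4 * x + 3 * x**2

# Design matrix with columns [x, 1], so that y ~ slope * x + intercept.
X = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares (no prior).
ols_slope, ols_intercept = np.linalg.lstsq(X, y, rcond=None)[0]

# MAP estimate with zero-mean Gaussian priors: solve
# (X^T X + lam * I) beta = X^T y with lam = noise_var / prior_var.
noise_var = y.var(ddof=1)  # mirrors variance=y_data_std**2 in the graph
prior_var = 1e4            # mirrors variance=1e4 on slope and intercept
lam = noise_var / prior_var
map_slope, map_intercept = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print("OLS:", ols_slope, ols_intercept)
print("MAP:", map_slope, map_intercept)
```

The priors pull the MAP estimate slightly towards zero relative to the least-squares solution. How strongly depends on the ratio of the noise variance to the prior variance.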

We extract the posterior graph from the trained model and build a ForwardModel with it to compute predictions.

posterior_graph = model.get_posterior_graph("posterior_graph")

prediction_model = ForwardModel(graph=posterior_graph,
                                data={posterior_graph.x: data["x"]})

y_linreg_prediction = prediction_model.get_means(posterior_graph.y)

data["y_linreg"] = y_linreg_prediction

ax = data.plot.scatter("x", "y", color='blue', label="true");
ax = data.plot.scatter("x", "y_linreg", color='red', label="lin.reg.pred.", ax=ax);

The plot shows that our linear regression model correctly predicts the trend seen in the true data for values of x between 3 and 9. For smaller or larger values, the predictions are significantly off due to the curvature in the true relation between x and y. To also capture that curvature in the data, we need to go beyond a linear model.
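
The curvature can also be made visible numerically by looking at the residuals of a straight-line fit. The sketch below uses an ordinary least-squares fit via `np.polyfit` as a stand-in for the Halerium model, on data regenerated with an arbitrary seed. Splitting the points at one standard deviation from the mean of x is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
x = rng.normal(loc=1, scale=2, size=100) + 5
y = rng.normal(loc=0, scale=9, size=100) - 12 + 4 * x + 3 * x**2

# Straight-line least-squares fit and its residuals.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Split the points into the central region and the two tails of x.
outer = np.abs(x - x.mean()) > x.std()
mean_outer = residuals[outer].mean()
mean_inner = residuals[~outer].mean()

print("mean residual, |x - mean| >  std:", mean_outer)
print("mean residual, |x - mean| <= std:", mean_inner)
```

If the true relation were linear, both averages would scatter around zero. A clearly positive average in the tails and a negative one in the centre indicate that a quadratic term is missing.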

Regression model using convenience functions

A linear regression model with just one scalar feature and one scalar target can be quickly built in the manner described in the previous section. However, building a regression model using beyond-linear polynomials, multi-dimensional features and targets, and/or multiple features and targets that way can become very involved very quickly. To facilitate building more complex regression models, we can use the convenience function connect_via_regression.

For example, this creates a graph for a quadratic regression model:

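graph = Graph("graph")
with graph:
    x = Variable("x", shape=(), mean=x_data_mean, variance=x_data_std**2)
    y = Variable("y", shape=(), variance=y_data_std**2)

    # call reconstructed from the argument descriptions in "More on connect_via_regression"
    connect_via_regression(
        "regression", inputs=x, outputs=y, order=2,
        inputs_location=x_data_mean, inputs_scale=x_data_std,
        outputs_location=y_data_mean, outputs_scale=y_data_std)

result_shape = ()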

Using a Trainer and a Predictor, we can compute the predictions from our quadratic model and compare them to the linear predictions:

trainer = Trainer(graph=graph, data={graph.x: data["x"], graph.y: data["y"]})
predictor = Predictor(graph=trainer(), data={graph.x: data["x"]})

y_prediction = predictor(graph.y)

data["y_quadreg"] = y_prediction

ax = data.plot.scatter("x", "y", color='blue', label="true");
ax = data.plot.scatter("x", "y_linreg", color='red', label="lin.reg.pred.", ax=ax);
ax = data.plot.scatter("x", "y_quadreg", color='green', label="quad.reg.pred.", ax=ax);
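
The improvement of the quadratic over the linear model can also be quantified rather than judged by eye. The following sketch uses least-squares polynomial fits from NumPy as stand-ins for the two Halerium models, again on data regenerated with an arbitrary seed. The helper `rmse` is introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
x = rng.normal(loc=1, scale=2, size=100) + 5
y = rng.normal(loc=0, scale=9, size=100) - 12 + 4 * x + 3 * x**2

lin_coeffs = np.polyfit(x, y, deg=1)
quad_coeffs = np.polyfit(x, y, deg=2)  # ordered as [x**2, x, constant]

def rmse(coeffs):
    """Root-mean-square error of a polynomial fit on the training data."""
    return np.sqrt(np.mean((y - np.polyval(coeffs, x)) ** 2))

print("quadratic coefficient:", quad_coeffs[0])
print("RMSE linear:   ", rmse(lin_coeffs))
print("RMSE quadratic:", rmse(quad_coeffs))
```

The fitted quadratic coefficient should lie close to the value 3 used to generate the data, and the in-sample error of the quadratic fit cannot exceed that of the linear fit, since the linear model is a special case of the quadratic one.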

More on connect_via_regression

The first argument name_prefix is used to name the entities in the graph holding the regression parameters:

│ └─location
│   ├─slope
│   └─intercept

Printing the graph’s children reveals that besides regression parameters for the mean of y in regression_location_y, there are also regression parameters for the variance of y in regression_log_scale_y. Thus, the mean of y as well as the residual scatter of y is learned as a quadratic function of x.
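
To illustrate what learning both the mean and the scatter as functions of x means, here is a minimal sketch of a Gaussian likelihood in which the location and the logarithm of the scale are each a polynomial in x. The parameterization (scale = exp of a polynomial) is an assumption suggested by the name regression_log_scale_y, and the function `gaussian_nll` and the coefficient values are hypothetical; Halerium's internals may differ.

```python
import numpy as np

def gaussian_nll(x, y, loc_coeffs, log_scale_coeffs):
    """Mean negative log-likelihood of y under N(loc(x), scale(x)**2).

    Both loc(x) and log(scale(x)) are polynomials in x, with coefficients
    ordered from the highest power down (the np.polyval convention).
    """
    loc = np.polyval(loc_coeffs, x)
    log_scale = np.polyval(log_scale_coeffs, x)
    z = (y - loc) / np.exp(log_scale)
    return np.mean(0.5 * z**2 + log_scale + 0.5 * np.log(2 * np.pi))

x = np.array([0.0, 1.0, 2.0])
y = np.array([-12.0, -5.0, 8.0])  # exactly -12 + 4*x + 3*x**2, no noise

# Hypothetical coefficients: the exact location and a constant scale of 1.
nll = gaussian_nll(x, y, loc_coeffs=[3.0, 4.0, -12.0],
                   log_scale_coeffs=[0.0, 0.0, 0.0])
# Same location, but a larger constant scale of e.
nll_wide = gaussian_nll(x, y, loc_coeffs=[3.0, 4.0, -12.0],
                        log_scale_coeffs=[0.0, 0.0, 1.0])
print(nll, nll_wide)
```

Modelling the logarithm of the scale keeps the scale positive for any coefficient values, which is one common reason for this kind of parameterization.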

The argument inputs specifies the input variables for the regression. This can be either a single variable or a list of variables.

The argument outputs specifies the output variables of the regression, either as a single variable, or a list of variables.

The argument order specifies the order of the regression, i.e. the highest power of the input variables in the regression polynomial. For example, order=1 yields linear regression.

The argument include_cross_terms specifies whether to include cross terms when order>1. Cross terms are disabled by default, since including them can significantly increase the number of model parameters and thereby make the model hard to train without overfitting.
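
How quickly the number of polynomial terms grows can be sketched by simple counting. The sketch assumes that "without cross terms" means a constant plus the pure powers of each input, and that "with cross terms" means all monomials up to the given total degree. The exact parameter count in Halerium may differ, for example because parameters exist per output and for the scale as well.

```python
from itertools import combinations_with_replacement
from math import comb

def n_terms_without_cross_terms(n_inputs, order):
    """Constant term plus the pure powers x_i**1 ... x_i**order of each input."""
    return 1 + n_inputs * order

def n_terms_with_cross_terms(n_inputs, order):
    """All monomials in n_inputs variables with total degree <= order."""
    return sum(
        1
        for degree in range(order + 1)
        for _ in combinations_with_replacement(range(n_inputs), degree)
    )

for n_inputs in (1, 3, 10):
    print(n_inputs,
          n_terms_without_cross_terms(n_inputs, order=3),
          n_terms_with_cross_terms(n_inputs, order=3))
```

For a single input the two counts coincide. For three inputs at order 3 they are 10 versus 20, and for ten inputs 31 versus 286.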

The arguments inputs_location=x_data_mean, inputs_scale=x_data_std, outputs_location=y_data_mean, and outputs_scale=y_data_std allow one to directly include scaling of the data for standardization in the regression. Standardization of the data would otherwise be a required step of data preparation. Here, one just needs to provide the empirical location and scale parameters of the data.
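
The effect of such location and scale parameters can be sketched with a manual standardization around a least-squares fit. This is an illustration in plain NumPy on data regenerated with an arbitrary seed, not a description of Halerium's implementation. The helper `predict` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
x = rng.normal(loc=1, scale=2, size=100) + 5
y = rng.normal(loc=0, scale=9, size=100) - 12 + 4 * x + 3 * x**2

# Empirical location and scale (ddof=1 matches the pandas default for std).
x_loc, x_scale = x.mean(), x.std(ddof=1)
y_loc, y_scale = y.mean(), y.std(ddof=1)

# Fit a quadratic in standardized coordinates.
coeffs_std = np.polyfit((x - x_loc) / x_scale, (y - y_loc) / y_scale, deg=2)

def predict(x_new):
    """Standardize the input, evaluate the polynomial, undo the output scaling."""
    z = (np.asarray(x_new) - x_loc) / x_scale
    return y_loc + y_scale * np.polyval(coeffs_std, z)

# A fit on the raw data gives the same predictions.
coeffs_raw = np.polyfit(x, y, deg=2)
print(np.allclose(predict(x), np.polyval(coeffs_raw, x)))
```

For an unregularized least-squares fit, standardization does not change the predictions. It does change the size of the coefficients, which matters as soon as priors of a fixed width are placed on them, as in the models above.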

More on these and a number of further arguments can be found in the documentation for connect_via_regression.
