Distributions#

Imports#

First let us import the required packages, classes, and functions.

[1]:

import numpy as np
import matplotlib.pyplot as plt

import halerium.core as hal

from halerium.core import Variable, Graph
from halerium.core.distribution import (
    BernoulliDistribution, DiracDistribution, CategoricalDistribution,
    ExponentialDistribution, LaplaceDistribution, LogNormalDistribution, NoDistribution,
    NormalDistribution, PoissonDistribution, UniformDistribution
)

from halerium.core import get_generative_model


By using this Application, You are agreeing to be bound by the terms and conditions of the Halerium End-User License Agreement that can be downloaded here: https://erium.de/halerium-eula.txt

Variables and distributions#

Halerium variables can follow various kinds of parametrized distributions.

For a variable, the class of the distribution that it follows is fixed upon creation of the variable instance. By default, variables have a normal distribution (which is likely the right choice in most cases):

[2]:

v = Variable("v")
type(v.distribution)

[2]:

halerium.core.distribution.normal_distribution.NormalDistribution

The distribution can also be explicitly stated when creating a variable.

This creates a log-normal distributed variable:

[3]:

v = Variable("v", distribution=LogNormalDistribution)
type(v.distribution)

[3]:

halerium.core.distribution.log_normal_distribution.LogNormalDistribution

A variable’s distribution and data type have to be compatible:

[4]:

for distribution in ("BernoulliDistribution", "CategoricalDistribution", "DiracDistribution",
                     "ExponentialDistribution", "LaplaceDistribution", "LogNormalDistribution",
                     "NormalDistribution", "PoissonDistribution", "UniformDistribution"):
    for dtype in ("bool", "float", "int"):
        try:
            v = Variable("v", distribution=distribution, dtype=dtype)
            print(f" {distribution} and {dtype} are compatible.")
        except Exception:
            print(f" {distribution} and {dtype} are not compatible.")

 BernoulliDistribution and bool are compatible.
 BernoulliDistribution and float are not compatible.
 BernoulliDistribution and int are not compatible.
 CategoricalDistribution and bool are not compatible.
 CategoricalDistribution and float are not compatible.
 CategoricalDistribution and int are not compatible.
 DiracDistribution and bool are compatible.
 DiracDistribution and float are compatible.
 DiracDistribution and int are compatible.
 ExponentialDistribution and bool are not compatible.
 ExponentialDistribution and float are compatible.
 ExponentialDistribution and int are not compatible.
 LaplaceDistribution and bool are not compatible.
 LaplaceDistribution and float are compatible.
 LaplaceDistribution and int are not compatible.
 LogNormalDistribution and bool are not compatible.
 LogNormalDistribution and float are compatible.
 LogNormalDistribution and int are not compatible.
 NormalDistribution and bool are not compatible.
 NormalDistribution and float are compatible.
 NormalDistribution and int are compatible.
 PoissonDistribution and bool are not compatible.
 PoissonDistribution and float are not compatible.
 PoissonDistribution and int are compatible.
 UniformDistribution and bool are not compatible.
 UniformDistribution and float are compatible.
 UniformDistribution and int are not compatible.

Note that CategoricalDistribution is reported as incompatible with every data type in this loop. This is most likely because the loop does not pass the mandatory n_categories argument, so creating the variable fails regardless of the data type. As described in the section on the categorical distribution, a categorical variable has data type ‘int’.

While the variable’s distribution class has to be decided at variable creation, the variable’s distribution parameters can be set either at variable creation, e.g.

[5]:

v = Variable("v", distribution=NormalDistribution, mean=0, variance=1)

or at a later point:

[6]:

v = Variable("v", distribution=NormalDistribution)
v.mean = 0
v.variance = 1

However, the distribution parameters need to be set eventually (unless a variable is fully determined by data). Creating a model from a graph containing a variable with a missing distribution parameter may raise an exception:

[7]:

with Graph("g") as g:
    Variable("v", distribution=NormalDistribution)

try:
    get_generative_model(graph=g)
except Exception as e:
    print("Error:", e)

Error: Variable <halerium.Variable 'g/v'> with NormalDistribution without mean is not fully determined by data. Variables with distribution type NormalDistribution must have a mean or be fully determined by data.

Distribution classes#

Normal distribution#

This creates a variable with a normal distribution:

[8]:

v = Variable("v", distribution=NormalDistribution, mean=0, variance=1)

The parameters of a NormalDistribution are ‘mean’ and ‘variance’, i.e. the mean and the variance of the distribution, respectively:

[9]:

v.distribution.parameter_names

[9]:

{'mean', 'variance'}

The mean can take any real value. The variance must be a positive number, otherwise the variable’s value is not-a-number.

The data type of a normally distributed variable is ‘float’ by default (but can also be ‘int’, in which case the value is rounded to the nearest integer):

[10]:

v.dtype

[10]:

'float'
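
Since the normal distribution is parametrized here by its variance, it helps to remember that many other libraries expect the standard deviation instead. The following sketch uses only NumPy, not Halerium, and the names `mean`, `variance`, `samples`, and `int_samples` are illustrative. It draws samples for a given variance and mimics the rounding described above for ‘int’ variables; how Halerium rounds internally is not shown in this document, so the `np.rint` call is an assumption.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

mean, variance = 0.0, 4.0

# NumPy expects the standard deviation (scale), i.e. the square root of the variance.
samples = rng.normal(loc=mean, scale=np.sqrt(variance), size=100_000)

# Round to the nearest integer, analogous to what is described for dtype 'int'
# (assumption: plain nearest-integer rounding).
int_samples = np.rint(samples).astype(int)

print("sample variance:", samples.var())  # close to the chosen variance for large samples
print("first rounded values:", int_samples[:5])
```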

Log-normal distribution#

This creates a variable with a log-normal distribution:

[11]:

v = Variable("v", distribution=LogNormalDistribution, mean_log=0, variance_log=1)

The parameters of a LogNormalDistribution are ‘mean_log’ and ‘variance_log’, i.e. the mean and the variance of the underlying normal distribution, respectively:

[12]:

v.distribution.parameter_names

[12]:

{'mean_log', 'variance_log'}

The parameter mean_log can take any real value. The parameter variance_log must be a positive number, otherwise the variable’s value is not-a-number.

The data type of a log-normally distributed variable is ‘float’:

[13]:

v.dtype

[13]:

'float'
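
The meaning of ‘mean_log’ and ‘variance_log’ can be checked numerically. The sketch below uses only NumPy, not Halerium, and its variable names are illustrative. It shows that the logarithm of log-normal samples follows a normal distribution with the given parameters, and that the mean of the variable itself is exp(mean_log + variance_log / 2) rather than mean_log.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

mean_log, variance_log = 0.0, 1.0

# NumPy's lognormal takes the mean and standard deviation of the underlying normal.
samples = rng.lognormal(mean=mean_log, sigma=np.sqrt(variance_log), size=200_000)

# Taking the logarithm recovers the underlying normal distribution.
log_samples = np.log(samples)

# Analytic mean of the log-normal variable itself (not of its logarithm).
expected_mean = np.exp(mean_log + variance_log / 2)

print("mean/variance of log(samples):", log_samples.mean(), log_samples.var())
print("sample mean vs. analytic mean:", samples.mean(), expected_mean)
```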

Laplace distribution#

This creates a variable with a Laplace distribution:

[14]:

v = Variable("v", distribution=LaplaceDistribution, mean=0, variance=1)

The parameters of the Laplace distribution are ‘mean’ and ‘variance’ (rather than the commonly used ‘scale’ = \(\sqrt{\text{variance}/2}\)):

[15]:

v.distribution.parameter_names

[15]:

{'mean', 'variance'}

The mean can take any real value. The variance must be a positive number, otherwise the variable’s value is not-a-number.

The data type of a Laplace distributed variable is ‘float’:

[16]:

v.dtype

[16]:

'float'
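
The relation between the ‘variance’ parameter and the more common ‘scale’ parameter can be verified with a short NumPy sketch. It does not use Halerium, and the names `scale` and `samples` are illustrative. A Laplace distribution with scale b has variance 2b², so b = sqrt(variance / 2).

```python
import numpy as np

rng = np.random.default_rng(seed=0)

mean, variance = 0.0, 1.0

# Convert the variance into the scale parameter that NumPy expects.
scale = np.sqrt(variance / 2)

samples = rng.laplace(loc=mean, scale=scale, size=200_000)

print("scale:", scale)
print("sample variance:", samples.var())  # close to the chosen variance for large samples
```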

Uniform distribution#

This creates a variable with a uniform distribution:

[17]:

v = Variable("v", distribution=UniformDistribution, center=0, width=1)

The parameters of a UniformDistribution are ‘center’ and ‘width’:

[18]:

v.distribution.parameter_names

[18]:

{'center', 'width'}

The center can be any real number, the width must be positive. The distribution’s possible values then lie in the interval [center - width/2, center + width/2].

The data type of a uniformly distributed variable is ‘float’:

[19]:

v.dtype

[19]:

'float'
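
The ‘center’ and ‘width’ parameters translate directly into the interval bounds given above. The following NumPy-only sketch (not Halerium code; `low`, `high`, and `samples` are illustrative names) computes the bounds and checks that samples stay inside them.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

center, width = 0.0, 1.0

# Interval bounds implied by center and width.
low = center - width / 2
high = center + width / 2

samples = rng.uniform(low=low, high=high, size=10_000)

print("interval:", (low, high))
print("sample range:", (samples.min(), samples.max()))
```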

Exponential distribution#

This creates a variable with an exponential distribution:

[20]:

v = Variable("v", distribution=ExponentialDistribution, mean=2.)

The parameter of an exponential distribution is either the ‘mean’ or ‘rate’ (i.e. the inverse of the mean):

[21]:

v.distribution.parameter_names

[21]:

{'mean', 'rate'}

The mean or rate can be any positive number. One can only specify one of the two parameters. The other is then set automatically to the corresponding value.

The data type of an exponentially distributed variable is ‘float’:

[22]:

v.dtype

[22]:

'float'
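
The correspondence between ‘mean’ and ‘rate’ is simply rate = 1 / mean. The sketch below illustrates it with NumPy only (no Halerium calls; the variable names are illustrative). Note that NumPy's `exponential` is parametrized by the mean, which it calls `scale`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

mean = 2.0
rate = 1.0 / mean  # the rate is the inverse of the mean

# NumPy's scale parameter is the mean, i.e. 1 / rate.
samples = rng.exponential(scale=1.0 / rate, size=200_000)

print("rate:", rate)
print("sample mean:", samples.mean())  # close to the chosen mean for large samples
```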

Dirac distribution#

This creates a variable with a Dirac distribution:

[23]:

v = Variable("v", distribution=DiracDistribution, mean=0.5)

The parameter of a DiracDistribution is the ‘mean’:

[24]:

v.distribution.parameter_names

[24]:

{'mean'}

The only possible value of a Dirac distributed variable is the value of its mean:

[25]:

v.evaluate()

[25]:

array(0.5)

By default, the data type of a Dirac distributed variable is ‘float’:

[26]:

v.dtype

[26]:

'float'

For variables with data type ‘float’, the mean can be any real number. For variables with data type ‘bool’, the mean must be boolean, too. For integer variables, the mean must be integer.

This creates a Dirac distributed boolean:

[27]:

v = Variable("v", distribution=DiracDistribution, dtype='bool', mean=True)
v.evaluate()

[27]:

array(True)

This creates a Dirac distributed integer:

[28]:

v = Variable("v", distribution=DiracDistribution, dtype='int', mean=3)
v.evaluate()

[28]:

array(3)

Bernoulli distribution#

This creates a variable with a Bernoulli distribution:

[29]:

v = Variable("v", distribution=BernoulliDistribution, mean=0.5)

The parameter of a BernoulliDistribution is either ‘logit’ or ‘mean’:

[30]:

v.distribution.parameter_names

[30]:

{'logit', 'mean'}

The logit can be any real number. The mean must be between 0 and 1. The logit and mean are not independent parameters. Thus one can only specify either the logit or the mean. The other parameter is then automatically set to the corresponding value.

The data type of a Bernoulli distributed variable is ‘bool’:

[31]:

v.dtype

[31]:

'bool'
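
The dependence between ‘logit’ and ‘mean’ can be made explicit with the standard definitions logit = log(mean / (1 - mean)) and mean = 1 / (1 + exp(-logit)). The following sketch assumes these standard definitions; it uses NumPy only, and the helper functions `mean_to_logit` and `logit_to_mean` are hypothetical names, not part of Halerium.

```python
import numpy as np


def mean_to_logit(mean):
    """Log-odds of a probability strictly between 0 and 1."""
    return np.log(mean / (1.0 - mean))


def logit_to_mean(logit):
    """Inverse of the logit (the logistic sigmoid)."""
    return 1.0 / (1.0 + np.exp(-logit))


for p in (0.1, 0.5, 0.9):
    logit = mean_to_logit(p)
    print(f"mean={p} -> logit={logit:.4f} -> mean={logit_to_mean(logit):.4f}")
```

A mean of 0.5 corresponds to a logit of 0; means below 0.5 give negative logits and means above 0.5 give positive ones.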

Categorical distribution#

This creates a variable with a categorical distribution:

[32]:

v = Variable("v", distribution=CategoricalDistribution, n_categories=3,
             probabilities=[0.5, 0.3, 0.2])

For a variable following a categorical distribution, it is mandatory to specify the number of categories n_categories when creating the variable.

The parameter of a CategoricalDistribution is either ‘probabilities’ or ‘logits’:

[33]:

v.distribution.parameter_names

[33]:

{'logits', 'probabilities'}

The logits and probabilities are not independent parameters. Thus one can only specify either the logits or the probabilities. The other parameter is then automatically set to the corresponding value.

The logits and probabilities must have at least one dimension. The size of the last dimension must match the number of categories, since each entry in that last dimension corresponds to one of the categories. The remaining dimensions must be compatible with the variable shape. The entries of the logits and probabilities can take any real values, since the logits and probabilities are treated as unnormalized.

The data type of a categorical variable is ‘int’:

[34]:

v.dtype

[34]:

'int'

The values represent the index of the category. Thus valid values must be in the range 0, 1, …, n_categories - 1. Negative values or values as large as or larger than the number of categories are not valid.
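
To illustrate what "unnormalized" means in practice, the sketch below converts unnormalized probabilities and logits into normalized probabilities and draws category indices. It assumes the usual conventions (division by the sum for non-negative probabilities, softmax for logits); Halerium's internal handling is not shown in this document and may differ. The code uses NumPy only, and the helper names `normalize_probabilities` and `logits_to_probabilities` are hypothetical.

```python
import numpy as np


def normalize_probabilities(probabilities):
    """Scale non-negative weights so the last axis sums to one."""
    p = np.asarray(probabilities, dtype=float)
    return p / p.sum(axis=-1, keepdims=True)


def logits_to_probabilities(logits):
    """Softmax over the last axis (one entry per category)."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


probs = normalize_probabilities([5, 3, 2])
probs_from_logits = logits_to_probabilities(np.log([5.0, 3.0, 2.0]))

rng = np.random.default_rng(seed=0)
# Category indices in the range 0, ..., n_categories - 1.
samples = rng.choice(len(probs), size=10_000, p=probs)

print("probabilities:", probs)
print("observed frequencies:", np.bincount(samples, minlength=len(probs)) / samples.size)
```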

Poisson distribution#

This creates a variable with a Poisson distribution:

[35]:

v = Variable("v", distribution=PoissonDistribution, mean=3.5)

The parameter of a PoissonDistribution is the ‘mean’:

[36]:

v.distribution.parameter_names

[36]:

{'mean'}

The mean can be any positive real number; it does not have to be an integer.

The data type of a Poisson distributed variable is ‘int’:

[37]:

v.dtype

[37]:

'int'
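
That a non-integer mean still yields integer values can be seen in a short NumPy-only sketch (not Halerium code; the variable names are illustrative). For a Poisson distribution the variance equals the mean, which the samples also reflect.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

mean = 3.5

# The samples are non-negative integers even though the mean is not an integer.
samples = rng.poisson(lam=mean, size=200_000)

print("dtype:", samples.dtype)
print("sample mean and variance:", samples.mean(), samples.var())  # both close to the chosen mean
```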