Estimate ranks with the RankEstimator#

The Halerium objective class ProbabilityEstimator can be used to compute probabilities of events. That class yields logarithmic probability densities. However, from such values alone it is often hard to judge whether an event is typical or very unlikely.

The Halerium objective class RankEstimator computes the probabilities of events and ranks these among a population of hypothetical random events. The resulting rank of an event is the fraction of these random events that have the same or a lower probability than the event in question. Event ranks returned by the RankEstimator are thus numbers ranging from 0 to 1. Values close to 0 indicate a rare event. Values close to 1 indicate a common event.
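The ranking idea can be illustrated with a small plain-NumPy sketch (a conceptual illustration only, not the Halerium implementation): draw a population of hypothetical random events, and count which fraction of them has a probability density no higher than that of the event in question.

import numpy as np

rng = np.random.default_rng(0)

def log_density(z):
    # log density of a standard normal, up to an additive constant
    return -0.5 * z**2

random_events = rng.standard_normal(100_000)  # hypothetical random events
observation = 2.5                             # the event in question

rank = np.mean(log_density(random_events) <= log_density(observation))
print(rank)  # roughly 0.012, i.e. a rather rare event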

Ingredients#

To estimate ranks, you only need a graph and data.

Imports#

[1]:
# for handling data:
import numpy as np

# for plotting:
import matplotlib.pyplot as plt

# for some Halerium specific functions:
import halerium.core as hal

# for building graphs:
from halerium.core import Graph, Entity, Variable, show

# for estimating ranks:
from halerium import RankEstimator

The graph and input data#

Let us define a simple graph:

[2]:
g = Graph("g")
with g:
    Variable("z", mean=0, variance=1)
    Variable("x", mean=z, variance=0.1)
    Variable("y", mean=z, variance=0.1)

show(g)

This graph just contains the three variables x, y, and z. The means of x and y are set to z, and the variances of x and y are much smaller than that of z. Thus, both x and y depend on the variable z and follow it fairly closely.
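To get a feeling for how closely x and y follow z, one can sample the same generative process directly with NumPy (a quick sketch outside Halerium, using the means and variances defined above):

rng = np.random.default_rng(0)
z = rng.normal(0, 1, size=10_000)   # z ~ N(0, variance 1)
x = rng.normal(z, np.sqrt(0.1))     # x ~ N(z, variance 0.1)
y = rng.normal(z, np.sqrt(0.1))     # y ~ N(z, variance 0.1)
print(np.corrcoef(x, y)[0, 1])      # roughly 0.9: x and y are strongly correlated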

Now let’s look at a certain set of data.

[3]:
data = {g.z: [-1, 0, 1., 2., 3.]}

The RankEstimator#

We instantiate a RankEstimator.

[4]:
rank_estimator = RankEstimator(graph=g,
                               data=data)

We can now call the rank_estimator with the graph g as argument to estimate the ranks of the events in the data.

[5]:
rank_estimator(g)
[5]:
array([0.506, 0.717, 0.511, 0.141, 0.018])

We can see that values of 2 or 3 for z are uncommon (only a small fraction of random events would have the same or a lower probability), whereas values of -1, 0, or 1 are more common.

One can call the rank_estimator with any scopetor of the graph.

[6]:
rank_estimator(g.z)
[6]:
array([0.328, 1.   , 0.328, 0.055, 0.003])

The ranks are then estimated based only on the data of the variables in that scopetor.
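For g.z alone, these numbers can be checked by hand: z has a standard normal prior, and the fraction of random draws with a density no higher than that of an observed value equals the two-sided tail probability P(|Z| >= |z|). A quick Monte-Carlo check in plain NumPy (independent of Halerium) reproduces the ranks above up to sampling noise:

population = np.random.default_rng(0).standard_normal(1_000_000)
for z in [-1, 0, 1., 2., 3.]:
    print(z, np.mean(np.abs(population) >= abs(z)))
# prints approximately 0.317, 1.0, 0.317, 0.046, 0.003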

For example, since we did not specify any data for x, its missing values are filled in by sampling (see the section on scopetors without data below), and the resulting ranks are those of typical random events:

[7]:
rank_estimator(g.x)
[7]:
array([0.405, 0.401, 0.389, 0.367, 0.38 ])

Results for multiple graph elements can be obtained using nested structures, e.g. dicts.

[8]:
rank_estimator({'x': g.x, 'y': g.y, 'z': g.z})
[8]:
{'x': array([0.405, 0.401, 0.389, 0.367, 0.38 ]),
 'y': array([0.397, 0.423, 0.412, 0.43 , 0.4  ]),
 'z': array([0.328, 1.   , 0.328, 0.055, 0.003])}

Calling without an argument yields results for the graph, its subgraphs, entities, and variables as a dict.

[9]:
rank_estimator()
[9]:
{'g': array([0.506, 0.717, 0.511, 0.141, 0.018]),
 'g/z': array([0.328, 1.   , 0.328, 0.055, 0.003]),
 'g/x': array([0.405, 0.401, 0.389, 0.367, 0.38 ]),
 'g/y': array([0.397, 0.423, 0.412, 0.43 , 0.4  ])}

Missing data#

Missing values for events can be indicated in the data with np.nan.

[10]:
data = {g.z: [0, 1., np.nan]}
[11]:
rank_estimator = RankEstimator(graph=g,
                               data=data)
rank_estimator(g)
[11]:
array([0.701, 0.492, 0.533])

Data of correlated variables#

Correlations between variables also affect the ranks.

Let’s look at the following data and the resulting ranks.

[12]:
data = {g.x: [0,  1,  1,      1],
        g.y: [0,  1, -1, np.nan]}

rank_estimator = RankEstimator(graph=g,
                               data=data)
display(rank_estimator())
{'g': array([0.882, 0.663, 0.   , 0.519]),
 'g/z': array([0.823, 0.336, 0.82 , 0.349]),
 'g/x': array([0.551, 0.533, 0.002, 0.429]),
 'g/y': array([0.563, 0.552, 0.004, 0.426])}

Events in which x and y take similar values (both 0, or both 1) are estimated to be common, with ranks above 0.5 for g. In contrast, the event with x = 1 and y = -1 appears almost impossible, with a rank close to 0: since both x and y closely follow z, such a large difference between them is extremely unlikely.
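Why x = 1 together with y = -1 is essentially impossible can be seen directly from the graph: given z, the difference x - y is normally distributed with mean 0 and variance 0.1 + 0.1 = 0.2, independent of z, so a difference of 2 lies more than four standard deviations out. A small NumPy sketch (again outside Halerium) confirms this:

rng = np.random.default_rng(0)
diff = rng.normal(0, np.sqrt(0.1 + 0.1), size=1_000_000)  # distribution of x - y
print(np.mean(np.abs(diff) >= 2))  # on the order of 1e-5: virtually never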

Options#

Accuracy#

The accuracy of the RankEstimator is controlled by the number of samples (the n_samples argument) used for estimating probabilities and ranks.

A small number of samples yields quick results at the expense of accuracy.

[13]:
data = {g.z: [-1, 0, 1., 2., 3.]}
rank_estimator = RankEstimator(graph=g,
                               data=data,
                               n_samples=100)
rank_estimator()
[13]:
{'g': array([0.43, 0.7 , 0.44, 0.07, 0.01]),
 'g/z': array([0.31, 1.  , 0.31, 0.03, 0.  ]),
 'g/x': array([0.34, 0.43, 0.43, 0.4 , 0.43]),
 'g/y': array([0.38, 0.39, 0.34, 0.36, 0.39])}

A large number of samples yields higher accuracy at the expense of speed and memory (in the current implementation, the computing time and memory consumption grow quadratically with the number of samples).

[14]:
data = {g.z: [-1, 0, 1., 2., 3.]}
rank_estimator = RankEstimator(graph=g,
                               data=data,
                               n_samples=1000)
rank_estimator()
[14]:
{'g': array([0.521, 0.75 , 0.524, 0.14 , 0.018]),
 'g/z': array([0.315, 1.   , 0.315, 0.045, 0.   ]),
 'g/x': array([0.424, 0.437, 0.444, 0.444, 0.42 ]),
 'g/y': array([0.389, 0.397, 0.382, 0.38 , 0.376])}
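To gauge this trade-off on a particular machine, one can simply time the estimator for a few sample sizes (an illustrative sketch; the exact timings, and how clearly the quadratic trend shows, depend on the machine and the graph):

import time

for n in [100, 300, 1000]:
    estimator = RankEstimator(graph=g, data=data, n_samples=n)
    start = time.perf_counter()
    estimator()
    print(n, "samples:", round(time.perf_counter() - start, 3), "s")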

Scopetors without data#

By default, missing data are estimated by sampling, so that variables and scopetors without any data are still assigned a rank.

(This is consistent with what is returned for data values of np.nan.)

[15]:
data = {g.z: [-1, 0, 1., 2., 3.]}
rank_estimator = RankEstimator(graph=g,
                               data=data)
rank_estimator()
[15]:
{'g': array([0.48 , 0.703, 0.479, 0.149, 0.016]),
 'g/z': array([0.312, 1.   , 0.312, 0.048, 0.002]),
 'g/x': array([0.401, 0.405, 0.405, 0.399, 0.401]),
 'g/y': array([0.397, 0.395, 0.397, 0.392, 0.395])}

Alternatively, one can tell the estimator to marginalize over the missing values. In that case, variables and scopetors without any data are considered common, and their ranks are estimated as 1.

Warning: the marginalized method can consume a lot of memory.

[16]:
data = {g.z: [-1, 0, 1., 2., 3.]}
rank_estimator = RankEstimator(graph=g,
                               data=data,
                               method="marginalized")
rank_estimator()
[16]:
{'g': array([0.33 , 1.   , 0.338, 0.043, 0.002]),
 'g/z': array([0.33 , 1.   , 0.338, 0.043, 0.002]),
 'g/x': array([1., 1., 1., 1., 1.]),
 'g/y': array([1., 1., 1., 1., 1.])}