Probability Estimation

[1]:

%%capture
# execute the creation & training notebook first
%run "02-01-creation_and_training.ipynb"
# execute the outlier detection notebook
%run "02-05-outlier_detection.ipynb"

In the outlier detection section we saw how to detect outliers in a test data set and how the outlier threshold influenced the detection results. In the rank estimation section the underlying method of rank estimation was explained.

In this section we take a look at the .estimate_probabilities method, which calls the ProbabilityEstimator, the basis for both outlier detection and rank estimation.

The ProbabilityEstimator is used by the RankEstimator under the hood to estimate the probability of individual data points. Let us apply the probability estimator to the modified test data set from the outlier detection example.

[2]:

log_probabilities = causal_structure.estimate_probabilities(data=mod_test_data)
log_probabilities

[2]:

	(a)	(b|a)	(c|a,b)	graph
0	1.222395	-1.602953	1.237068	0.855358
1	1.413743	-1.012356	1.323633	1.725051
2	1.113812	-0.994486	1.244645	1.363961
3	0.006604	-1.219309	0.964815	-0.252899
4	1.405377	-0.984049	1.322710	1.744055
...	...	...	...	...
95	0.431912	-1.019023	0.986077	0.398876
96	1.329171	-1.179034	1.279977	1.429986
97	1.343442	-1.046909	1.310268	1.606940
98	1.408012	-1.011526	1.315708	1.712243
99	1.405317	-1.422481	1.324429	1.307159

100 rows × 4 columns

The returned numbers are logarithmic probability densities. These values are unbounded: they can take any floating-point value, positive or negative.

Since they are densities, their individual values do not tell us much about the likelihood of a given event. Only the comparison to other probability density values for the same parameter provides some insight.
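To see why a logarithmic probability density can be positive as well as negative, consider a plain normal distribution. The following is a minimal sketch that is independent of the ProbabilityEstimator; the helper `normal_log_pdf` and the chosen means and standard deviations are assumptions made for illustration only.

```python
import math


def normal_log_pdf(x, mean, std):
    """Logarithm of the normal probability density at x (hypothetical helper)."""
    return -0.5 * math.log(2.0 * math.pi * std**2) - (x - mean) ** 2 / (2.0 * std**2)


# A narrow distribution has a density above 1 at its mean, so the log density is positive.
narrow_at_mean = normal_log_pdf(0.0, mean=0.0, std=0.1)

# A wide distribution has a density below 1 everywhere, so the log density is negative.
wide_at_mean = normal_log_pdf(0.0, mean=0.0, std=10.0)

# Far out in the tail of the narrow distribution the log density is strongly negative.
narrow_in_tail = normal_log_pdf(5.0, mean=0.0, std=0.1)

print(narrow_at_mean, wide_at_mean, narrow_in_tail)
```

The sign of a single value therefore carries no meaning on its own. What matters is how a value compares to the other values of the same column.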

For example, let us extract the five data points with the lowest logarithmic probability density for the whole graph.

[3]:

log_probabilities.sort_values("graph").iloc[:5]

[3]:

	(a)	(b|a)	(c|a,b)	graph
50	0.987680	-13.097823	-245.485737	-262.044845
60	0.766149	-1.219600	-24.933954	-25.538929
74	-1.950167	-1.359456	0.267683	-3.049903
79	-0.471609	-2.757013	0.663875	-2.556750
13	-0.325763	-2.460974	0.718113	-2.064678

We see that the modified data points 50 and 60 indeed have very low logarithmic probability densities, much lower than any of the unmodified data points. This is why they were easily picked up by the OutlierDetector.
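The per-node columns also hint at which node makes a data point improbable. The sketch below compares rows 50 and 60 with the median of the ten rows displayed in the first output above. The values are copied from the printed outputs, and using those ten rows as a reference is an assumption made for illustration; this is not how the OutlierDetector works internally.

```python
import pandas as pd

node_columns = ["(a)", "(b|a)", "(c|a,b)"]

# Reference rows 0-4 and 95-99, copied from the displayed output of estimate_probabilities.
reference = pd.DataFrame(
    [
        [1.222395, -1.602953, 1.237068],
        [1.413743, -1.012356, 1.323633],
        [1.113812, -0.994486, 1.244645],
        [0.006604, -1.219309, 0.964815],
        [1.405377, -0.984049, 1.322710],
        [0.431912, -1.019023, 0.986077],
        [1.329171, -1.179034, 1.279977],
        [1.343442, -1.046909, 1.310268],
        [1.408012, -1.011526, 1.315708],
        [1.405317, -1.422481, 1.324429],
    ],
    columns=node_columns,
    index=[0, 1, 2, 3, 4, 95, 96, 97, 98, 99],
)

# The two modified data points, copied from the sorted output.
suspects = pd.DataFrame(
    [
        [0.987680, -13.097823, -245.485737],
        [0.766149, -1.219600, -24.933954],
    ],
    columns=node_columns,
    index=[50, 60],
)

# Compare each value only with values of the same column (same parameter).
deviation = suspects - reference.median()

# For each suspect row, the column that falls furthest below its typical value.
most_unusual_node = deviation.idxmin(axis=1)
print(most_unusual_node)
```

For both modified data points the largest drop relative to the reference rows appears in the '(c|a,b)' column.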

One more thing is noticeable when looking at the logarithmic probability densities: the sum of the ‘(a)’, ‘(b|a)’ and ‘(c|a,b)’ columns is close to the ‘graph’ column.

[4]:

log_probability_sums = log_probabilities[['(a)', '(b|a)', '(c|a,b)']].sum(axis=1)
log_probabilities_with_sum = log_probabilities.copy()
log_probabilities_with_sum["sum of '(a)', '(b|a)', '(c|a,b)'"] = log_probability_sums
log_probabilities_with_sum.sort_values("graph").iloc[:5]

[4]:

	(a)	(b|a)	(c|a,b)	graph	sum of '(a)', '(b|a)', '(c|a,b)'
50	0.987680	-13.097823	-245.485737	-262.044845	-257.595880
60	0.766149	-1.219600	-24.933954	-25.538929	-25.387404
74	-1.950167	-1.359456	0.267683	-3.049903	-3.041940
79	-0.471609	-2.757013	0.663875	-2.556750	-2.564747
13	-0.325763	-2.460974	0.718113	-2.064678	-2.068623

This is no accident. The chain rule (or product rule) of conditional probabilities states that the total probability density is the product of the correctly conditioned probability densities, here p(a, b, c) = p(a) · p(b|a) · p(c|a,b). The graph column represents the total logarithmic probability density, and taking the logarithm turns the product into a sum.
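The chain rule can be checked numerically on a small toy model. The sketch below is not related to the trained causal structure: it assumes a hypothetical two-node model with a ~ N(0, 1) and b | a ~ N(2a, 0.5²), for which the joint density is a bivariate normal with known covariance. All names and numbers are illustrative assumptions.

```python
import math


def normal_log_pdf(x, mean, std):
    """Logarithm of the normal probability density at x (hypothetical helper)."""
    return -0.5 * math.log(2.0 * math.pi * std**2) - (x - mean) ** 2 / (2.0 * std**2)


# Toy model (assumption): a ~ N(0, 1), b | a ~ N(2 * a, 0.5**2)
a, b = 0.3, 1.1

# Sum of the conditioned log densities, analogous to summing the node columns.
log_p_a = normal_log_pdf(a, mean=0.0, std=1.0)
log_p_b_given_a = normal_log_pdf(b, mean=2.0 * a, std=0.5)
chain_rule_sum = log_p_a + log_p_b_given_a

# Joint log density of (a, b): zero-mean bivariate normal with
# Var(a) = 1, Cov(a, b) = 2, Var(b) = 2**2 * 1 + 0.5**2 = 4.25.
var_a, cov_ab, var_b = 1.0, 2.0, 4.25
determinant = var_a * var_b - cov_ab**2
quadratic_form = (var_b * a**2 - 2.0 * cov_ab * a * b + var_a * b**2) / determinant
joint_log_pdf = -math.log(2.0 * math.pi) - 0.5 * math.log(determinant) - 0.5 * quadratic_form

print(chain_rule_sum, joint_log_pdf)
```

In this toy model the two numbers agree up to floating-point error, because there are no additional terms beyond the two conditioned densities.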

The sum and the graph column do not match exactly because the graph column also contains the probabilities of the trained regression parameters, which enter the total sum.

For further details about the ProbabilityEstimator see the corresponding section in the core-documentation.