Probability Estimation
[1]:
%%capture
# execute the creation & training notebook first
%run "02-01-creation_and_training.ipynb"
# execute the outlier detection notebook
%run "02-05-outlier_detection.ipynb"
In the outlier detection section we saw how to detect outliers in a test data set and how the outlier threshold influences the detection results. In the rank estimation section the underlying rank estimation method was explained.
In this section we take a look at the .estimate_probabilities method, which calls the ProbabilityEstimator, the basis for both outlier detection and rank estimation. The ProbabilityEstimator is used by the RankEstimator under the hood to estimate the probabilities of individual data points. Let us apply the probability estimator to the modified test data set from the outlier detection example.
[2]:
log_probabilities = causal_structure.estimate_probabilities(data=mod_test_data)
log_probabilities
[2]:
|  | (a) | (b\|a) | (c\|a,b) | graph |
|---|---|---|---|---|
| 0 | 1.222395 | -1.602953 | 1.237068 | 0.855358 |
| 1 | 1.413743 | -1.012356 | 1.323633 | 1.725051 |
| 2 | 1.113812 | -0.994486 | 1.244645 | 1.363961 |
| 3 | 0.006604 | -1.219309 | 0.964815 | -0.252899 |
| 4 | 1.405377 | -0.984049 | 1.322710 | 1.744055 |
| ... | ... | ... | ... | ... |
| 95 | 0.431912 | -1.019023 | 0.986077 | 0.398876 |
| 96 | 1.329171 | -1.179034 | 1.279977 | 1.429986 |
| 97 | 1.343442 | -1.046909 | 1.310268 | 1.606940 |
| 98 | 1.408012 | -1.011526 | 1.315708 | 1.712243 |
| 99 | 1.405317 | -1.422481 | 1.324429 | 1.307159 |
100 rows × 4 columns
The numbers that are returned are logarithmic probability densities. These values are unbounded; they can assume any floating point value.
Since they are densities, their individual values do not tell us much about the likelihood of a given event. Only the comparison with other probability density values for the same parameter provides insight.
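For instance, the difference of two log densities is the logarithm of the density ratio, so exponentiating it tells us how much denser one data point is than another. The following is a minimal sketch using NumPy and the log_probabilities frame from above; the choice of rows 0 and 3 is arbitrary and purely illustrative:

import numpy as np

# The difference of graph-level log densities is the log of the density ratio.
# exp() of that difference says how much denser row 0 is than row 3 under the
# trained model (both rows chosen arbitrarily for illustration).
diff = log_probabilities.loc[0, "graph"] - log_probabilities.loc[3, "graph"]
ratio = np.exp(diff)
print(f"Row 0 is roughly {ratio:.1f} times as dense as row 3.")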
For example, let us extract the five data points with the lowest probability for the whole graph.
[3]:
log_probabilities.sort_values("graph").iloc[:5]
[3]:
|  | (a) | (b\|a) | (c\|a,b) | graph |
|---|---|---|---|---|
| 50 | 0.987680 | -13.097823 | -245.485737 | -262.044845 |
| 60 | 0.766149 | -1.219600 | -24.933954 | -25.538929 |
| 74 | -1.950167 | -1.359456 | 0.267683 | -3.049903 |
| 79 | -0.471609 | -2.757013 | 0.663875 | -2.556750 |
| 13 | -0.325763 | -2.460974 | 0.718113 | -2.064678 |
We see that the modified data points 50 and 60 indeed have very low logarithmic probability densities, much lower than any of the unmodified data points. This is why they were easily picked up by the OutlierDetector.
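The 'graph' column can also be used to flag such points by hand. Below is a minimal sketch with a purely hypothetical cut-off of -10 on the graph log density; the threshold the OutlierDetector actually uses may be derived differently:

# Hypothetical cut-off on the graph-level log density (an assumption for
# illustration, not the OutlierDetector's actual threshold).
threshold = -10.0
flagged = log_probabilities[log_probabilities["graph"] < threshold]
print(flagged.index.tolist())  # the modified rows 50 and 60 fall below this cut-off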
One more thing is noticeable when looking at the logarithmic probability densities: the sum of the '(a)', '(b|a)' and '(c|a,b)' columns seems to be close to the 'graph' column.
[4]:
log_probability_sums = log_probabilities[['(a)', '(b|a)', '(c|a,b)']].sum(axis=1)
log_probabilities_with_sum = log_probabilities.copy()
log_probabilities_with_sum["sum of '(a)', '(b|a)', '(c|a,b)'"] = log_probability_sums
log_probabilities_with_sum.sort_values("graph").iloc[:5]
[4]:
|  | (a) | (b\|a) | (c\|a,b) | graph | sum of '(a)', '(b\|a)', '(c\|a,b)' |
|---|---|---|---|---|---|
| 50 | 0.987680 | -13.097823 | -245.485737 | -262.044845 | -257.595880 |
| 60 | 0.766149 | -1.219600 | -24.933954 | -25.538929 | -25.387404 |
| 74 | -1.950167 | -1.359456 | 0.267683 | -3.049903 | -3.041940 |
| 79 | -0.471609 | -2.757013 | 0.663875 | -2.556750 | -2.564747 |
| 13 | -0.325763 | -2.460974 | 0.718113 | -2.064678 | -2.068623 |
This is not by accident. The chain rule (or product rule) of conditional probabilities tells us that the product of the (correctly conditioned) probability densities equals the total probability density. The 'graph' column represents the total logarithmic probability density, and for logarithmic probability densities the product becomes a sum.
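Written out for the three nodes of this graph, the chain rule and its logarithmic form read:

$$
p(a, b, c) = p(a)\,p(b \mid a)\,p(c \mid a, b),
\qquad
\log p(a, b, c) = \log p(a) + \log p(b \mid a) + \log p(c \mid a, b).
$$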
The reason why the sum and the graph column do not match exactly is that the graph also contains the probabilities of the trained regression parameters, which enter the total sum.
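To see how large that extra contribution is for each row, we can compare the 'graph' column with the plain sum; a minimal sketch reusing the objects created in the cell above:

# Remainder between the total graph log density and the sum of the
# per-variable terms; per the explanation above, this remainder stems from
# the probabilities of the trained regression parameters.
residual = log_probabilities_with_sum["graph"] - log_probability_sums
print(residual.describe())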
For further details about the ProbabilityEstimator see the corresponding section in the core-documentation.