Evaluate the quality of a model with the Evaluator

Given a model, important questions for deciding further steps are, for example:

- How well does the model capture reality?
- Do certain parts of the model require modifications to better explain existing data?
- How accurate will predictions from the model be?
- Does the model need more training data to improve?

To answer such questions, one needs to quantify the quality of the model. The Evaluator is the objective class in Halerium for quantifying the model quality by comparing predictions with actual data.

To evaluate the quality of a model represented by a graph, an Evaluator instance is created with a graph, data, and a collection of variables to consider as prediction inputs. Optionally, the prediction output variables and various metrics for evaluation can be specified. Calling that evaluator instance with graph elements then returns their scores, i.e. the evaluation results according to the specified metrics.

Imports

We first import the required packages, classes, functions, etc.

[1]:
# for handling data:
import numpy as np

# for building graphs:
from halerium.core import Graph, Entity, Variable, show

# for evaluating models:
from halerium import Evaluator

# for predicting:
from halerium import Predictor

The graph and data

Let us define a simple graph.

[2]:
graph = Graph("graph")
with graph:
    Entity("e")
    with graph.e:
        Variable("x", mean=0, variance=1)
    Entity("f")
    with graph.f:
        Variable("y", variance=1)
        Variable("z", variance=1)

    graph.f.y.mean = graph.e.x + 1
    graph.f.z.mean = graph.e.x + 2

show(graph)

Let us create some data for the evaluation, representing ‘real-life’ data. Here, we specify values for graph.e.x, graph.f.y, and graph.f.z in a dictionary:

[3]:
x_data = np.linspace(-1, 1, 1000)
y_data = x_data + 1 + 0.5 * np.random.normal(size=x_data.shape)
z_data = x_data + 2 + 0.5 * np.random.normal(size=x_data.shape)

data = {graph.e.x: x_data, graph.f.y: y_data, graph.f.z: z_data}
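The noise above is drawn without a fixed seed, so the exact scores shown in this notebook will differ from run to run. A minimal sketch of a reproducible variant follows, assuming only NumPy; the seed value and the names `rng`, `x_values`, `y_values`, and `z_values` are hypothetical and not part of the original notebook.

```python
import numpy as np

# Hypothetical seed; any fixed integer makes the noise reproducible.
rng = np.random.default_rng(seed=42)

x_values = np.linspace(-1, 1, 1000)
# Same construction as above: a linear relation plus Gaussian noise of std 0.5.
y_values = x_values + 1 + 0.5 * rng.normal(size=x_values.shape)
z_values = x_values + 2 + 0.5 * rng.normal(size=x_values.shape)

print(x_values.shape, y_values.shape, z_values.shape)
```

These arrays could then be placed in the data dictionary in the same way as x_data, y_data, and z_data.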

We also specify which variables are considered inputs for the predictions in the evaluation:

[4]:
inputs = {graph.e.x}

The evaluator

Now we create an evaluator, providing it with the graph, the data, and the inputs:

[5]:
evaluator = Evaluator(graph=graph,
                      data=data,
                      inputs=inputs)

Note that the policy for objectives is to provide as much information as possible when the objective instance is created. Calling the instance then only requires the graph elements one wishes to obtain information for. For the evaluator, the required information is the graph, the data, and which variables to treat as inputs. Optional arguments are discussed below.

We can now call the evaluator to obtain a score, i.e. the result of the evaluation. By default, the coefficient of determination, also known as the r2 score, is returned (other scores are discussed below). For the variable graph.f.y:

[6]:
evaluator(graph.f.y)
[6]:
0.5843139758804087
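For orientation, the coefficient of determination compares the residual sum of squares with the total sum of squares of the data, R² = 1 - SS_res / SS_tot. The following is a minimal NumPy sketch of that formula, independent of Halerium. It assumes that the model's prediction for y given x is simply x + 1 (the mean defined in the graph), and it uses a hypothetical seed, so it will not reproduce the exact score above.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed, for reproducibility only
x = np.linspace(-1, 1, 1000)
y_true = x + 1 + 0.5 * rng.normal(size=x.shape)

# Assumption: given x, the model's mean prediction for y is x + 1.
y_pred = x + 1

ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot
print(r2)
```

With a noise variance of 0.25 and a variance of about 1/3 for x, one expects a value near 1 - 0.25 / (1/3 + 0.25) ≈ 0.57, which is in line with the score returned by the evaluator.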

The argument of the evaluator call, i.e. graph.f.y in this example, is the target of the evaluation. The evaluator can also be called with nested structures, e.g. to compute scores for several variables.

[7]:
score = evaluator({'x': graph.e.x, 'y': graph.f.y, 'z': graph.f.z})
display(score)

y_score = score['y']
print('score for y:', y_score)
{'x': None, 'y': 0.5843139758804087, 'z': 0.5737885533770486}
score for y: 0.5843139758804087

Meaningful scores can only be computed for variables that have data but are not input variables. For other variables, the result is None. This is the case for graph.e.x here, which was chosen as input variable, and thus is not considered as output variable of the prediction for the computation of scores.

For graph elements other than variables, their score is computed from the scores of their contained variables. For example:

[8]:
evaluator({'e': graph.e, 'f': graph.f})
[8]:
{'e': None, 'f': 0.579051264628729}

Since the entity graph.e does not contain any variable with a score, its own score is None. The entity graph.f contains the two variables graph.f.y and graph.f.z with scores. Its own score is computed by averaging those variables’ scores (other reduction functions are discussed below).
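The averaging can be checked by hand. The sketch below is plain Python, not Halerium code; it takes the variable scores from the output of cell [7], skips entries without a score, and averages the rest. The names `variable_scores` and `entity_score` are chosen for illustration.

```python
# Scores of the variables contained in graph.f, copied from the output of cell [7].
variable_scores = {"y": 0.5843139758804087, "z": 0.5737885533770486}

# Drop entries without a score (None), then average the remaining ones.
valid = [s for s in variable_scores.values() if s is not None]
entity_score = sum(valid) / len(valid) if valid else None
print(entity_score)
```

The result agrees with the score reported for graph.f up to floating-point rounding.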

If no target for the evaluation is specified, a dictionary with all graph elements is returned:

[9]:
evaluator()
[9]:
{'graph': 0.579051264628729,
 'graph/e': None,
 'graph/e/x': None,
 'graph/f': 0.579051264628729,
 'graph/f/y': 0.5843139758804087,
 'graph/f/z': 0.5737885533770486}

Visualizing the evaluator

You can add the information of the objective to the graph visualization by using show and then activating the objective’s button in the bottom right of the canvas.

[10]:
show(evaluator)

Options

Specifying targets explicitly

One can specify the outputs for the evaluation explicitly. This may be useful when only select variables should contribute to the scores of enclosing scopetors.

Here, we restrict the outputs to graph.f.y:

[11]:
evaluator = Evaluator(graph=graph,
                      data=data,
                      inputs=inputs,
                      outputs={graph.f.y})

evaluator()
[11]:
{'graph': 0.5843139758804086,
 'graph/e': None,
 'graph/e/x': None,
 'graph/f': 0.5843139758804086,
 'graph/f/y': 0.5843139758804086,
 'graph/f/z': None}

Thus, there is no score for graph.f.z. Moreover, the score for graph.f is only based on the result for graph.f.y.

Specifying the metrics

By default, the evaluator returns scores for the coefficient of determination (a.k.a. r2 score). Other supported metrics include, for example, the root mean square error:

[12]:
evaluator = Evaluator(graph=graph,
                      data=data,
                      inputs=inputs,
                      metric='rmse')

evaluator()
[12]:
{'graph': 0.4975957581312331,
 'graph/e': None,
 'graph/e/x': None,
 'graph/f': 0.4975957581312331,
 'graph/f/y': 0.49696902430467926,
 'graph/f/z': 0.4982224919577874}

The documentation of the Evaluator class provides more information on recognized names of supported metrics and how to provide custom metrics.

Several metrics can be provided together in a nested structure, e.g. a list or dict, and computed at once:

[13]:
evaluator = Evaluator(graph=graph,
                      data=data,
                      inputs=inputs,
                      metric={'error': 'rmse', 'rel. error': 'r_rmse', 'r2': 'r2'})

evaluator()
[13]:
{'graph': {'error': 0.49759575813123313,
  'rel. error': 0.648792945470825,
  'r2': 0.5790512646287292},
 'graph/e': None,
 'graph/e/x': None,
 'graph/f': {'error': 0.49759575813123313,
  'rel. error': 0.648792945470825,
  'r2': 0.5790512646287292},
 'graph/f/y': {'error': 0.49696902430467926,
  'rel. error': 0.6447371744514128,
  'r2': 0.5843139758804086},
 'graph/f/z': {'error': 0.4982224919577874,
  'rel. error': 0.6528487164902382,
  'r2': 0.5737885533770486}}
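The three metrics are closely related. The sketch below uses NumPy only and is not Halerium's implementation. It assumes that the prediction for y is x + 1 and that the relative error is the RMSE divided by the standard deviation of the data. The latter definition is an inference from the numbers above, which satisfy r2 ≈ 1 - (rel. error)², and is not taken from the Evaluator documentation.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed
x = np.linspace(-1, 1, 1000)
y_true = x + 1 + 0.5 * rng.normal(size=x.shape)
y_pred = x + 1  # assumption: the model's mean prediction given x

# Root mean square error of the predictions.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
# Assumed definition of the relative error: RMSE over the data's standard deviation.
r_rmse = rmse / np.std(y_true)
# Coefficient of determination.
r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(rmse, r_rmse, r2)
print(np.isclose(r2, 1.0 - r_rmse ** 2))

# The same relation holds for the scores of graph.f.y reported above.
print(np.isclose(0.5843139758804086, 1.0 - 0.6447371744514128 ** 2))
```

Under these assumptions the RMSE is close to the noise standard deviation of 0.5, matching the 'error' entries in the output.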

Specifying the reduction function

By default, the scores for graph elements are computed by averaging the scores of the contained target variables. By specifying the reduction function, this can be changed to, e.g., the minimum, median, or maximum. Different reductions can also be performed together:

[14]:
evaluator = Evaluator(graph=graph,
                      data=data,
                      inputs=inputs,
                      metric={'error': 'rmse', 'rel. error': 'r_rmse', 'r2': 'r2'},
                      reduction={'min': 'min', 'mean': 'mean', 'max': 'max'}
                     )

evaluator()
[14]:
{'graph': {'error': {'min': 0.496969024304679,
   'mean': 0.49759575813123313,
   'max': 0.4982224919577873},
  'rel. error': {'min': 0.6447371744514122,
   'mean': 0.648792945470825,
   'max': 0.6528487164902379},
  'r2': {'min': 0.573788553377049,
   'mean': 0.5790512646287292,
   'max': 0.5843139758804095}},
 'graph/e': None,
 'graph/e/x': None,
 'graph/f': {'error': {'min': 0.496969024304679,
   'mean': 0.49759575813123313,
   'max': 0.4982224919577873},
  'rel. error': {'min': 0.6447371744514122,
   'mean': 0.648792945470825,
   'max': 0.6528487164902379},
  'r2': {'min': 0.573788553377049,
   'mean': 0.5790512646287292,
   'max': 0.5843139758804095}},
 'graph/f/y': {'error': {'min': 0.49696902430467926,
   'mean': 0.49696902430467926,
   'max': 0.49696902430467926},
  'rel. error': {'min': 0.6447371744514128,
   'mean': 0.6447371744514128,
   'max': 0.6447371744514128},
  'r2': {'min': 0.5843139758804086,
   'mean': 0.5843139758804086,
   'max': 0.5843139758804086}},
 'graph/f/z': {'error': {'min': 0.49822249195778734,
   'mean': 0.49822249195778734,
   'max': 0.49822249195778734},
  'rel. error': {'min': 0.6528487164902381,
   'mean': 0.6528487164902381,
   'max': 0.6528487164902381},
  'r2': {'min': 0.5737885533770486,
   'mean': 0.5737885533770486,
   'max': 0.5737885533770486}}}
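Conceptually, each named reduction is a function applied to the scores of the contained variables. The following plain-Python sketch mirrors the 'r2' entry of graph.f above. It illustrates the idea only and does not show how Halerium resolves the reduction names internally; the mapping `reductions` is hypothetical.

```python
import statistics

# r2 scores of the variables in graph.f, copied from the output above.
scores = [0.5843139758804086, 0.5737885533770486]

# Hypothetical mapping from reduction names to Python callables.
reductions = {"min": min, "mean": statistics.mean, "max": max}

reduced = {name: fn(scores) for name, fn in reductions.items()}
print(reduced)
```

For a single variable such as graph.f.y, all reductions act on one score only, which is why its 'min', 'mean', and 'max' entries coincide.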

Choosing a method

The method for computing predictions can be specified using the method argument.

This will create an evaluator that employs the MAP (maximum a posteriori) method for predictions:

[15]:
map_evaluator = Evaluator(graph=graph,
                          data=data,
                          inputs=inputs,
                          method="MAP")

Additional arguments to the underlying model implementing the method can be specified in model_args. Arguments to the solver employed by the model can be specified using the solver_args argument. See the documentation for the available methods and model and solver arguments they accept.

Choosing a measure

By default, the evaluator employs mean values as predictions. Other statistical measures can be employed for predictions as well. Meaningful choices are measures of location such as the mean and the median. For example, this evaluates the model employing the median in predictions:

[16]:
median_evaluator = Evaluator(graph=graph,
                             data=data,
                             inputs=inputs,
                             measure='median')
median_evaluator()
[16]:
{'graph': 0.5790512646287291,
 'graph/e': None,
 'graph/e/x': None,
 'graph/f': 0.5790512646287291,
 'graph/f/y': 0.5843139758804086,
 'graph/f/z': 0.5737885533770486}

A user can also supply a custom measure, as long as it returns arrays of the same shape as the variables.
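The near-identical scores for mean and median are expected here, because for symmetric distributions such as the normal distribution both measures of location coincide. The NumPy sketch below illustrates this, and also what "arrays of the same shape as the variables" means for a measure. The sample layout (one row per drawn example, one column per data point) is an assumption made for illustration and is not Halerium's documented interface for custom measures.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed

# Assumed layout: 5000 drawn examples (rows) for a variable with 3 data points (columns).
samples = rng.normal(loc=[1.0, 2.0, 3.0], scale=1.0, size=(5000, 3))

# A measure of location reduces over the examples axis,
# leaving one predicted value per data point.
mean_prediction = samples.mean(axis=0)
median_prediction = np.median(samples, axis=0)

print(mean_prediction.shape, median_prediction.shape)
print(np.allclose(mean_prediction, median_prediction, atol=0.1))
```

For skewed distributions the two measures can differ noticeably, in which case the choice of measure affects the scores.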

Accuracy

Besides choosing the method and its model and solver parameters, the accuracy of the model predictions, and thus of the scores, may be influenced by changing the number of examples from which the predictions are estimated.

A small number of examples makes predictions and evaluation faster, but potentially less accurate. A MAP model does not require more than one example.

[17]:
fast_evaluator = Evaluator(graph=graph,
                           data=data,
                           inputs=inputs,
                           method='MAP',
                           n_samples=1)
fast_evaluator()
[17]:
{'graph': 0.5790512646287291,
 'graph/e': None,
 'graph/e/x': None,
 'graph/f': 0.5790512646287291,
 'graph/f/y': 0.5843139758804086,
 'graph/f/z': 0.5737885533770486}

Employing a MAP-Fisher model with many examples may yield more accurate predictions and scores (though not in this simple example) at the expense of speed:

[18]:
slow_evaluator = Evaluator(graph=graph,
                           data=data,
                           inputs=inputs,
                           method='MAPFisher',
                           n_samples=10000)
slow_evaluator()
[18]:
{'graph': 0.5790512646287289,
 'graph/e': None,
 'graph/e/x': None,
 'graph/f': 0.5790512646287289,
 'graph/f/y': 0.5843139758804086,
 'graph/f/z': 0.5737885533770481}
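The trade-off between speed and accuracy can be illustrated with a generic Monte Carlo experiment. The sketch below uses NumPy only and does not use Halerium's sampler. It estimates a mean from a varying number of drawn examples and measures how much the estimate scatters over repeated runs; the loop sizes and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed
true_mean = 1.0

spreads = {}
for n_examples in (10, 100, 10000):
    # Repeat the estimate 200 times to see how much it fluctuates.
    estimates = [rng.normal(true_mean, 1.0, size=n_examples).mean()
                 for _ in range(200)]
    spreads[n_examples] = float(np.std(estimates))

print(spreads)
```

The scatter shrinks roughly in proportion to one over the square root of the number of examples, so a hundredfold increase in examples buys about one additional decimal digit of accuracy.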