Training with missing data#

Let us load a data set and have a look:

[2]:
import numpy as np
import pandas as pd

x_train = pd.read_csv("training_data_input.csv")
y_train = pd.read_csv("training_data_output.csv")

display(pd.concat([x_train, y_train], axis=1))
feature_0 feature_1 feature_2 feature_3 target_0 target_1 target_2 target_3
0 0.472986 NaN 0.242439 -1.700736 6.465163 1.247767 1.562335 0.563148
1 0.753143 -1.534721 NaN -0.120228 NaN 1.461133 1.682604 -1.078739
2 -0.806982 2.871819 NaN 0.472457 -2.545614 NaN -3.310312 0.749930
3 NaN NaN 1.342356 -0.122150 NaN 1.049569 0.504198 NaN
4 1.012515 -0.913869 -1.029530 1.209796 NaN NaN 1.132657 -0.651620
... ... ... ... ... ... ... ... ...
95 0.078516 -0.837245 1.094795 NaN 3.867749 1.255217 0.865133 NaN
96 0.959965 -1.167800 -0.334090 0.827424 0.544013 2.263673 NaN -0.057245
97 0.865017 -0.855405 0.071817 -1.125955 5.417294 1.349000 1.600092 0.322496
98 -0.206309 0.421580 NaN 1.481052 -3.566368 1.444973 -0.434093 -1.330253
99 0.495926 NaN -0.565377 -0.131805 1.555337 1.582580 0.622529 0.949257

100 rows × 8 columns

We can see that we have a data set with 4 features and 4 targets, and that there are many missing entries in both the features and the targets.

[3]:
print("missing entries in input data:", "{}%".format(int(np.round(x_train.isna().mean().mean()*100))))
print("missing entries in output data:", "{}%".format(int(np.round(y_train.isna().mean().mean()*100))))
missing entries in input data: 28%
missing entries in output data: 30%
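To see where these gaps are, we can also break the missing rates down per column (a small additional check on the same data frames as above):

# fraction of missing entries per column of the inputs and outputs
print(x_train.isna().mean())
print(y_train.isna().mean())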

We want to train a linear regression. What can we do about the missing entries?#

Method 1: Mean imputation#

Fill each missing entry with the mean value of its column

[4]:
x_train_imputed = x_train.fillna(value=x_train.mean())
y_train_imputed = y_train.fillna(value=y_train.mean())

display(pd.concat([x_train_imputed, y_train_imputed], axis=1))
feature_0 feature_1 feature_2 feature_3 target_0 target_1 target_2 target_3
0 0.472986 -0.071868 0.242439 -1.700736 6.465163 1.247767 1.562335 0.563148
1 0.753143 -1.534721 -0.183861 -0.120228 1.655926 1.461133 1.682604 -1.078739
2 -0.806982 2.871819 -0.183861 0.472457 -2.545614 1.605148 -3.310312 0.749930
3 0.064623 -0.071868 1.342356 -0.122150 1.655926 1.049569 0.504198 -0.312009
4 1.012515 -0.913869 -1.029530 1.209796 1.655926 1.605148 1.132657 -0.651620
... ... ... ... ... ... ... ... ...
95 0.078516 -0.837245 1.094795 -0.093939 3.867749 1.255217 0.865133 -0.312009
96 0.959965 -1.167800 -0.334090 0.827424 0.544013 2.263673 0.136459 -0.057245
97 0.865017 -0.855405 0.071817 -1.125955 5.417294 1.349000 1.600092 0.322496
98 -0.206309 0.421580 -0.183861 1.481052 -3.566368 1.444973 -0.434093 -1.330253
99 0.495926 -0.071868 -0.565377 -0.131805 1.555337 1.582580 0.622529 0.949257

100 rows × 8 columns

and train a regressor with the imputed data

[5]:
from sklearn.linear_model import LinearRegression
imputed_linear_regression = LinearRegression()
imputed_linear_regression.fit(x_train_imputed, y_train_imputed)
[5]:
LinearRegression()
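For comparison, the same mean imputation of the inputs could also be expressed with scikit-learn's SimpleImputer inside a pipeline. This is only a minimal sketch (pipeline_regression is introduced here purely for illustration and is not used in the comparison below); note that the targets still have to be imputed separately, since sklearn pipelines only transform the features:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# SimpleImputer(strategy="mean") replaces each missing input entry with its
# column mean, just like the fillna call above, before fitting the regression.
pipeline_regression = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
pipeline_regression.fit(x_train, y_train_imputed)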

Method 2: Deletion#

Remove all examples that have missing entries

[6]:
joint_dropped = pd.concat([x_train, y_train], axis=1).dropna(how="any")
x_train_dropped = joint_dropped[x_train.columns]
y_train_dropped = joint_dropped[y_train.columns]

display(pd.concat([x_train_dropped, y_train_dropped], axis=1))
feature_0 feature_1 feature_2 feature_3 target_0 target_1 target_2 target_3
50 0.507523 -0.618371 0.790793 -0.834405 3.230641 0.885857 0.816932 -0.239825
54 -0.809670 0.500495 -0.193510 -0.664203 0.784997 1.141402 -1.039911 -0.013783
67 -1.556314 -0.693315 1.624609 -0.120666 -2.247848 0.721211 -0.982030 -4.287106
68 -2.348582 0.167257 1.699965 1.168899 -7.552532 0.995456 -3.366157 -5.583932
70 -0.488821 1.632122 -0.401225 1.009360 -3.450892 1.750701 -2.670395 -0.953182
92 -1.160888 -0.579329 0.279841 -0.409602 0.251455 1.572128 -0.305981 -1.609754
97 0.865017 -0.855405 0.071817 -1.125955 5.417294 1.349000 1.600092 0.322496

and train a regressor with the remaining data

[7]:
dropped_linear_regression = LinearRegression()
dropped_linear_regression.fit(x_train_dropped, y_train_dropped)
[7]:
LinearRegression()
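With roughly 28% of the input entries and 30% of the output entries missing, a row only survives the dropna if all 8 of its entries happen to be present, so very little training data remains (only 7 of the 100 rows above). A quick sanity check of that expectation, assuming entries go missing independently:

# expected fraction of rows without any missing entry
p_x = x_train.isna().mean().mean()
p_y = y_train.isna().mean().mean()
expected_complete = (1 - p_x) ** x_train.shape[1] * (1 - p_y) ** y_train.shape[1]
print("expected fraction of complete rows: {:.1%}".format(expected_complete))
print("actual number of complete rows:", len(x_train_dropped))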

Method 3: Bayesian model#

A Bayesian model can simply treat the missing entries as additional unknowns and infer them alongside the model parameters

[8]:
import halerium.core as hal
from halerium.core.regression import connect_via_regression

g = hal.Graph("g")
with g:
    x = hal.Variable("x", shape=(4,), mean=0, variance=1)
    y = hal.Variable("y", shape=(4,), variance=0.1)
    connect_via_regression("reg", inputs=[x], outputs=[y], order=1)

# run this to show the graph in the online platform
# hal.show(g)

bayesian_train_model = hal.get_posterior_model(g, data={g.x: x_train, g.y: y_train}, method="MAP")
bayesian_post_graph = bayesian_train_model.get_posterior_graph()

The Bayesian model actually calculates an estimate (or rather, a probability distribution) for each missing entry

[9]:
x_train_bayesian_imputed = bayesian_train_model.get_means(g.x)
x_train_bayesian_imputed = pd.DataFrame(data=x_train_bayesian_imputed, columns=x_train.columns)

from plots import display_side_by_side
display_side_by_side(x_train, x_train_bayesian_imputed)
x_train (original, with missing entries):

feature_0 feature_1 feature_2 feature_3
0 0.472986 NaN 0.242439 -1.700736
1 0.753143 -1.534721 NaN -0.120228
2 -0.806982 2.871819 NaN 0.472457
3 NaN NaN 1.342356 -0.122150
4 1.012515 -0.913869 -1.029530 1.209796
... ... ... ... ...
95 0.078516 -0.837245 1.094795 NaN
96 0.959965 -1.167800 -0.334090 0.827424
97 0.865017 -0.855405 0.071817 -1.125955
98 -0.206309 0.421580 NaN 1.481052
99 0.495926 NaN -0.565377 -0.131805

100 rows × 4 columns

x_train_bayesian_imputed (posterior mean estimates):

feature_0 feature_1 feature_2 feature_3
0 0.472986 -0.759207 0.242439 -1.700736
1 0.753143 -1.534721 0.360400 -0.120228
2 -0.806982 2.871819 -0.901917 0.472457
3 0.533772 -1.179473 1.342356 -0.122150
4 1.012515 -0.913869 -1.029530 1.209796
... ... ... ... ...
95 0.078516 -0.837245 1.094795 -1.177459
96 0.959965 -1.167800 -0.334090 0.827424
97 0.865017 -0.855405 0.071817 -1.125955
98 -0.206309 0.421580 -0.704793 1.481052
99 0.495926 0.324680 -0.565377 -0.131805

100 rows × 4 columns
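Note that observed entries are passed through unchanged (compare, e.g., row 0, feature_0 in both tables); only the NaNs are replaced by posterior estimates. To look at just the filled-in values, we can mask the imputed frame with the original missingness pattern:

# keep only the cells that were NaN in the original inputs;
# everything that was actually observed is blanked out
bayesian_filled_only = x_train_bayesian_imputed.where(x_train.isna())
display(bayesian_filled_only.head())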

Compare the performance on test data#

[10]:
x_test = pd.read_csv("testing_data_input.csv").values
y_test = pd.read_csv("testing_data_output.csv").values

imputed_prediction = imputed_linear_regression.predict(x_test)

dropped_prediction = dropped_linear_regression.predict(x_test)

bayesian_prediction_model = hal.get_generative_model(bayesian_post_graph, data={g.x: x_test})
bayesian_prediction = bayesian_prediction_model.get_means(g.y)
[11]:
import matplotlib.pyplot as pl

pl.figure(figsize=(12, 12))

dark_erium_green = '#00b34a'
erium_blue = '#002a43'

ax = pl.subplot(2,2,1)
ax.set_aspect("equal")
ax.scatter(y_test[:,0], imputed_prediction[:,0], color=dark_erium_green)
ax.plot([np.min(y_test[:,0]),np.max(y_test[:,0])], [np.min(y_test[:,0]),np.max(y_test[:,0])], ls=":", color="k")
ax.set_xlabel("real output value")
ax.set_ylabel("predicted output value")
ax.set_title("training with imputed data")

ax = pl.subplot(2,2,2)
ax.set_aspect("equal")
ax.scatter(y_test[:,0], dropped_prediction[:,0], color=dark_erium_green)
ax.plot([np.min(y_test[:,0]),np.max(y_test[:,0])], [np.min(y_test[:,0]),np.max(y_test[:,0])], ls=":", color="k")
ax.set_xlabel("real output value")
ax.set_ylabel("predicted output value")
ax.set_title("training with missing rows dropped")

ax = pl.subplot(2,1,2)
ax.set_aspect("equal")
ax.scatter(y_test[:,0], bayesian_prediction[:,0], color=dark_erium_green)
ax.plot([np.min(y_test[:,0]),np.max(y_test[:,0])], [np.min(y_test[:,0]),np.max(y_test[:,0])], ls=":", color="k")
ax.set_xlabel("real output value")
ax.set_ylabel("predicted output value")
ax.set_title("Bayesian model")
[11]:
Text(0.5, 1.0, 'Bayesian model')
../../_images/examples_03_why_care_03_training_with_missing_data_20_1.png
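Beyond the visual comparison, we could also quantify the test error of the three approaches, for instance with the mean squared error. This is only a sketch using the arrays defined above; any missing entries in the test outputs are masked out first:

# compare the three predictions numerically on the test set,
# ignoring any NaN entries in the test outputs
mask = ~np.isnan(y_test)
for name, prediction in [("mean imputation", imputed_prediction),
                         ("dropped rows", dropped_prediction),
                         ("Bayesian model", bayesian_prediction)]:
    prediction = np.asarray(prediction)
    mse = np.mean((prediction[mask] - y_test[mask]) ** 2)
    print("{}: test MSE = {:.3f}".format(name, mse))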

As the plots show, Bayesian models offer clear advantages when dealing with missing data: unlike mean imputation they do not distort the training data, and unlike deletion they do not discard the vast majority of the examples. Missing data occur frequently in industrial environments, e.g. when a sensor output could not be recorded or was corrupted.
