
I am training a Gaussian Process to learn the mapping between a set of coordinates x, y, z and some time series. In a nutshell, my question is about how to prevent my GP from overfitting, which I am facing to a severe degree.

Some details:

  • my training set consists of 1500 samples and my testing set of 500 samples; each sample has 20 time components;

  • I have no strong preference for which kernel to use for the GP, and I would appreciate help understanding which one might work better. Furthermore, I have very little experience with GPs in general, so I am not sure how well I am handling the hyperparameters. See below for how I set length_scale: I set it this way following some advice, but I wonder whether it makes sense;

  • my coordinates are standardized (mean 0, std 1), but my time series are not;

  • I am training one Gaussian Process for each time component.

Here is my code:


import numpy as np
from matplotlib import pyplot as plt

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, Matern, RationalQuadratic, ExpSineSquared, DotProduct, ConstantKernel)                      
# ----------------------------------------------------------------------
number_of_training_samples = 1500
number_of_testing_samples = 500

# read coordinates STANDARDIZED
coords_training_stand = np.loadtxt('coordinates_training_standardized.txt')
coords_testing_stand = np.loadtxt('coordinates_testing_standardized.txt')

# read time series TRAIN/TEST 
timeseries_training = np.loadtxt('timeseries_training.txt')
timeseries_testing = np.loadtxt('timeseries_testing.txt')
number_of_time_components = np.shape(timeseries_training)[1] # 20

# Instantiate a Gaussian Process model
kernel = 1.0 * Matern(nu=1.5, length_scale=np.ones(coords_training_stand.shape[1]))
gp = GaussianProcessRegressor(kernel=kernel)

# placeholder for predictions
pred_timeseries_training = np.zeros(timeseries_training.shape)
pred_timeseries_testing = np.zeros(timeseries_testing.shape)

for i in range(number_of_time_components):
    print("time component", i)

    gp.fit(coords_training_stand, timeseries_training[:,i])

    y_pred, sigma = gp.predict(coords_training_stand, return_std=True)
    y_pred_test, sigma_test = gp.predict(coords_testing_stand, return_std=True)

    pred_timeseries_training[:,i] = y_pred
    pred_timeseries_testing[:,i] = y_pred_test

# plot training
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
        ax[i].plot(timeseries_training[100*i, :20], color='blue', label='Original train') 
        ax[i].plot(pred_timeseries_training[100*i], color='black', label='GP pred train')
        ax[i].set_xlabel('Time components', fontsize='x-large')
        ax[i].set_ylabel('Amplitude', fontsize='x-large')
        ax[i].set_title('Time series n. {:}'.format(100*i+1), fontsize='x-large')
        ax[i].legend(fontsize='x-large')
plt.subplots_adjust(hspace=1)
plt.show()
plt.close()

# plot testing
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
        ax[i].plot(timeseries_testing[100*i, :20], color='blue', label='Original test')
        ax[i].plot(pred_timeseries_testing[100*i], color='black', label='GP pred test')
        ax[i].set_xlabel('Time components', fontsize='x-large')
        ax[i].set_ylabel('Amplitude', fontsize='x-large')
        ax[i].set_title('Time series n. {:}'.format(1500+100*i+1), fontsize='x-large')
        ax[i].legend(fontsize='x-large')
plt.subplots_adjust(hspace=1)
plt.show()
plt.close()

Here is a plot of a few samples from the TRAINING set with the corresponding GP predictions (you can't even see the blue lines for the original samples, because they are perfectly covered by the GP predictions):

[plot: training samples (blue) covered exactly by GP predictions (black)]

Here is a plot of a few samples from the TESTING set with the corresponding GP predictions:

[plot: testing samples (blue) vs. GP predictions (black)]

(In only one case, sample 1801, is the prediction good.) I think there is very strong overfitting going on, and I would like to understand how to avoid it.

1 Answer

I don't think the problem is with the Gaussian Process itself but with the dataset.

How were the time series samples generated? And how did you divide the dataset into training and test sets?

If you took one big time series and then cut it into small overlapping sequences, there are not enough truly distinct examples for the model to learn from, and you can get severe overfitting.

Explanation with an example:

I have one big time series t0, t1, t2, t3, ..., t99

I make a training dataset of 81 samples: [t0,...,t19], [t1,...,t20], [t2,...,t21], ..., [t80,...,t99]

In this case all my samples are almost exactly the same, which causes overfitting. And if the validation set is composed of random samples taken from this dataset, I'll get very high validation accuracy, because the model has already seen almost exactly the same thing during training. (I think that is what may have happened for sample 1801 in your example.)
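To make the overlap concrete, here is a small, purely hypothetical sketch (the series and window length are made up for illustration, not taken from your data):

```python
import numpy as np

# Hypothetical illustration: slice one long series of 100 points
# into overlapping windows of length 20.
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=100))   # one long series t0 ... t99

window = 20
samples = np.stack([series[i:i + window]
                    for i in range(len(series) - window + 1)])
print(samples.shape)   # (81, 20)

# Consecutive windows share 19 of their 20 values, so the "samples"
# are nearly identical and a model can simply memorize them.
print(np.array_equal(samples[0][1:], samples[1][:-1]))   # True
```

Any random train/validation split of such windows puts near-duplicates on both sides of the split, which is exactly the leakage described above.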

So make sure all the samples in your datasets are completely independent.
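If the data really does come from one long series, one way to enforce that is a contiguous (block-wise) split: cut the underlying series first, then window each part separately. A hypothetical sketch, continuing the made-up example above:

```python
import numpy as np

# Hypothetical sketch: split the underlying series first, then window,
# so no test window shares time steps with any training window.
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=100))
window = 20

split = 70  # first 70 time steps for training, the rest for testing
train_windows = np.stack([series[i:i + window]
                          for i in range(split - window + 1)])
test_windows = np.stack([series[i:i + window]
                         for i in range(split, len(series) - window + 1)])

print(train_windows.shape, test_windows.shape)   # (51, 20) (11, 20)
# The last training window ends at t69 and the first test window
# starts at t70, so train and test never see the same time step.
```

scikit-learn's TimeSeriesSplit implements the same idea for cross-validation, if you want folds rather than a single split.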