I am training a Gaussian Process to learn the mapping between a set of coordinates x, y, z and some time series. In a nutshell, my question is about how to prevent my GP from overfitting, which it is doing to an odd degree.
Some details:
my training set is made of 1500 samples and my testing set of 500 samples; each sample's time series has 20 time components;
I don't have a preference for which kernel to use, and I would appreciate help in understanding which one could work best. Furthermore, I have very little experience with GPs in general, hence I am not sure how well I am choosing the hyperparameters. See below for how I set the length_scale: I did it following some advice, but I am wondering if it makes sense;
my coordinates are standardized (mean 0, std 1), but my time series are not;
I am training one Gaussian Process for each time component.
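For reference, my understanding is that passing a vector as length_scale gives one scale per input coordinate (automatic relevance determination), and the fitted values can be inspected after training. A minimal sketch of that on synthetic data (all names and shapes here are hypothetical, not my real setup):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 3))   # standardized coordinates, 3 input dims
# target depends only on the first coordinate
y = np.sin(2 * X[:, 0]) + 0.05 * rng.standard_normal(80)

# one length scale per input dimension (ARD)
kernel = 1.0 * Matern(nu=1.5, length_scale=np.ones(3))
gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)

# after fitting, gp.kernel_ holds the optimized hyperparameters;
# a large learned length scale means the GP treats that input as irrelevant
print(gp.kernel_.k2.length_scale)
```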
Here is my code:
from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, Matern, RationalQuadratic, ExpSineSquared, DotProduct, ConstantKernel)
# ----------------------------------------------------------------------
number_of_training_samples = 1500
number_of_testing_samples = 500
# read coordinates STANDARDIZED
coords_training_stand = np.loadtxt('coordinates_training_standardized.txt')
coords_testing_stand = np.loadtxt('coordinates_testing_standardized.txt')
# read time series TRAIN/TEST
timeseries_training = np.loadtxt('timeseries_training.txt')
timeseries_testing = np.loadtxt('timeseries_testing.txt')
number_of_time_components = np.shape(timeseries_training)[1] # 20
# Instantiate a Gaussian Process model
kernel = 1.0 * Matern(nu=1.5, length_scale=np.ones(coords_training_stand.shape[1]))
gp = GaussianProcessRegressor(kernel=kernel)
# placeholder for predictions
pred_timeseries_training = np.zeros(timeseries_training.shape)
pred_timeseries_testing = np.zeros(timeseries_testing.shape)
for i in range(number_of_time_components):
    print("time component", i)
    gp.fit(coords_training_stand, timeseries_training[:, i])
    y_pred, sigma = gp.predict(coords_training_stand, return_std=True)
    y_pred_test, sigma_test = gp.predict(coords_testing_stand, return_std=True)
    pred_timeseries_training[:, i] = y_pred
    pred_timeseries_testing[:, i] = y_pred_test
# plot training
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
    ax[i].plot(timeseries_training[100*i, :20], color='blue', label='Original train')
    ax[i].plot(pred_timeseries_training[100*i], color='black', label='GP pred train')
    ax[i].set_xlabel('Time components', fontsize='x-large')
    ax[i].set_ylabel('Amplitude', fontsize='x-large')
    ax[i].set_title('Time series n. {:}'.format(100*i + 1), fontsize='x-large')
    ax[i].legend(fontsize='x-large')
plt.subplots_adjust(hspace=1)
plt.show()
plt.close()
# plot testing
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
    ax[i].plot(timeseries_testing[100*i, :20], color='blue', label='Original test')
    ax[i].plot(pred_timeseries_testing[100*i], color='black', label='GP pred test')
    ax[i].set_xlabel('Time components', fontsize='x-large')
    ax[i].set_ylabel('Amplitude', fontsize='x-large')
    ax[i].set_title('Time series n. {:}'.format(1500 + 100*i + 1), fontsize='x-large')
    ax[i].legend(fontsize='x-large')
plt.subplots_adjust(hspace=1)
plt.show()
plt.close()
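To make the train/test gap concrete (beyond eyeballing the plots), a per-component RMSE comparison could be computed from the prediction arrays above. Here is a sketch with random stand-in data of the same shapes (the arrays here are synthetic, not my real data):

```python
import numpy as np

# synthetic stand-ins with the same shapes as my real arrays
rng = np.random.default_rng(1)
timeseries_training = rng.standard_normal((1500, 20))
pred_timeseries_training = timeseries_training + 0.01 * rng.standard_normal((1500, 20))
timeseries_testing = rng.standard_normal((500, 20))
pred_timeseries_testing = timeseries_testing + 0.5 * rng.standard_normal((500, 20))

# RMSE per time component: near-zero on train with large test values
# would confirm overfitting numerically
rmse_train = np.sqrt(np.mean((timeseries_training - pred_timeseries_training) ** 2, axis=0))
rmse_test = np.sqrt(np.mean((timeseries_testing - pred_timeseries_testing) ** 2, axis=0))
print("train RMSE per component:", rmse_train)
print("test RMSE per component:", rmse_test)
```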
Here is the plot of a few samples from the TRAINING set and the corresponding GP predictions (one can't even see the blue lines, corresponding to the original samples, because they are perfectly covered by the GP predictions):
Here is the plot of a few samples from the TESTING set and the corresponding GP predictions:
(only in one case - n. 1801 - is the prediction good). I think there is very strong overfitting going on, and I would like to understand how to avoid it.
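One thing I have read about but not yet verified is adding an explicit noise term (WhiteKernel) so that the GP is not forced to interpolate every training point exactly, plus restarting the hyperparameter optimizer. A minimal sketch on synthetic data (the data, shapes, and settings here are hypothetical placeholders, not my real setup):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 3))  # standardized coordinates
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(100)

# WhiteKernel lets the model absorb part of the signal as observation
# noise instead of fitting the training points exactly
kernel = 1.0 * Matern(nu=1.5, length_scale=np.ones(3)) \
    + WhiteKernel(noise_level=1.0, noise_level_bounds=(1e-8, 1e2))
gp = GaussianProcessRegressor(kernel=kernel,
                              n_restarts_optimizer=5,
                              normalize_y=True)  # handles unstandardized targets
gp.fit(X_train, y_train)

# the fitted kernel shows how much variance was absorbed as noise
print(gp.kernel_)
```

Would something like this be the right direction to control the overfitting, or is the kernel choice itself the problem?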

