3 votes

I am struggling to understand the following:

Scikit-learn offers a multiple output version for Ridge Regression, simply by handing over a 2D array [n_samples, n_targets], but how is it implemented?

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

Is it correct to assume that each regression for each target is independent? Under these circumstances, how can I adapt this to use an individual alpha regularization parameter for each regression? If I use GridSearchCV, would I have to hand over a matrix of possible regularization parameters, or how would that work?

Thanks in advance - I have been searching for hours but could not find anything on this topic.

From the documentation, it's written that it has inbuilt support, so maybe they are not independent (maybe independent for some solvers but not all). You should ask this on the scikit-learn mailing list or on GitHub. – Vivek Kumar
Thanks, I have subscribed to the list and emailed them. If anyone knows what is going on, any additional help is highly appreciated! – user2933782

1 Answer

5 votes

I will give this a shot since I have been looking into this a bit for my own work. I'll break the question down into parts so you can look at just the ones that interest you:

Q1: Is the regression for each target (aka output) in multiple output Ridge regression independent?

A1: I think that the typical multiple output linear regression with M outputs is the same as M independent single output linear regressions. I think this is the case because the ordinary least squares expression for the multiple output case is the same as the expression for (the sum of) M independent, single output cases. To motivate this, let's consider a silly, bivariate output case without regularisation.

Let's consider two column vector inputs x_1 and x_2, with corresponding weight vectors w_1 and w_2.

These give us two univariate outputs, y_1 = w_1^T x_1 + e_1 and y_2 = w_2^T x_2 + e_2, where the e's are independent errors.

The sum of the squared errors is written as:

e_1^2 + e_2^2 = (y_1 - w_1^T x_1)^2 + (y_2 - w_2^T x_2)^2

We can see that this is just the sum of the squared errors of the two independent regressions. Now, to optimise we differentiate with respect to the weights and set the derivatives to zero. Since y_1 does not depend on w_2, and vice versa for y_2 and w_1, the optimisation can be carried out independently for each target.

I considered one sample here for illustration, but nothing much changes with more samples; you can write it out for yourself. Adding a penalty term in the form |w_1| or |w_2| (or their squares, as in Ridge) also does not change this, since there is still no dependence of w_2 on y_1, or vice versa for y_2 and w_1.
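For completeness, here is the same argument written out for the general case, in the notation of the code below (just a sketch, with X the T x K design matrix, w_m the K-dimensional weight vector for output m, y_m the m-th column of Y, and alpha_m the per-target penalty):

\sum_{m=1}^{M} \left( \lVert y_m - X w_m \rVert^2 + \alpha_m \lVert w_m \rVert^2 \right) \;=\; \lVert Y - X W \rVert_F^2 + \sum_{m=1}^{M} \alpha_m \lVert w_m \rVert^2

Each term on the left involves only w_m, so minimising the full multiple output objective is the same as minimising M single output Ridge objectives separately.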

OK, so that's the kind of proof that will get you a C- (with a generous professor at that). Insofar as this is reflective of sklearn, manually implementing independent regressions and using the inbuilt multiple output support should give the same result. So let's check this with some code (I'm using py2.7 with Jupyter):

Things we'll need

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, model_selection

Set up the data

## set up some test data
# T samples, K features, M outputs (aka targets)
T = 1000
K = 100
M = 500

#get the samples from independent, multivariate normal
means_X = np.zeros(K)
cov_X = np.identity(K) 
X = np.random.multivariate_normal(means_X,cov_X,T)

#Make up some random weights. 
#Here I use an exponential form since that means some would be quite small, and thus regularization is likely to help
#However for the purposes of the example it doesn't really matter

#exponential weights
W = 2.0 ** np.random.randint(-10,0,M * K) 

#shape into a weight matrix correctly
W = W.reshape((K,M))

# get the output - apply linear transformation
Y = np.matmul(X, W)

# add a bit of noise to the output
noise_level = 0.1
noise_means = np.zeros(M)
noise_cov = np.identity(M) 

Y_nse = Y + noise_level * np.random.multivariate_normal(noise_means,noise_cov,T)

# Start with one alpha value for all targets 
alpha = 1

Use sklearn's built-in multiple output support

#%%timeit -n 1 -r 1
# you can uncomment the above to get timing, but note that it runs in a separate session so
# the results won't be available here 
## use built in MIMO support 

built_in_MIMO = linear_model.Ridge(alpha = alpha)
built_in_MIMO.fit(X, Y_nse)

Run the optimisation independently for the outputs

# %%timeit -n 1 -r 1 -o
## manual mimo
manual_MIMO_coefs = np.zeros((K,M))

for output_index in range(M):
    
    manual_MIMO = linear_model.Ridge(alpha = alpha)
    manual_MIMO.fit(X, Y_nse[:,output_index]) 
    manual_MIMO_coefs[:,output_index] = manual_MIMO.coef_

Compare the estimates (plots not included)

manual_MIMO_coefs_T = manual_MIMO_coefs.T

## check the weights. Plot a couple
check_these_weights = [0, 10]
plt.plot(built_in_MIMO.coef_[check_these_weights[0],:],'r')
plt.plot(manual_MIMO_coefs_T[check_these_weights[0],:], 'k--')

# plt.plot(built_in_MIMO.coef_[check_these_weights[1],:],'b')
# plt.plot(manual_MIMO_coefs_T[check_these_weights[1],:], 'y--')

plt.gca().set(xlabel = 'weight index', ylabel = 'weight value' )
plt.show()

print('Average diff between manual and built in weights is %f ' % ((built_in_MIMO.coef_.flatten()-manual_MIMO_coefs_T.flatten()) ** 2).mean())


## FYI, our estimates are pretty close to the original weights too,
plt.clf()
plt.plot(built_in_MIMO.coef_[check_these_weights[1],:],'b')
plt.plot(W[:,check_these_weights[1]], 'y--')
plt.gca().set(xlabel = 'weight index', ylabel = 'weight value' )
plt.legend(['Estimated', 'True'])

plt.show()

print('Average diff between built in and true weights is %f ' % ((built_in_MIMO.coef_.T.flatten()-W.flatten()) ** 2).mean())

The output is (not including plots here)

Average diff between manual and built in weights is 0.000000 

Average diff between built in and true weights is 0.000011 

So we see that the inbuilt sklearn estimation is the same as our manual one. However, the inbuilt one is much faster, since it uses matrix algebra to solve the whole thing once, as opposed to the loop I used here (for the matrix formulation of Ridge regularisation, see the Wikipedia article on Tikhonov regularization). You can check this for yourself by uncommenting the %%timeit magics above.
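If you want to see the "solve the whole thing once" idea concretely, here is a minimal numpy sketch of the textbook Tikhonov closed form for a single shared alpha (this is just the textbook formula, not sklearn's actual code path, and it ignores the intercept, so I refit with fit_intercept=False to make the comparison exact):

# closed-form Tikhonov solution, all M targets in one linear solve: (X'X + alpha*I) W = X'Y
closed_form_W = np.linalg.solve(np.matmul(X.T, X) + alpha * np.identity(K),
                                np.matmul(X.T, Y_nse))

# refit without an intercept so it matches the plain closed form
no_intercept_MIMO = linear_model.Ridge(alpha = alpha, fit_intercept = False)
no_intercept_MIMO.fit(X, Y_nse)

print('Average diff between closed form and sklearn weights is %f ' % ((closed_form_W - no_intercept_MIMO.coef_.T) ** 2).mean())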

Q2: How can we use individual alpha regularization parameters for each regression?

A2: sklearn Ridge accepts a different regularization for each output (target). For example, continuing the code above, to use a different alpha for each output:

# now try different alphas for each target.
# Simply pick random exponents between min and max, so the alphas cover a wide (logarithmic) range
min_alpha = 0
max_alpha = 50
alphas = 2.0 ** np.random.randint(min_alpha,max_alpha,M)
built_in_MIMO = linear_model.Ridge(alpha = alphas)
built_in_MIMO.fit(X, Y_nse) 

If you compare this with a manual implementation of M independent regressions each with its own alpha:

manual_MIMO_coefs = np.zeros((K,M))

for output_index in range(M):
    
    manual_MIMO = linear_model.Ridge(alpha = alphas[output_index])
    manual_MIMO.fit(X, Y_nse[:,output_index]) 
    manual_MIMO_coefs[:,output_index] = manual_MIMO.coef_

You get the same result:

manual_MIMO_coefs_T = manual_MIMO_coefs.T

## check the weights. 
print('Average diff between manual and built in weights is %f ' % ((built_in_MIMO.coef_.flatten()-manual_MIMO_coefs_T.flatten()) ** 2).mean())

Average diff between manual and built in weights is 0.000000 

So these are the same.

However, in this case the performance depends greatly on the solver (as @Vivek Kumar sort of intuited in the comments).

By default, Ridge.fit() uses a Cholesky factorization (at least for non-sparse data). Looking at the code for that on GitHub (_solve_cholesky in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/ridge.py), I see that when the alphas are chosen separately for each target, sklearn actually does fit each target separately. I don't know if this is inherent to Cholesky or just an implementation choice (I have a feeling it's the latter).
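For illustration, that "fit them separately" behaviour corresponds to solving one regularised normal-equations system per output, roughly like this (my own numpy sketch, with np.linalg.solve standing in for the Cholesky factorization and the intercept ignored):

# sketch: one (X'X + alpha_m * I) system per output, mirroring the per-target code path
XtX = np.matmul(X.T, X)
XtY = np.matmul(X.T, Y_nse)
per_target_W = np.zeros((K, M))
for m in range(M):
    per_target_W[:, m] = np.linalg.solve(XtX + alphas[m] * np.identity(K), XtY[:, m])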

However, for other solvers, for example the SVD-based one (_solve_svd()), the code seems to already incorporate the different alphas into the matrix formulation of the problem, so the whole thing is only run once. This means that when the alphas are chosen separately for each output, and when there are many outputs, the SVD solver can be much, much faster.
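By contrast, a single SVD of X can serve every alpha at once, because only the diagonal shrinkage factors change from target to target. Here is a rough numpy sketch of that standard ridge-via-SVD algebra (my own illustration, not a copy of _solve_svd, and again without the intercept); you can also ask sklearn for this path explicitly with linear_model.Ridge(alpha = alphas, solver = 'svd'):

# sketch: one SVD of X, then per-target shrinkage factors s / (s^2 + alpha_m)
U, s, Vt = np.linalg.svd(X, full_matrices = False)        # X = U * diag(s) * Vt
UtY = np.matmul(U.T, Y_nse)                                # shape (K, M)
shrink = s[:, None] / (s[:, None] ** 2 + alphas[None, :])  # shape (K, M), one column per alpha
svd_W = np.matmul(Vt.T, shrink * UtY)                      # shape (K, M)

print('Average diff between SVD and per-target weights is %f ' % ((svd_W - per_target_W) ** 2).mean())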

Q3: How do I use GridSearchCV? Do I hand over a matrix of possible regularization parameters?

A3: I haven't used the inbuilt grid search because it didn't quite fit my problem. However, with the above explained, it is straightforward to implement this yourself: just get some CV folds using sklearn.model_selection.KFold() or similar, then train each fold with the different alphas:

from sklearn import metrics, model_selection
# just two folds for now
n_splits = 2
#logarithmic grid
alphas = 2.0 ** np.arange(0,10) 
kf = model_selection.KFold(n_splits=n_splits)

# returns the number of splits; the folds themselves are generated by kf.split(X) below
kf.get_n_splits(X)

# we will keep track of the performance of each alpha here
scores = np.zeros((n_splits,alphas.shape[0],M))

#loop over alphas and folds
for j,(train_index, test_index) in enumerate(kf.split(X)):
    
    for i,alpha in enumerate(alphas):
        
        cv_MIMO = linear_model.Ridge(alpha = alpha)
        cv_MIMO.fit(X[train_index,:], Y_nse[train_index,:]) 
        cv_preds = cv_MIMO.predict(X[test_index,:])
        scores[j,i,:] = metrics.r2_score(Y_nse[test_index,:], cv_preds, multioutput='raw_values')

## mean CV score  
mean_CV_score = scores.mean(axis = 0)
# best alpha for each target
best_alpha_for_target = alphas[np.argmax(mean_CV_score,axis = 0)]

I wrote this one fairly hurriedly, so check it carefully. Note that we need to use the metrics module, since the inbuilt Ridge.score() averages over all the targets, which we don't want here. By using the multioutput='raw_values' option we keep the raw score for each target.
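Once you have best_alpha_for_target, you can feed it straight back into the per-target support from A2 to fit the final model on all the data (just a usage sketch):

# refit on all the data with the best alpha found for each target
final_MIMO = linear_model.Ridge(alpha = best_alpha_for_target)
final_MIMO.fit(X, Y_nse)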

Hope that helps!