1 vote

I am trying to implement Ridge regression in Python using stochastic gradient descent as the solver. My code for SGD is as follows:

def fit(self, X, Y):
    # Convert to data frame in case X is numpy matrix
    X = pd.DataFrame(X)

    # Prepend a column of 1s to the data for the intercept
    X.insert(0, 'intercept', np.array([1.0]*X.shape[0]))

    # Find dimensions of train
    m, d = X.shape

    # Initialize weights to random
    beta = self.initializeRandomWeights(d)
    beta_prev = None

    epochs = 0
    prev_error = None
    while (beta_prev is None or epochs < self.nb_epochs):
        print("## Epoch: " + str(epochs))
        indices = list(range(m))    # list() so random.shuffle() can shuffle in place (Python 3)
        shuffle(indices)
        for i in indices:   # Pick a training example from a randomly shuffled set
            beta_prev = beta
            xi = X.iloc[i]
            errori = sum(beta*xi) - Y[i]    # Error[i] = sum(beta*x) - y = error of ith training example
            gradient_vector = xi*errori + self.l*beta_prev
            beta = beta_prev - self.alpha*gradient_vector
        epochs += 1

The data I'm testing this on is not normalized, and my implementation always ends up with all the weights being infinity, even though I initialize the weight vector to small values. Only when I set the learning rate alpha to a very small value (~1e-8) does the algorithm end up with valid values for the weight vector.

My understanding is that normalizing/scaling the input features only helps reduce the convergence time, and that the algorithm should not fail to converge altogether just because the features are not normalized. Is my understanding correct?


2 Answers

1 vote

As scikit-learn's Stochastic Gradient Descent documentation points out, one of the disadvantages of the algorithm is that it is sensitive to feature scaling. In general, gradient-based optimization algorithms converge faster on normalized data.

Normalization is also advantageous for regularized regression methods such as ridge, because the penalty treats all coefficients the same regardless of the scale of the corresponding features.

The size of each coefficient update depends on the range of the corresponding feature, so features with large values produce large gradient steps. The regularization term is likewise affected heavily by large feature values.

SGD may still converge without normalization, but whether it does depends on the data at hand (and on the learning rate). So no, your assumption is not correct.
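
As a rough sketch of that point (not your exact implementation; the toy data, learning rate, and regularization strength below are made up), scikit-learn's SGDRegressor with penalty='l2' is essentially ridge regression fitted by SGD, and standardizing the features first is usually what keeps the weights finite at an ordinary learning rate:

    import numpy as np
    from sklearn.linear_model import SGDRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Toy data with two features on very different scales (values are arbitrary).
    rng = np.random.RandomState(0)
    X = np.column_stack([rng.uniform(0, 1, 200),      # feature in [0, 1]
                         rng.uniform(0, 1e4, 200)])   # feature in [0, 10000]
    y = 3.0 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(0, 0.1, 200)

    # penalty='l2' makes SGDRegressor minimize the ridge objective.
    # StandardScaler brings both features to zero mean / unit variance,
    # so the default learning-rate schedule works without blowing up.
    model = make_pipeline(
        StandardScaler(),
        SGDRegressor(penalty='l2', alpha=0.01, max_iter=1000)
    )
    model.fit(X, y)
    print(model.named_steps['sgdregressor'].coef_)

Fitting the same SGDRegressor on the raw features typically requires a drastically smaller eta0 to avoid overflow, which matches the behaviour you are describing.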

0 votes

Your assumption is not correct.

It's hard to answer this in general, because there are so many different methods and settings, but I will try to mention a few points.

Normalization

  • When a method is not scale-invariant (ridge regression, with its L2 penalty, is not), you really should normalize your data
    • I take it you are skipping this only for debugging/analysis
  • Normalizing your data is not only relevant for convergence time; the results will differ too, because features with large values contribute much more to the loss (and are treated differently by the penalty) than features with small values (see the sketch after this list)
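
To make that second point concrete, here is a small sketch (made-up data, arbitrary regularization strength) comparing ridge fits with and without standardization; because the L2 penalty shrinks coefficients regardless of feature scale, the two models make genuinely different predictions:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.preprocessing import StandardScaler

    rng = np.random.RandomState(0)
    X = np.column_stack([rng.normal(0, 1, 100),       # small-scale feature
                         rng.normal(0, 1000, 100)])   # large-scale feature
    y = X[:, 0] + 0.001 * X[:, 1] + rng.normal(0, 0.1, 100)

    scaler = StandardScaler().fit(X)

    # Same regularization strength, raw vs. standardized features.
    raw = Ridge(alpha=10.0).fit(X, y)
    std = Ridge(alpha=10.0).fit(scaler.transform(X), y)

    # The predictions differ, not just the coefficients: on raw data the
    # penalty shrinks the small-scale feature (which needs a large
    # coefficient) much harder than the large-scale one.
    print(raw.predict(X[:3]))
    print(std.predict(scaler.transform(X[:3])))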

Convergence

  • There is probably a lot to say about the convergence of different methods on normalized vs. non-normalized data, but your case is special:
    • SGD's convergence theory only guarantees convergence to a local minimum (which is the global minimum in your convex optimization problem) for suitable choices of hyper-parameters (learning rate and learning schedule/decay)
    • Even on normalized data, SGD can fail when those parameters are badly chosen!
      • This dependence on hyper-parameters is one of the most important downsides of SGD
    • Since SGD works with raw gradients and step sizes, non-normalized data can make those gradients so large that convergence is never reached (see the sketch below)
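
As a minimal sketch of that last point (it mirrors the per-sample update rule from your question, but the data, step size, and regularization constant are made up), the same SGD loop that overflows on raw features stays stable once the features are standardized:

    import numpy as np

    rng = np.random.RandomState(0)

    # Toy ridge data: an intercept column plus one feature on a huge scale.
    X = np.column_stack([np.ones(200),
                         rng.uniform(0, 1, 200),
                         rng.uniform(0, 1e4, 200)])
    y = X @ np.array([1.0, 3.0, 0.002]) + rng.normal(0, 0.1, 200)

    def sgd_ridge(X, y, lr, l2, epochs=50):
        # Same update as in the question: grad = x_i * error_i + l2 * beta.
        beta = rng.normal(0, 0.01, X.shape[1])
        for _ in range(epochs):
            for i in rng.permutation(len(y)):
                error_i = X[i] @ beta - y[i]
                beta = beta - lr * (X[i] * error_i + l2 * beta)
        return beta

    # Raw features: the large column makes the gradient explode at this step size.
    print(sgd_ridge(X, y, lr=1e-2, l2=0.1))          # ends up as nan/inf

    # Standardize the non-intercept columns and the same step size stays stable.
    Xs = X.copy()
    Xs[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)
    print(sgd_ridge(Xs, y, lr=1e-2, l2=0.1))

Adding a decaying learning rate (a schedule) would tighten convergence on the standardized data; on the raw data the step size has to be tiny from the very start, which matches your observation about alpha ≈ 1e-8.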