1 vote

I am trying to implement Ridge regression in Python using stochastic gradient descent as the solver. My code for SGD is as follows:

def fit(self, X, Y):
    # Convert to data frame in case X is numpy matrix
    X = pd.DataFrame(X)

    # Prepend a column of 1s to the data for the intercept
    X.insert(0, 'intercept', np.array([1.0]*X.shape[0]))

    # Find dimensions of train
    m, d = X.shape

    # Initialize weights to random
    beta = self.initializeRandomWeights(d)
    beta_prev = None

    epochs = 0
    prev_error = None
    while (beta_prev is None or epochs < self.nb_epochs):
        print("## Epoch: " + str(epochs))
        indices = list(range(m))    # list() so random.shuffle() can shuffle in place (Python 3)
        shuffle(indices)
        for i in indices:   # Pick a training example from a randomly shuffled set
            beta_prev = beta
            xi = X.iloc[i]
            errori = sum(beta*xi) - Y[i]    # Error[i] = sum(beta*x) - y = error of ith training example
            gradient_vector = xi*errori + self.l*beta_prev
            beta = beta_prev - self.alpha*gradient_vector
        epochs += 1

The data I'm testing this on is not normalized, and my implementation always ends up with all the weights being infinity, even though I initialize the weight vector to small values. Only when I set the learning rate alpha to a very small value (~1e-8) does the algorithm end up with valid values for the weight vector.

My understanding is that normalizing/scaling the input features only helps reduce the convergence time, and that the algorithm should not fail to converge altogether just because the features are not normalized. Is my understanding correct?


2 Answers

1 vote

As scikit-learn's Stochastic Gradient Descent documentation points out, one of the disadvantages of the algorithm is that it is sensitive to feature scaling. In general, gradient-based optimization algorithms converge faster on normalized data.

Normalization is also advantageous for regularized regression methods such as ridge, because the penalty treats all coefficients the same regardless of the scale of the corresponding features.

The size of each coefficient update depends on the range of the corresponding feature, so features with large values produce large gradient steps. The regularization term is likewise affected heavily by large feature values.

SGD may still converge without normalization, but whether it does depends on the data at hand (and on the learning rate). So no, your assumption is not correct.
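
As a rough sketch of that point (not your exact implementation; the toy data, learning rate, and regularization strength below are made up), scikit-learn's SGDRegressor with penalty='l2' is essentially ridge regression fitted by SGD, and standardizing the features first is usually what keeps the weights finite at an ordinary learning rate:

    import numpy as np
    from sklearn.linear_model import SGDRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Toy data with two features on very different scales (values are arbitrary).
    rng = np.random.RandomState(0)
    X = np.column_stack([rng.uniform(0, 1, 200),      # feature in [0, 1]
                         rng.uniform(0, 1e4, 200)])   # feature in [0, 10000]
    y = 3.0 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(0, 0.1, 200)

    # penalty='l2' makes SGDRegressor minimize the ridge objective.
    # StandardScaler brings both features to zero mean / unit variance,
    # so the default learning-rate schedule works without blowing up.
    model = make_pipeline(
        StandardScaler(),
        SGDRegressor(penalty='l2', alpha=0.01, max_iter=1000)
    )
    model.fit(X, y)
    print(model.named_steps['sgdregressor'].coef_)

Fitting the same SGDRegressor on the raw features typically requires a drastically smaller eta0 to avoid overflow, which matches the behaviour you are describing.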

0 votes

Your assumption is not correct.

It's hard to answer this in general, because there are so many different methods and settings, but I will try to mention a few points.

Normalization

  • When a method is not scale-invariant (ridge regression, with its L2 penalty, is not), you really should normalize your data
    • I take it you are skipping this only for debugging/analysis
  • Normalizing your data is not only relevant for convergence time; the results will differ too, because features with large values contribute much more to the loss (and are treated differently by the penalty) than features with small values (see the sketch after this list)
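
To make that second point concrete, here is a small sketch (made-up data, arbitrary regularization strength) comparing ridge fits with and without standardization; because the L2 penalty shrinks coefficients regardless of feature scale, the two models make genuinely different predictions:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.preprocessing import StandardScaler

    rng = np.random.RandomState(0)
    X = np.column_stack([rng.normal(0, 1, 100),       # small-scale feature
                         rng.normal(0, 1000, 100)])   # large-scale feature
    y = X[:, 0] + 0.001 * X[:, 1] + rng.normal(0, 0.1, 100)

    scaler = StandardScaler().fit(X)

    # Same regularization strength, raw vs. standardized features.
    raw = Ridge(alpha=10.0).fit(X, y)
    std = Ridge(alpha=10.0).fit(scaler.transform(X), y)

    # The predictions differ, not just the coefficients: on raw data the
    # penalty shrinks the small-scale feature (which needs a large
    # coefficient) much harder than the large-scale one.
    print(raw.predict(X[:3]))
    print(std.predict(scaler.transform(X[:3])))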

Convergence

  • There is probably a lot to say about the convergence of different methods on normalized vs. non-normalized data, but your case is special:
    • SGD's convergence theory only guarantees convergence to a local minimum (which is the global minimum in your convex optimization problem) for suitable choices of hyper-parameters (learning rate and learning schedule/decay)
    • Even on normalized data, SGD can fail when those parameters are badly chosen!
      • This dependence on hyper-parameters is one of the most important downsides of SGD
    • Since SGD works with raw gradients and step sizes, non-normalized data can make those gradients so large that convergence is never reached (see the sketch below)
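
As a minimal sketch of that last point (it mirrors the per-sample update rule from your question, but the data, step size, and regularization constant are made up), the same SGD loop that overflows on raw features stays stable once the features are standardized:

    import numpy as np

    rng = np.random.RandomState(0)

    # Toy ridge data: an intercept column plus one feature on a huge scale.
    X = np.column_stack([np.ones(200),
                         rng.uniform(0, 1, 200),
                         rng.uniform(0, 1e4, 200)])
    y = X @ np.array([1.0, 3.0, 0.002]) + rng.normal(0, 0.1, 200)

    def sgd_ridge(X, y, lr, l2, epochs=50):
        # Same update as in the question: grad = x_i * error_i + l2 * beta.
        beta = rng.normal(0, 0.01, X.shape[1])
        for _ in range(epochs):
            for i in rng.permutation(len(y)):
                error_i = X[i] @ beta - y[i]
                beta = beta - lr * (X[i] * error_i + l2 * beta)
        return beta

    # Raw features: the large column makes the gradient explode at this step size.
    print(sgd_ridge(X, y, lr=1e-2, l2=0.1))          # ends up as nan/inf

    # Standardize the non-intercept columns and the same step size stays stable.
    Xs = X.copy()
    Xs[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)
    print(sgd_ridge(Xs, y, lr=1e-2, l2=0.1))

Adding a decaying learning rate (a schedule) would tighten convergence on the standardized data; on the raw data the step size has to be tiny from the very start, which matches your observation about alpha ≈ 1e-8.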