
I have used linear regression through ML packages in Python, but for my own satisfaction I coded it from scratch. The loss starts at around 0.90 and keeps increasing (the model is not learning) for some reason. I do not understand what mistake I have made.

  1. Standardise the dataset as part of preprocessing
  2. Initialise the weight matrix with the MLE estimate for parameter W, i.e. (X^TX)^-1X^TY
  3. Compute the output
  4. Calculate the gradient of the SSE (sum of squared errors) loss with respect to the weights W and the bias B
  5. Use the gradients to update the parameters using gradient descent.


    import preprocess as pre
    import numpy as np
    import matplotlib.pyplot as plt

    data = pre.load_file('airfoil_self_noise.dat')
    data = pre.organise(data,"\t","\r\n")
    data = pre.standardise(data,data.shape[1])

    t = np.reshape(data[:,5],[-1,1])
    data = data[:,:5]

    N = data.shape[0]
    M = 5
    lr = 1e-3

    # W = np.random.random([M,1])
    W = np.dot(np.dot(np.linalg.inv(np.dot(data.T,data)),data.T),t)
    data = data.T # Examples are arranged in columns [features,N]
    b = np.random.rand()
    epochs = 1000000
    loss = np.zeros([epochs])
    for epoch in range(epochs):
      if epoch%1000 == 0:
        lr /= 10
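      # Obtain the output, shape [N,1]
      y = np.dot(W.T,data).T + b
      # Sum of squared errors as a scalar (storing a 1x1 array in loss[epoch] is deprecated in newer NumPy)
      sse = np.sum((t-y)**2)
      loss[epoch] = sse/N
      var = sse/N
      # log likelihood
      ll = (-N/2)*(np.log(2*np.pi))-(N*np.log(np.sqrt(var)))-(sse/(2*var))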

      # Gradient Descent

      W_grad = np.zeros([M,1])
      B_grad = 0
      for i in range(N):
        err = (t[i]-y[i])
        W_grad += err * np.reshape(data[:,i],[-1,1])
        B_grad += err

      W_grad /= N
      B_grad /= N

      W += lr * W_grad
      b += lr * B_grad

      print("Epoch: %d, Loss: %.3f, Log-Likelihood: %.3f"%(epoch,loss[epoch],ll))
    plt.figure()
    plt.plot(range(epochs),loss,'-r')
    plt.show()

If you run the code above as posted, you will probably not see anything wrong, because I am doing W += lr * W_grad instead of W -= lr * W_grad. I would like to know why this works, since the gradient descent rule is to subtract the gradient from the old weight matrix. When I do subtract it, the error increases constantly. What am I missing?

For linear regression there exists a closed-form expression for the optimal weights, for example via MLE, i.e. there is no need to apply batch learning. - a_guest
Most importantly, you seem to be using the closed-form solution (the one that gives you the guaranteed minimum) to initialize the weights. What else but increasing loss do you expect to get? - coffeinjunky
From what I have observed, I don't think the closed-form solution gives the optimal weights. Instead of randomly initializing the weights, I initialize them with the analytical solution and then optimize them using gradient descent. As I said in the last line of the question, when I subtract the gradient from the old weights the loss starts increasing, while it decreases when I add it. Why does it behave this way? - VM_AI
@VM_AI The closed-form expression is the solution of the analytical minimization of the specified MSE loss function, so it does yield the optimal weights. Modifying those weights in any way will lead to increased loss (by definition). Moreover, the last statement in your question contains the exact same expression twice; I suppose it's a typo? - a_guest
@a_guest Thanks for the catch. Yes, it was a typo, and I have fixed it. Regarding your point that the MLE estimate of the weights produces optimal values: why, then, am I getting an MSE around 0.6 rather than 0.1 or less? - VM_AI

1 Answer


Found it. The problem was that I took the gradient of the loss function from a slide, and that expression was apparently not the plain gradient. It was not entirely wrong, but it was already negated, i.e. it already pointed in the direction of steepest descent. When I subtracted it from the weights, the update pointed in the direction of greatest increase, which gave rise to what I observed.

I worked out the partial derivative of the loss function myself to check, and got this:

    W_grad += data[:,i].reshape([-1,1])*(y[i]-t[i]).reshape([])
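Written out, and assuming the loss is the sum of squared errors scaled by 1/(2N) (the constant only rescales the step size) with prediction y_i = W^T x_i + b, the derivative behind that line is:

```latex
\[
L(W,b) = \frac{1}{2N}\sum_{i=1}^{N} (y_i - t_i)^2, \qquad y_i = W^\top x_i + b
\]
\[
\frac{\partial L}{\partial W} = \frac{1}{N}\sum_{i=1}^{N} (y_i - t_i)\,x_i, \qquad
\frac{\partial L}{\partial b} = \frac{1}{N}\sum_{i=1}^{N} (y_i - t_i)
\]
```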

This gradient points in the direction of greatest increase. When I multiply it by -lr, the update points in the direction of steepest descent, and training started working properly.
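Here is a minimal vectorised sketch of the corrected update. It assumes the loss (1/2N) * sum((y - t)^2) and uses synthetic data in place of the airfoil set; the function names, learning rate and data are illustrative and not taken from the code in the question. The residual is computed as y - t, so the update subtracts the gradient. The question's loop accumulates t - y instead, which is the negative gradient, so adding it with += is already a descent step.

```python
import numpy as np

def mse_loss(X, t, W, b):
    """Mean squared error of the predictions X @ W + b."""
    return float(np.mean((X @ W + b - t) ** 2))

def mse_gradients(X, t, W, b):
    """Gradients of (1/2N) * sum((y - t)^2) w.r.t. W and b (note the y - t order)."""
    N = X.shape[0]
    residual = X @ W + b - t           # y - t, shape [N, 1]
    W_grad = X.T @ residual / N        # shape [M, 1]
    b_grad = float(residual.mean())
    return W_grad, b_grad

def gradient_descent(X, t, lr=0.1, epochs=500):
    """Plain batch gradient descent starting from zero weights."""
    W = np.zeros((X.shape[1], 1))
    b = 0.0
    losses = []
    for _ in range(epochs):
        losses.append(mse_loss(X, t, W, b))
        W_grad, b_grad = mse_gradients(X, t, W, b)
        W -= lr * W_grad               # subtract: the gradient points uphill
        b -= lr * b_grad
    return W, b, losses

# Synthetic data (an assumption; stands in for the standardised dataset)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
true_W = np.array([[1.0], [-2.0], [0.5], [0.0], [3.0]])
t = X @ true_W + 0.7 + 0.1 * rng.standard_normal((200, 1))

W, b, losses = gradient_descent(X, t)

# Closed-form least squares with an explicit bias column, for comparison
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
theta, *_ = np.linalg.lstsq(X1, t, rcond=None)
W_ols, b_ols = theta[:-1], float(theta[-1, 0])

# At the closed-form solution the gradient is numerically zero
gW, gb = mse_gradients(X, t, W_ols, b_ols)
print(losses[0], losses[-1], np.abs(gW).max(), abs(gb))
```

Because the gradient vanishes at the closed-form solution, initialising there leaves gradient descent nothing to improve. A step with the wrong sign can only move the weights away from the minimum, which is consistent with the rising loss described in the question.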