3
votes

I've been following Andrew Ng's CS229 machine learning course and am now covering logistic regression. The goal is to maximize the log likelihood function and find the values of theta that do so. The lecture notes are here: http://cs229.stanford.edu/notes/cs229-notes1.ps (pages 16-19). The code below was shown on the course homepage (in Matlab, though; I converted it to Python).

I'm applying it to a data set with 100 training examples (a data set given on the Coursera homepage for an introductory machine learning course). The data has two features, which are the scores on two exams. The output is 1 if the student was admitted and 0 if the student was not admitted. I have shown all of the code below. The code causes the likelihood function to converge to a maximum of about -62. The corresponding values of theta are [-0.05560301 0.01081111 0.00088362]. Using these values, when I test a training example like [1, 30.28671077, 43.89499752], which should give a value of 0 as output, I obtain 0.576, which makes no sense to me. If I test the hypothesis function with input [1, 10, 10] I obtain 0.515, which again makes no sense. These values should correspond to a lower probability. This has me quite confused.

import numpy as np
import sig as s   # small helper module containing the sigmoid function

def batchlogreg(X, y):
    max_iterations = 800
    alpha = 0.00001

    (m, n) = np.shape(X)

    X = np.insert(X, 0, 1, 1)                  # prepend a column of ones (intercept term)
    theta = np.array([0] * (n + 1), 'float')
    ll = np.array([0] * max_iterations, 'float')

    for i in range(max_iterations):
        hx = s.sigmoid(np.dot(X, theta))       # hypothesis h_theta(x) for every example
        d = y - hx                             # prediction error
        theta = theta + alpha * np.dot(np.transpose(X), d)   # gradient ascent step
        ll[i] = sum(y * np.log(hx) + (1 - y) * np.log(1 - hx))   # log likelihood

    return (theta, ll)
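
For completeness, sig is just a small helper module; a minimal version of it (assuming it only wraps the standard logistic function) would be:

# sig.py -- minimal sigmoid helper (assumed contents)
import numpy as np

def sigmoid(z):
    # element-wise logistic function 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))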
The link doesn't work. Also, you should really post the relevant text in your question rather than just link to it. - Kmeixner
Apologies, here is a working version: cs229.stanford.edu/notes/cs229-notes1.ps - Aerole
Have you tried a larger learning rate (alpha)? It's usually not that small, so you might not be training your model properly. Try 0.1, 0.001 and so on. - IVlad

3 Answers

1
votes

Note that the sigmoid function has:

sig(0) = 0.5
sig(x > 0) > 0.5
sig(x < 0) < 0.5

Since all the probabilities you get are above 0.5, this suggests that X * theta never becomes negative, or that it does, but your learning rate is too small for it to matter.

for i in range(max_iterations):
    hx = s.sigmoid(np.dot(X, theta)) # this will probably be > 0.5 initially
    d = y - hx # then this will be "very" negative when y is 0
    theta = theta + alpha*np.dot(np.transpose(X),d) # (1)
    ll[i] = sum(y * np.log(hx) + (1-y) * np.log(1- hx))

The problem is most likely at (1). The dot product will be very negative, but your alpha is so small that it negates the effect. So theta never decreases enough to correctly classify the examples labeled 0.

Positive instances are then only barely correctly classified for the same reason: the algorithm never discovers a reasonable hypothesis with your number of iterations and learning rate.

Possible solution: increase alpha and/or the number of iterations, or use momentum.
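
For example, here is a sketch of the same update loop with a larger learning rate and a simple (heavy-ball) momentum term, reusing the variable names from the question; the exact alpha and momentum coefficient are just starting points to tune from:

# same gradient ascent loop, but with a larger alpha and a momentum term
alpha = 0.001                      # much larger than 0.00001; tune for your data
beta = 0.9                         # momentum coefficient (assumed starting value)
velocity = np.zeros(theta.shape)   # running combination of past gradients

for i in range(max_iterations):
    hx = s.sigmoid(np.dot(X, theta))
    grad = np.dot(np.transpose(X), y - hx)      # gradient of the log likelihood
    velocity = beta * velocity + alpha * grad   # accumulate momentum
    theta = theta + velocity                    # momentum-accelerated ascent step
    ll[i] = np.sum(y * np.log(hx) + (1 - y) * np.log(1 - hx))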

0
votes

It sounds like you could be confusing probabilities with assignments.

The probability will be a real number between 0.0 and 1.0. A label will be an integer (0 or 1). Logistic regression is a model that provides the probability of a label being 1 given the input features. To obtain a label value, you need to make a decision using that probability. An easy decision rule is that the label is 0 if the probability is less than 0.5, and 1 if the probability is greater than or equal to 0.5.

So, for the examples you gave, the decisions would both be 1 (which means the model is wrong for the first example, where the label should be 0).
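
In code (reusing the names from the question), the decision rule is just a threshold on the predicted probabilities:

# turn predicted probabilities into 0/1 label decisions with a 0.5 threshold
probs = s.sigmoid(np.dot(X, theta))   # P(y = 1 | x) for each example
labels = (probs >= 0.5).astype(int)   # 1 if probability >= 0.5, otherwise 0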

0
votes

I ran into the same problem and found the reason.

Normalize X first, or give the intercept column a scale comparable to the features (e.g. 50 instead of 1).

Otherwise the contours of the cost function are too "narrow": a big alpha makes gradient descent overshoot, while a small one barely makes progress.
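
For example, a minimal sketch of mean/variance normalization before calling the training function from the question (mu and sigma are computed on the training set and must also be applied to any test input):

# scale each feature column to zero mean and unit variance before training
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma
theta, ll = batchlogreg(X_norm, y)    # the question's function, now on scaled features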