Neural network- derivative for second layer parameters wrong

Question

I am building my first neural network. Although it's encouraging to see that I get around 95-98% accuracy. I found from gradient checking that the derivatives for theta(parameters) for 2nd layer are way off from what I got using numerical gradient checking (max difference is 0.9..). My inputs are 8X8 images of digits from sklearn load_digits.

Input dimension (1797,65), with bias.
output dimension (1797,10).
Neural net architecture: 3 layers. layer1-65 nodes, layer2-101 nodes, layer3 10 nodes.

Below are python code snippets.

#forward propagation
a1 = x #(1797, 64)
a1 = np.column_stack((np.ones(m,),a1)) #(1797,65)
a2 = expit(a1.dot(theta1)) #(1797,100)
a2 = np.column_stack((np.ones(m,),a2)) #(1797,101)
a3 = expit(a2.dot(theta2)) #(1797,10)
a3[a3==1] = 0.999999 #to avoid log(1)
res1 = np.multiply(outputs,np.log(a3)) #(1797,10) .* (1797,10) 
res2 = np.multiply(1-outputs,np.log(1-a3))
lamda = 0.5
cost = (-1/m)*(res1+res2).sum(axis=1).sum() + lamda/(2*m)*(np.square(theta1[1:,:]).sum(axis=1).sum() + np.square(theta2[1:,:]).sum(axis=1).sum())

Back propagation code:

#Back propagation
delta3 = a3 - outputs
delta2 = np.multiply(delta3.dot(theta2.T),np.multiply(a2,1-a2)) #(1797,10) * (10,101) = (1797,101)
D1 = (a1.T.dot(delta2[:,1:])) #(65, 1797) * (1797,100) = (65,100)
D1[0,:] = 1/m * D1[0,:]
D1[1:,:] = 1/m * (D1[1:,:] + lamda*theta1[1:,:])
D2 = (a2.T.dot(delta3)) #(101,1797) * (1797, 10) = (101,10)
D2[0,:] = 1/m * D2[0,:]
D2[1:,:] = 1/m * (D2[1:,:] + lamda*theta2[1:,:]) #something wrong in D2 calculation steps...
#print(theta1.shape,theta2.shape,D1.shape,D2.shape)
#this is what is returned by cost function
return cost,np.concatenate((np.asarray(D1).flatten(),np.asarray(D2).flatten())) #last 1010 wrong values

As you can see the gradient is flattened. When I use numerical gradient checking I find the first 6500 numbers are very close to 'D1' with max difference = 1.0814544260334766e-07. But last 1010 items, which correspond to D2 are off by max 0.9. Below is gradient checking code:

print("Checking gradient:")
c,grad = cost(np.concatenate((np.asarray(theta1).flatten(),np.asarray(theta2).flatten())),x_tr,y_tr,theta1.shape,theta2.shape)
grad_approx = checkGrad(x_tr,y_tr,theta1,theta2)
print("Non zero in grad",np.count_nonzero(grad),np.count_nonzero(grad_approx))
tup_grad = np.nonzero(grad)
print("Original\n",grad[tup_grad[0][0:20]])
print("Numerical\n",grad_approx[tup_grad[0][0:20]])
wrong_grads = np.abs(grad-grad_approx)>0.1
print("Max diff:",np.abs(grad-grad_approx).max(),np.count_nonzero(wrong_grads),np.abs(grad-grad_approx)[0:6500].max())
print(np.squeeze(np.asarray(grad[wrong_grads]))[0:20])
print(np.squeeze(np.asarray(grad_approx[wrong_grads]))[0:20])
where_tup = np.where(wrong_grads)
print(where_tup[0][0:5],where_tup[0][-5:])

Check grad function:

def checkGrad(x,y,theta1,theta2):
eps = 0.0000001 #0.00001    
theta = np.concatenate((np.asarray(theta1).flatten(),np.asarray(theta2).flatten()))
gradApprox = np.zeros((len(theta,)))
thetaPlus = np.copy(theta)
thetaMinus = np.copy(theta)
print("Total iterations to be made",len(theta))
for i in range(len(theta)):
    if(i % 100 == 0):
        print("iteration",i)
    if(i != 0):
        thetaPlus[i-1] = thetaPlus[i-1]-eps
        thetaMinus[i-1] = thetaMinus[i-1]+eps
    thetaPlus[i] = theta[i]+eps
    thetaMinus[i] = theta[i]-eps
    cost1,grad1 = cost(thetaPlus,x,y,theta1.shape,theta2.shape)
    cost2,grad2 = cost(thetaMinus,x,y,theta1.shape,theta2.shape)
    gradApprox[i] = (cost1 - cost2)/(2*eps)

return gradApprox

I believe I am doing some rookie mistake. I realize it may be a lot of code to go through. But for someone with experience in the field may have some suggestions where I am doing mistake.

Where is the mistake?
why do i get very different results for scipy minimize using same algorithm (TNC). BFGS gave bad results.

EDIT: further clarity : I use checkGrad function to verify that the derivatives(D1 and D2) that I calculate for parameters theta1 and theta2 using backprop are correct. "lamda" (typo) is regularization constant 0.5. Expit is sigmoid function of numpy.

FULL CODE: https://www.kaggle.com/darkknight91/neuralnet-digits

Can you specify what your checkGrad function does? Same for expit? (Also, the letter "lambda" is written differently ;-) — dennlinger
:-D. sorry for confusion. I have added response to your queries. — nayakasu
I would actually like to see what you are calling internally. Is this scipy.optimize.check_grad? Additonally, where do you even set your theta values initially? — dennlinger
@dennlinger since it would be clumsy to post whole code here I have added a link to my full code. Theta is set to random values between 0 to 3. — nayakasu

Omar Cusma Fait Omar Cusma Fait · Accepted Answer · 2018-08-13T08:37:49

General method

If this is your first NN then I suggest you to compute the gradient using the network itself:

from copy import deepcopy as dc


def gradient(net, weights, dx=0.01):
    der = []
    # compute partial derivatives
    for i in range(len(weights)):
        w = dc(weights)     # dc is needed
        w[i] += dx
        der.append((net(w) - net(weights)) / dx)    # dy/dx
    return der

Here there's a simple example

def net(weights):
    """ this is your net """
    # example with just a simple cost function for clarity purposes
    return sum([i**2 for i in weights])

print('cost', net([1, 1]))
print('gradient =', gradient(net, [1, 1]))

Extra hint: make alway shure to have used deepcopy when needed, this took me a lot of time to figure out!

Neural network- derivative for second layer parameters wrong

1 Answers