5
votes

Suppose I have an artificial neural network with 5 hidden layers. For the moment, forget about the details of the model such as the biases, the activation functions used, the type of data, and so on. Of course, the activation functions are differentiable.

With symbolic differentiation, the following computes the gradients of the objective function with respect to the layers' weights:

w1_grad = T.grad(lost, [w1])
w2_grad = T.grad(lost, [w2])
w3_grad = T.grad(lost, [w3])
w4_grad = T.grad(lost, [w4])
w5_grad = T.grad(lost, [w5])
w_output_grad = T.grad(lost, [w_output])

This way, to compute the gradients w.r.t. w1, the gradients w.r.t. w2, w3, w4 and w5 must first be computed. Similarly, to compute the gradients w.r.t. w2, the gradients w.r.t. w3, w4 and w5 must be computed first.
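To make that shared structure explicit, here is the standard backpropagation recurrence, written as a sketch in generic notation (column-vector convention, with a_i the activation of layer i, z_i its pre-activation, and sigma the activation function; these symbols are not taken from the code above):

\delta_{\text{out}} = \frac{\partial L}{\partial z_{\text{out}}}, \qquad
\delta_i = \left(W_{i+1}^{\top} \delta_{i+1}\right) \odot \sigma'(z_i), \qquad
\frac{\partial L}{\partial W_i} = \delta_i \, a_{i-1}^{\top}

The error term \delta_i needed for the gradient w.r.t. W_i is built directly from \delta_{i+1}, which is exactly the quantity already computed for the gradient w.r.t. W_{i+1}; the question is whether Theano reuses it across separate T.grad calls.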

However, the following code also computes the gradients w.r.t. each weight matrix:

w1_grad, w2_grad, w3_grad, w4_grad, w5_grad, w_output_grad = T.grad(lost, [w1, w2, w3, w4, w5, w_output])

I was wondering, is there any difference between these two methods in terms of performance? Is Theano intelligent enough to avoid re-computing the gradients when using the second method? By intelligent I mean that, to compute w3_grad, Theano should [preferably] reuse the pre-computed gradients w_output_grad, w5_grad and w4_grad instead of computing them again.

1
Interesting question. Have you simply tried out and measured the runtime for both methods? - hbaderts
@hbaderts Not yet, I will do some experiments and post the performance results here. - Amir
@hbaderts Solution is posted now. - Amir
Amir, you've been editing new tags (jacobean, hessian) into a bunch of old questions, but I'm not convinced the tags are necessary - I think they are "meta tags", as they can't really stand on their own. They certainly aren't important enough to justify tag-only edits on old posts that have lots of other problems. I suggest you make a meta post to open a discussion about whether the tags are welcome; if there is community support for adding them, it won't need to be a one-man effort. - Mogsdad
@Amir And we aren't saying categorically that they shouldn't ever exist here, merely that you start a discussion on meta, because we are hesitant to agree with you about the meta status of such tags, as it were. If you can provide a compelling argument there, along with a good tag description that offers accurate usage guidance (where currently there is none), then as Mogsdad said, others can even assist you with adding these tags to appropriate questions. - TylerH

1 Answer

4
votes

Well, it turns out Theano does not reuse previously computed gradients when computing the gradients of lower layers of a computational graph. Here's a dummy example with a neural network that has 3 hidden layers and an output layer. However, this is not a big deal at all, since building the gradients is a one-time operation: T.grad returns a symbolic expression for the derivatives as a computational graph, and from that point on you simply compile it into a function and use that function to compute numerical gradient values and update the weights on every iteration (a minimal sketch of that step is included at the end of this answer).

import theano
import theano.tensor as T
import time
import numpy as np
from theano import shared

class neuralNet(object):
    def __init__(self, examples):
        # Shared weight matrices and bias vectors for 3 hidden layers + output layer
        self.w = shared(np.random.random((16384, 5000)).astype(theano.config.floatX), borrow=True, name='w')
        self.w2 = shared(np.random.random((5000, 3000)).astype(theano.config.floatX), borrow=True, name='w2')
        self.w3 = shared(np.random.random((3000, 512)).astype(theano.config.floatX), borrow=True, name='w3')
        self.w4 = shared(np.random.random((512, 40)).astype(theano.config.floatX), borrow=True, name='w4')
        self.b = shared(np.ones(5000, dtype=theano.config.floatX), borrow=True, name='b')
        self.b2 = shared(np.ones(3000, dtype=theano.config.floatX), borrow=True, name='b2')
        self.b3 = shared(np.ones(512, dtype=theano.config.floatX), borrow=True, name='b3')
        self.b4 = shared(np.ones(40, dtype=theano.config.floatX), borrow=True, name='b4')
        self.x = examples

        # Forward pass: three sigmoid hidden layers followed by a softmax output layer
        L1 = T.nnet.sigmoid(T.dot(self.x, self.w) + self.b)
        L2 = T.nnet.sigmoid(T.dot(L1, self.w2) + self.b2)
        L3 = T.nnet.sigmoid(T.dot(L2, self.w3) + self.b3)
        L4 = T.dot(L3, self.w4) + self.b4
        self.forwardProp = T.nnet.softmax(L4)
        self.predict = T.argmax(self.forwardProp, axis=1)

    def loss(self, y):
        # Negative log-likelihood of the correct classes
        return -T.mean(T.log(self.forwardProp)[T.arange(y.shape[0]), y])

x = T.matrix('x')
y = T.ivector('y')

nnet = neuralNet(x)
loss = nnet.loss(y)

differentiationTime = []
for i in range(100):
    t1 = time.time()
    # One call builds the symbolic gradients w.r.t. all parameters at once
    gw, gw2, gw3, gw4, gb, gb2, gb3, gb4 = T.grad(loss, [nnet.w, nnet.w2, nnet.w3, nnet.w4, nnet.b, nnet.b2, nnet.b3, nnet.b4])
    differentiationTime.append(time.time() - t1)
print('Efficient Method: Took %f seconds with std %f' % (np.mean(differentiationTime), np.std(differentiationTime)))

differentiationTime = []
for i in range(100):
    t1 = time.time()
    # Separate calls build the symbolic gradients one parameter at a time
    gw = T.grad(loss, [nnet.w])
    gw2 = T.grad(loss, [nnet.w2])
    gw3 = T.grad(loss, [nnet.w3])
    gw4 = T.grad(loss, [nnet.w4])
    gb = T.grad(loss, [nnet.b])
    gb2 = T.grad(loss, [nnet.b2])
    gb3 = T.grad(loss, [nnet.b3])
    gb4 = T.grad(loss, [nnet.b4])
    differentiationTime.append(time.time() - t1)
print('Inefficient Method: Took %f seconds with std %f' % (np.mean(differentiationTime), np.std(differentiationTime)))

This prints out the following:

Efficient Method: Took 0.061056 seconds with std 0.013217
Inefficient Method: Took 0.305081 seconds with std 0.026024

This shows that, when all the gradients are requested in a single call to T.grad, Theano uses a dynamic-programming-style approach and shares the intermediate results, whereas separate calls rebuild them from scratch.
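For completeness, here is a minimal sketch of the "compile once, reuse every iteration" step mentioned above. It assumes the nnet, loss, x and y objects defined in the example and an illustrative learning rate; it is not part of the benchmark, just the usual way the symbolic gradients are turned into a numerical update function:

# Build the symbolic gradients once (the expensive, one-time step)
params = [nnet.w, nnet.w2, nnet.w3, nnet.w4, nnet.b, nnet.b2, nnet.b3, nnet.b4]
grads = T.grad(loss, params)

# Compile a single training function; Theano optimizes the whole graph here
learning_rate = 0.01  # illustrative value, not from the original post
updates = [(p, p - learning_rate * g) for p, g in zip(params, grads)]
train = theano.function(inputs=[x, y], outputs=loss, updates=updates)

# From now on, each training iteration is just a cheap numerical call:
# batch_x, batch_y = ...  (your data)
# current_loss = train(batch_x, batch_y)

The timing numbers above therefore only matter once, when the training function is built.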