3
votes

I am currently trying to do time series prediction with an LSTM implemented in Keras.

I trained an LSTM model with 10 000 samples in the training set and 2 500 samples in the test set. I am using a batch size of 30.

Now I am trying to train the exact same model but with more data: a training set of 100 000 samples and a test set of 25 000 samples.

The time for one epoch is roughly 100 times longer with the big dataset.

Even though I have more data, the batch size is the same, so I expected training not to take more time. Is it possible that it is the calculation of the loss on the training and test data that takes so long (since all the data is used there)?

Concerning the batch size: should I increase it because I have more data?

EDIT 1

I tried changing the batch size to a bigger one. When I do that, the training time decreases a lot. But shouldn't the computation of the gradient take longer with a big batch size than with a small one?

I have no clue here; I really do not understand why this is happening.

Does someone know why this is happening? Is it linked to the data I use? How, theoretically, can this happen?

EDIT 2

My processor is an Intel Xeon W3520 (4 cores / 8 threads) with 32 GB of RAM. The data consists of sequences of length 6 with 4 features. I use one LSTM layer with 50 units and a dense output layer. Whether I am training with 10 000 samples or 100 000, it is really the batch size that changes the computation time. I can go from 2 seconds per epoch with a batch size of 1000 to 200 seconds with a batch size of 30.

I do not use a generator; I use the basic call model.fit(Xtrain, Ytrain, nb_epoch=nb_epoch, batch_size=batch_size, verbose=2, callbacks=callbacks, validation_data=(Xtest, Ytest)) with callbacks = [EarlyStopping(monitor='val_loss', patience=10, verbose=2), history].
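For completeness, here is a minimal, self-contained sketch of the setup described above (Keras 1 API with nb_epoch, as in the question). The random placeholder data, the epoch count, and the loss/optimizer choice are my assumptions, not taken from the question:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import LSTM, Dense
    from keras.callbacks import EarlyStopping, History

    # Placeholder data with the shapes described above: sequences of length 6, 4 features
    Xtrain = np.random.rand(100000, 6, 4)
    Ytrain = np.random.rand(100000, 1)
    Xtest = np.random.rand(25000, 6, 4)
    Ytest = np.random.rand(25000, 1)

    # One LSTM layer with 50 units and a dense output layer
    model = Sequential()
    model.add(LSTM(50, input_shape=(6, 4)))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer='adam')   # assumed loss/optimizer

    history = History()
    callbacks = [EarlyStopping(monitor='val_loss', patience=10, verbose=2), history]

    model.fit(Xtrain, Ytrain,
              nb_epoch=100,        # assumed value; newer Keras versions use epochs=
              batch_size=30,
              verbose=2,
              callbacks=callbacks,
              validation_data=(Xtest, Ytest))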

The bottleneck was probably feeding the batch data to the computing unit... IO can be a bottleneck in neural networks, so a bigger batch size means you feed more data at once; the computation then takes enough time to feed the next batch, so it becomes optimal. - Nassim Ben
Could you provide us with information about the device you are using for your computations? Also, the size of the data might be useful (number of features, etc.). - Marcin Możejko
Could you also provide information about the way you feed this data? In memory or reading it with a generator? - Nassim Ben
Thanks guys for your answers! I edited my question with the information you asked for. I am going to research in the direction you gave me. - BenDes

1 Answer

1
votes

You seemingly have misunderstood parts of how SGD (Stochastic Gradient Descent) works. I explained parts of this in another answer here on Stack Overflow, which might help you understand it better, but I'll take the time to explain it again here.

The basic idea of gradient descent is to compute the forward pass (and store the activations) for all training samples, and only afterwards update your weights once. Now, since you might not have enough memory to store all the activations (which you need to compute the backpropagation gradient), and for other reasons (mainly convergence), you often cannot do classical gradient descent.

Stochastic Gradient Descent makes the assumption that, by sampling in a random order, you can reach convergence by looking at only one training sample at a time and updating the weights immediately afterwards. This is called an iteration, whereas the pass through all training samples is called an epoch.
Mini-batches change SGD only in that, instead of using one single sample, you take a "handful" of samples, with the number determined by the batch size.
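To make this concrete, here is a rough, framework-free sketch of one epoch of mini-batch SGD; compute_gradient and apply_update are placeholders for the forward/backward pass and the weight update:

    import numpy as np

    def run_epoch(X, Y, batch_size, compute_gradient, apply_update):
        # One epoch of mini-batch SGD: shuffle, then one weight update per batch
        n = len(X)
        order = np.random.permutation(n)             # sample in a random order
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]    # a "handful" of samples
            grad = compute_gradient(X[idx], Y[idx])  # forward + backward pass on the batch
            apply_update(grad)                       # one weight update per batch
        # total updates in this epoch: ceil(n / batch_size)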

Now, updating the weights is quite a costly process, and it should be clear at this point that updating the weights a great number of times (as with plain SGD or a small batch size) is more costly than computing the gradient over a larger batch and updating only a few times (as with a large batch size). With a fixed number of training samples, the number of weight updates per epoch is the number of samples divided by the batch size, so a smaller batch size directly means more updates per epoch.
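For your numbers, the update count per epoch works out as follows (a back-of-the-envelope illustration using the 100 000 training samples from the question):

    import math

    n_samples = 100000
    for batch_size in (30, 1000):
        updates_per_epoch = int(math.ceil(n_samples / float(batch_size)))
        print(batch_size, updates_per_epoch)
    # batch_size=30   -> 3334 weight updates per epoch
    # batch_size=1000 ->  100 weight updates per epoch

Each of those updates carries its own overhead (feeding the batch to the compute device, launching the computation, applying the update), so shrinking the batch size multiplies that overhead even though the number of samples seen per epoch stays the same.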