python - Training broke with ResourceExhausted error

Question

I am new to TensorFlow and machine learning. I have recently been working on a model with the following structure:

Character level Embedding Vector -> Embedding lookup -> LSTM1
Word level Embedding Vector -> Embedding lookup -> LSTM2
[LSTM1+LSTM2] -> Single layer MLP -> softmax layer
[LSTM1+LSTM2] -> Single layer MLP -> WGAN discriminator
Code of the rnn model

While working on this model I got the following error. I thought my batch was too big, so I reduced the batch size from 20 to 10, but that did not help.

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[24760,100] [[Node: chars/bidirectional_rnn/bw/bw/while/bw/lstm_cell/split = Split[T=DT_FLOAT, num_split=4, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients_2/Add_3/y, chars/bidirectional_rnn/bw/bw/while/bw/lstm_cell/BiasAdd)]] [[Node: bi-lstm/bidirectional_rnn/bw/bw/stack/_167 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_636_bi-lstm/bidirectional_rnn/bw/bw/stack", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

A tensor with shape [24760, 100] takes 2476000 * 32 / (8 * 1024 * 1024) = 9.44519043 MB of memory. I am running the code on a Titan X (11 GB) GPU. What could be going wrong? Why does this type of error occur?
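The arithmetic above can be checked with a few lines of Python. This is only a sketch: `tensor_mib` is a hypothetical helper name, and it assumes 4 bytes per element (DT_FLOAT, as in the error message).

```python
def tensor_mib(shape, bytes_per_element=4):
    """Return the size in MiB of a dense tensor with the given shape."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / (1024 * 1024)


size = tensor_mib([24760, 100])
print(round(size, 2))  # 9.45
```

So the single tensor named in the error really is under 10 MB. The message only reports the allocation that failed, not the total memory already in use at that point.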

Extra info: the size of LSTM1 is 100; as a bidirectional LSTM it becomes 200. The size of LSTM2 is 300; as a bidirectional LSTM it becomes 600.

Note: The error occurred after 32 epochs. My question is why the error appears only after 32 epochs and not in the first epoch.

Possible duplicate of Tensorflow Deep MNIST: Resource exhausted: OOM when allocating tensor with shape[10000,32,28,28] — Niyamat Ullah
I didn't encounter an error of the type "W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 957.03MiB. See logs for memory state.", but the solution seems similar; also see the Note in the question. — Maruf

Maruf Maruf · Accepted Answer · 2018-01-04T13:43:22

I spent the last few days tweaking things to solve this problem.

In the end, I have not solved the mystery of the memory size described in the question. My guess is that TensorFlow allocates a lot of additional memory while computing the gradients. Confirming this would mean reading the TensorFlow source, which seems too cumbersome right now. You can check how much memory your model is using from the terminal with the following command:

nvidia-smi

From its output you can judge how much additional memory is available.
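If you want the same numbers inside a script, something like the following may work. It is a sketch, not a tested monitoring tool: the function names are made up, it assumes `nvidia-smi` is on the PATH, and it uses nvidia-smi's CSV query mode. The sample string at the bottom is invented so the parser can be tried without a GPU.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' lines (in MiB) into (used, total, free) tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total, total - used))
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return one (used, total, free) tuple per GPU, in MiB."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_memory_csv(out)


# Invented sample line, only to show the shape of the result.
print(parse_memory_csv("10240, 12288\n"))  # [(10240, 12288, 2048)]
```

Keep in mind that TensorFlow by default reserves most of the GPU memory up front, so the "used" figure can be much larger than what the model actually needs.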

But the solution to this type of problem lies in reducing the batch size.

In my case, reducing the batch size to 3 worked. This may vary from model to model.
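A rough back-of-the-envelope estimate of why the batch size matters so much: backpropagation through an unrolled LSTM has to keep intermediate tensors for every time step, so a tensor of under 10 MB can be multiplied many times over. The sketch below only illustrates that scaling. `steps=20` and `tensors_per_step=10` are made-up values, not measurements from this model, and the real count depends on the TensorFlow implementation.

```python
def lstm_activation_mib(rows, hidden, steps, tensors_per_step, bytes_per_element=4):
    """Estimate MiB kept for the backward pass of one LSTM direction.

    Assumes `tensors_per_step` float32 tensors of shape [rows, hidden]
    are stored at each of `steps` time steps (an illustrative assumption).
    """
    per_tensor = rows * hidden * bytes_per_element
    return per_tensor * tensors_per_step * steps / (1024 * 1024)


# rows=24760 and hidden=100 come from the shape in the error message.
full = lstm_activation_mib(rows=24760, hidden=100, steps=20, tensors_per_step=10)
# Halving the batch halves the number of rows, and with it the estimate.
smaller = lstm_activation_mib(rows=24760 // 2, hidden=100, steps=20, tensors_per_step=10)
print(round(full), round(smaller))  # 1889 945
```

Under these assumptions one direction of one LSTM already needs close to 2 GB, and the estimate shrinks linearly with the batch size.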

But what if your model's embedding matrix is so big that you cannot load it into memory?

The solution is to write some painful code.

You have to do the lookup on the embedding matrix yourself and then load the looked-up embeddings into the model. In short, for each batch you have to give the looked-up matrices to the model (feed them through the feed_dict argument of sess.run()).
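The host-side part of this lookup could look roughly like the NumPy sketch below. `batch_embeddings` is a hypothetical helper and the tiny `vocab` array stands in for the huge matrix. The idea is to pick out only the rows the current batch needs and remap the token ids so they index into that small slice.

```python
import numpy as np


def batch_embeddings(embedding_matrix, token_ids):
    """Return (unique_ids, rows, local_ids) for one batch.

    rows      : the embedding rows this batch needs, shape [n_unique, dim]
    local_ids : token_ids remapped to index into `rows`, same shape as token_ids
    """
    token_ids = np.asarray(token_ids)
    unique_ids, inverse = np.unique(token_ids.ravel(), return_inverse=True)
    rows = embedding_matrix[unique_ids]
    return unique_ids, rows, inverse.reshape(token_ids.shape)


# Stand-in for a large embedding matrix kept outside the model.
vocab = np.arange(50, dtype=np.float32).reshape(10, 5)
batch = np.array([[7, 2, 7], [2, 9, 0]])

unique_ids, rows, local_ids = batch_embeddings(vocab, batch)
print(unique_ids.tolist())  # [0, 2, 7, 9]
print(rows.shape)           # (4, 5)
print(local_ids.tolist())   # [[2, 1, 2], [1, 3, 0]]
```

`rows` and `local_ids` are then the values you would pass through feed_dict to placeholders of matching shape, and the model looks up `local_ids` in `rows` instead of in the full matrix.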

Next you will face a new problem:

You cannot make the embeddings trainable this way. The solution is to feed the embeddings through a placeholder and assign them to a Variable (say, A). After each batch of training, the learning algorithm updates the variable A. Then fetch the value of A from TensorFlow and write it back to your embedding matrix, which lives outside the model. (I said the process is painful.)
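The write-back step can be sketched in NumPy as follows. Everything here is illustrative: `write_back` is a hypothetical helper, the small zero matrix stands in for the real embedding matrix, and the "training step" is faked with a constant shift, where the real code would fetch the updated value of A with sess.run.

```python
import numpy as np


def write_back(embedding_matrix, unique_ids, updated_rows):
    """Copy the rows trained in this batch back into the full host-side matrix.

    unique_ids must not contain duplicates, otherwise the assignment is ambiguous.
    """
    embedding_matrix[unique_ids] = updated_rows
    return embedding_matrix


# Stand-in for the large matrix kept outside the model.
full_matrix = np.zeros((6, 3), dtype=np.float32)
unique_ids = np.array([1, 4])      # ids that occurred in the current batch
rows = full_matrix[unique_ids]     # what would be fed in and assigned to A

# Stand-in for one training step that changes A.
updated_rows = rows - 0.1

write_back(full_matrix, unique_ids, updated_rows)
print(full_matrix)
```

Only rows 1 and 4 change; all other rows of the host-side matrix keep their old values until a batch that uses them comes along.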

Now your next question should be: what if you cannot even feed the looked-up embeddings to the model because they are too big? This is a fundamental problem that you cannot avoid; at some point you simply need more GPU memory. That is part of why the NVIDIA GTX 1080, 1080 Ti and NVIDIA TITAN Xp differ so much in price, even though the 1080 and 1080 Ti run at higher clock frequencies.

