8
votes

Based on tensorflow-char-rnn, I started a word-rnn project to predict the next word. But I found that training is far too slow on my dataset. Here are my training details:

  • Training data size: 1 billion words
  • Vocabulary size: 0.75 million
  • RNN model: lstm
  • RNN Layer: 2
  • Cell size: 200
  • Seq length: 20
  • Batch size: 40 (a larger batch size causes an OOM exception)

The machine details:

  • Amazon p2 instance
  • 1 K80 GPU
  • 16 GB GPU memory
  • 4-core CPU
  • 60 GB RAM

In my test, training one epoch takes 17 days! That is really too slow. I then changed seq2seq.rnn_decoder to tf.nn.dynamic_rnn, but the time is still 17 days.

I want to find out whether the slowness is caused by my code or whether it is inherently this slow, because I have heard rumors that the TensorFlow RNN implementation is slower than other DL frameworks.

This is my model code:

import tensorflow as tf
# Assuming TF 1.x: cell classes and the legacy sequence-loss helpers live in contrib.
from tensorflow.contrib import rnn as rnn_cell
from tensorflow.contrib import legacy_seq2seq as seq2seq


class SeqModel():
    def __init__(self, config, infer=False):
        self.args = config
        if infer:
            config.batch_size = 1
            config.seq_length = 1

        if config.model == 'rnn':
            cell_fn = rnn_cell.BasicRNNCell
        elif config.model == 'gru':
            cell_fn = rnn_cell.GRUCell
        elif config.model == 'lstm':
            cell_fn = rnn_cell.BasicLSTMCell
        else:
            raise Exception("model type not supported: {}".format(config.model))

        # Build one cell instance per layer (reusing one instance breaks in newer TF).
        cells = [cell_fn(config.hidden_size) for _ in range(config.num_layers)]
        self.cell = cell = rnn_cell.MultiRNNCell(cells)

        self.input_data = tf.placeholder(tf.int32, [config.batch_size, config.seq_length])
        self.targets = tf.placeholder(tf.int32, [config.batch_size, config.seq_length])
        self.initial_state = cell.zero_state(config.batch_size, tf.float32)

        with tf.variable_scope('rnnlm'):
            softmax_w = tf.get_variable("softmax_w", [config.hidden_size, config.vocab_size])
            softmax_b = tf.get_variable("softmax_b", [config.vocab_size])

            embedding = tf.get_variable("embedding", [config.vocab_size, config.hidden_size])
            inputs = tf.nn.embedding_lookup(embedding, self.input_data)

        # dynamic_rnn already returns a [batch_size, seq_length, hidden_size] tensor,
        # so no concat is needed before the reshape.
        outputs, last_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=self.initial_state)

        # [batch_size * seq_length, hidden_size]
        output = tf.reshape(outputs, [-1, config.hidden_size])

        self.logits = tf.matmul(output, softmax_w) + softmax_b
        self.probs = tf.nn.softmax(self.logits)

        self.final_state = last_state

        loss = seq2seq.sequence_loss_by_example([self.logits],
                                                [tf.reshape(self.targets, [-1])],
                                                [tf.ones([config.batch_size * config.seq_length])])
        self.cost = tf.reduce_sum(loss) / config.batch_size / config.seq_length

        self.lr = tf.Variable(0.0, trainable=False)
        tvars = tf.trainable_variables()
        grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, tvars),
                                          config.grad_clip)
        optimizer = tf.train.AdamOptimizer(self.lr)
        self.train_op = optimizer.apply_gradients(zip(grads, tvars))

Here is the GPU load during training:

Thanks very much.

3
64 days seems a bit too much, can you show the code? – sygi
Are you using the Google Billion Words dataset? – chris
@sygi The model code is above. I decreased the vocab size to 0.75 million (1.5M before), and changed the batch size to 40 (15 before) and the seq length to 20 (25 before), so I could move the word embedding to the GPU (OOM before). But it still needs 17 days per epoch. – Johnny K
@helloChris No, the dataset is from my company. – Johnny K
You may want to take a look here: static.googleusercontent.com/media/research.google.com/en//pubs/… They list training times that may help you get an idea of how long something should take for 1 billion words. They do have half the vocab size. It could just be that your data is huge. I wouldn't blame TensorFlow until you replicate the model in another framework and it takes way less time. – chris

3 Answers

5
votes

As you mentioned, batch_size is really important to tune; it can lead to an impressive speedup, but check that your perplexity stays reasonable.

Monitoring your GPU activity can give you hints about a potential I/O bottleneck.

Most importantly, using sampled softmax instead of the regular softmax is way faster. This requires you to use a [config.vocab_size, config.hidden_size] weight matrix instead of your [config.hidden_size, config.vocab_size] one. This is definitely the way to go in my view.
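As a rough sketch (not the asker's code), the loss could look like this with tf.nn.sampled_softmax_loss; the transposed weight name, the num_sampled value and the reshapes are assumptions for illustration:

# Sampled softmax sketch (assumed names; adapt to the model above).
# Note the transposed weight matrix: [vocab_size, hidden_size].
softmax_w_t = tf.get_variable("softmax_w_t", [config.vocab_size, config.hidden_size])
softmax_b = tf.get_variable("softmax_b", [config.vocab_size])

# `output` is [batch_size * seq_length, hidden_size]; targets flattened to [N, 1] int64.
labels = tf.reshape(tf.cast(self.targets, tf.int64), [-1, 1])

loss = tf.nn.sampled_softmax_loss(
    weights=softmax_w_t,
    biases=softmax_b,
    labels=labels,
    inputs=output,
    num_sampled=1024,               # assumed: negative classes sampled per step
    num_classes=config.vocab_size)  # full vocabulary (0.75M here)

self.cost = tf.reduce_sum(loss) / config.batch_size / config.seq_length

At inference time you would still compute the full logits, e.g. tf.matmul(output, tf.transpose(softmax_w_t)) + softmax_b, since sampling only makes sense during training.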

Hope this helps.

pltrdy

3
votes

One other possible way to speed up training, and a possible reason for your low GPU utilisation, is that you are feeding data through placeholders. You should be using queues if you are on TensorFlow < 1.2, and the tf.contrib.data module otherwise.

https://www.tensorflow.org/programmers_guide/threading_and_queues
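A rough sketch of what an input pipeline could look like with the Dataset API (assuming TF ≥ 1.10 for the drop_remainder argument; on 1.2–1.3 the equivalent lives under tf.contrib.data). The word_ids array is an assumed stand-in for your encoded corpus:

import numpy as np
import tensorflow as tf

# Assumed stand-in for the real corpus: a 1-D int32 array of vocabulary indices.
word_ids = np.random.randint(0, 750000, size=100000, dtype=np.int32)

seq_length = 20
batch_size = 40

# Build (input, target) windows shifted by one token, then batch and prefetch.
dataset = tf.data.Dataset.from_tensor_slices(word_ids)
dataset = dataset.batch(seq_length + 1, drop_remainder=True)
dataset = dataset.map(lambda chunk: (chunk[:-1], chunk[1:]))
dataset = dataset.batch(batch_size, drop_remainder=True)
dataset = dataset.prefetch(1)  # keep the GPU fed while the CPU prepares the next batch

inputs, targets = dataset.make_one_shot_iterator().get_next()
# `inputs` / `targets` would then replace the tf.placeholder tensors in the model.

For a corpus of 1 billion words you would stream from files (e.g. TFRecords or TextLineDataset) rather than from_tensor_slices, but the batching and prefetching idea is the same.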

0
votes

Here are two lines of code that sped up my execution.

tf.compat.v1.disable_eager_execution()
tf.config.optimizer.set_jit(True)

See here for eager execution, and here for jit, to judge whether they will help in your case.
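A sketch of where these calls would go, assuming a TF 2.x install (they must run before any model or graph is built):

import tensorflow as tf

# Call these before constructing the model or any tensors.
tf.compat.v1.disable_eager_execution()  # fall back to graph-mode execution
tf.config.optimizer.set_jit(True)       # enable XLA auto-clustering for the graph

# ... build the model and training loop as usual after this point ...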