3
votes

I'm diving into LSTM RNNs with Keras and the Theano backend. While working through the LSTM example from Keras' repo (the full code of lstm_text_generation.py on GitHub), one thing isn't entirely clear to me: the way it vectorizes the input data (text characters):

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

# np refers to numpy
print('Vectorization...')
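# one-hot encode: X gets shape (num sequences, maxlen, num distinct chars),
# y gets shape (num sequences, num distinct chars) for the character that follows each sequence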
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Here, as you can see, they allocate arrays of zeros with NumPy and then set a '1' at the position given by each character's index, i.e. they one-hot encode the input sequences.

The question is: why did they use that approach? Is it possible to optimize it somehow? Maybe the input data could be encoded in some other way that doesn't require these huge arrays of arrays? The problem is that this puts a severe limit on the input size: generating such vectors for >10 MB of text causes a Python MemoryError (dozens of GB of RAM would be needed to process it!).

Thanks in advance, guys.

1
What kind of dimensions are we talking about (roughly how big are len(sentences) and len(chars) for your dataset)? How much RAM do you have? – ali_m
I have 6 GB of RAM, but I also tried running it on a 32 GB RAM VPS. As for the dimensions: for 520 KB of input text they are len(sentences)=174507 and len(chars)=74, and everything runs OK. But for 17 MB of input text they are len(sentences)=5853627 and len(chars)=74, and a MemoryError is thrown on 6 GB of RAM. – Alex M

1 Answer

2
votes

There are at least two optimizations in Keras which you could use to decrease the amount of memory needed in this case:

  1. An Embedding layer, which makes it possible to feed the network a single integer per character instead of a full one-hot vector (see the first sketch below). Moreover, this layer can be pretrained before the final stage of network training, so you can inject some prior knowledge into your model (and even fine-tune it during fitting).

  2. The fit_generator method, which makes it possible to train the network from a predefined generator producing the (x, y) pairs needed for fitting (see the second sketch below). You could, for example, save the whole dataset to disk and read it part by part through such a generator.
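
Here is a minimal sketch of the first option. It is not code from the Keras repo: the embedding_dim value and layer sizes are my own assumptions, and it reuses sentences and char_indices from the snippet in your question. Each character is passed as a single integer index, and the Embedding layer turns it into a dense vector internally, so the huge (len(sentences), maxlen, len(chars)) boolean tensor is never allocated:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Activation

maxlen = 40          # same window length as in your snippet
vocab_size = 74      # len(chars) reported in the comments; use your own alphabet size
embedding_dim = 32   # assumed size of the learned character vectors

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=maxlen))
model.add(LSTM(128))
model.add(Dense(vocab_size))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# X now stores integer indices only: shape (len(sentences), maxlen), dtype uint8,
# i.e. len(chars) times smaller than the one-hot version in your snippet.
# y can stay exactly as in your snippet, since (len(sentences), len(chars)) is
# comparatively small.
X = np.zeros((len(sentences), maxlen), dtype=np.uint8)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t] = char_indices[char]

# then train as usual with model.fit(X, y, ...), with y built exactly as before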
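
And a rough sketch of the second option; the batch_generator below and its batch_size are my own construction, not code from the Keras repo. It yields the same kind of one-hot batches as the snippet in your question, but only one batch is in memory at any time, so the multi-gigabyte tensor is never allocated:

import numpy as np

def batch_generator(text, char_indices, maxlen=40, step=3, batch_size=128):
    """Endlessly yield (X, y) one-hot batches built from small slices of text."""
    n_chars = len(char_indices)
    starts = list(range(0, len(text) - maxlen, step))
    while True:  # fit_generator expects the generator to loop forever
        for b in range(0, len(starts), batch_size):
            batch = starts[b:b + batch_size]
            X = np.zeros((len(batch), maxlen, n_chars), dtype=bool)
            y = np.zeros((len(batch), n_chars), dtype=bool)
            for i, start in enumerate(batch):
                for t, char in enumerate(text[start:start + maxlen]):
                    X[i, t, char_indices[char]] = 1
                y[i, char_indices[text[start + maxlen]]] = 1
            yield X, y

# Training then replaces model.fit(X, y, ...) with something like:
# model.fit_generator(batch_generator(text, char_indices), ...)
# (the exact argument names depend on your Keras version: older releases use
#  samples_per_epoch and nb_epoch, newer ones use steps_per_epoch and epochs)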

Of course, both of these methods can be combined (e.g. a generator that yields the integer-index batches for the Embedding model). I think simplicity was the reason behind this kind of implementation in the example you provided.