26 votes

I am trying to feed a huge sparse matrix to a Keras model. As the dataset doesn't fit into RAM, the workaround is to train the model on data generated batch-by-batch by a generator.

To test this approach and make sure my solution works, I slightly modified Keras's simple MLP on the Reuters newswire topic classification task. So the idea is to compare the original and the edited model. I just convert the numpy.ndarray into a scipy.sparse.csr.csr_matrix and feed it to the model.

But my model crashes at some point, and I need a hand figuring out the reason.

Here is the original model, with my additions below:

from __future__ import print_function
import numpy as np
np.random.seed(1337)  # for reproducibility

from keras.datasets import reuters
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.utils import np_utils
from keras.preprocessing.text import Tokenizer

max_words = 1000
batch_size = 32
nb_epoch = 5

print('Loading data...')
(X_train, y_train), (X_test, y_test) = reuters.load_data(nb_words=max_words, test_split=0.2)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

nb_classes = np.max(y_train)+1
print(nb_classes, 'classes')

print('Vectorizing sequence data...')
tokenizer = Tokenizer(nb_words=max_words)
X_train = tokenizer.sequences_to_matrix(X_train, mode='binary')
X_test = tokenizer.sequences_to_matrix(X_test, mode='binary')
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

print('Convert class vector to binary class matrix (for use with categorical_crossentropy)')
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
print('Y_train shape:', Y_train.shape)
print('Y_test shape:', Y_test.shape)


print('Building model...')
model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
          optimizer='adam',
          metrics=['accuracy'])

history = model.fit(X_train, Y_train,
                nb_epoch=nb_epoch, batch_size=batch_size,
                verbose=1)#, validation_split=0.1)
score = model.evaluate(X_test, Y_test,
                       batch_size=batch_size, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])

It outputs:

Loading data...  
8982 train sequences  
2246 test sequences  
46 classes  
Vectorizing sequence data...  
X_train shape: (8982, 1000)  
X_test shape: (2246, 1000)  
Convert class vector to binary class matrix (for use with categorical_crossentropy)  
Y_train shape: (8982, 46)  
Y_test shape: (2246, 46)  
Building model...  
Epoch 1/5
8982/8982 [==============================] - 5s - loss: 1.3932 - acc: 0.6906     
Epoch 2/5
8982/8982 [==============================] - 4s - loss: 0.7522 - acc: 0.8234     
Epoch 3/5
8982/8982 [==============================] - 5s - loss: 0.5407 - acc: 0.8681     
Epoch 4/5
8982/8982 [==============================] - 5s - loss: 0.4160 - acc: 0.8980     
Epoch 5/5
8982/8982 [==============================] - 5s - loss: 0.3338 - acc: 0.9136     
Test score: 1.01453569163
Test accuracy: 0.797417631398

Finally, here is my part:

from scipy import sparse

X_train_sparse = sparse.csr_matrix(X_train)

def batch_generator(X, y, batch_size):
    n_batches_for_epoch = X.shape[0]//batch_size
    for i in range(n_batches_for_epoch):
        index_batch = range(X.shape[0])[batch_size*i:batch_size*(i+1)]       
        X_batch = X[index_batch,:].todense()
        y_batch = y[index_batch,:]
        yield(np.array(X_batch),y_batch)

model.fit_generator(generator=batch_generator(X_train_sparse, Y_train, batch_size),
                    nb_epoch=nb_epoch, 
                    samples_per_epoch=X_train_sparse.shape[0])

The crash:

Exception                                 Traceback (most recent call last)
<ipython-input-120-6722a4f77425> in <module>()  
      1 model.fit_generator(generator=batch_generator(X_trainSparse, Y_train, batch_size),  
      2                     nb_epoch=nb_epoch,
----> 3                     samples_per_epoch=X_trainSparse.shape[0])  

/home/kk/miniconda2/envs/tensorflow/lib/python2.7/site-packages/keras/models.pyc in fit_generator(self, generator, samples_per_epoch, nb_epoch, verbose, callbacks, validation_data, nb_val_samples, class_weight, max_q_size, **kwargs)  
    648                                         nb_val_samples=nb_val_samples,  
    649                                         class_weight=class_weight,  
--> 650                                         max_q_size=max_q_size)  
    651   
    652     def evaluate_generator(self, generator, val_samples, max_q_size=10, **kwargs):  

/home/kk/miniconda2/envs/tensorflow/lib/python2.7/site-packages/keras/engine/training.pyc in fit_generator(self, generator, samples_per_epoch, nb_epoch, verbose, callbacks, validation_data, nb_val_samples, class_weight, max_q_size)  
   1356                     raise Exception('output of generator should be a tuple '  
   1357                                     '(x, y, sample_weight) '  
-> 1358                                     'or (x, y). Found: ' + str(generator_output))  
   1359                 if len(generator_output) == 2:  
   1360                     x, y = generator_output  

Exception: output of generator should be a tuple (x, y, sample_weight) or (x, y). Found: None  

I believe the problem is due to a wrong setup of samples_per_epoch. I'd truly appreciate it if someone could comment on this.

The error tells you what's wrong: the batch generator is not outputting anything. The general setup doesn't look right to me either. I would expect a while 1 loop so that the generator can be called an infinite number of times; with your approach the generator will be exhausted at some point. And one more thing: once you have this working, you might run into trouble because your generator is not thread-safe. Look into the Keras issues for help on that (a minimal sketch of such a wrapper follows these comments). - sascha

Yep, it's clear that batch_generator outputs None, but why does that happen? Probably I'm missing something, but why should it be an infinite loop? Per my understanding, the loop should stop once it has gone through (almost) all samples, which is the end of an epoch. That's why I use "for i in range(n_batches_for_epoch)"; n_batches_for_epoch is actually the number of iterations. - Kirk

@Kirk I'm getting what I believe to be the exact same issue. Did you find a way to solve yours? - BigBoy1337

Sort of. I was wrong -- the loop should be indefinite, as the generator is called only once for all epochs. - Kirk
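
To expand on sascha's thread-safety remark: a pattern that comes up in the Keras issue tracker is to guard the generator with a lock so that worker threads don't interleave their next() calls. The sketch below is only illustrative; the class name ThreadSafeIterator is mine, not something from Keras or the original post.

import threading

class ThreadSafeIterator(object):
    """Wrap a generator so next() can be called from several threads
    without two threads advancing it at the same time."""
    def __init__(self, it):
        self.it = it
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def next(self):              # Python 2, as in the traceback above
        with self.lock:
            return next(self.it)

    __next__ = next              # Python 3 compatibility

It would wrap the generator at the call site, e.g. ThreadSafeIterator(batch_generator(X_train_sparse, Y_train, batch_size)).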

2 Answers

24 votes

Here is my solution.

def batch_generator(X, y, batch_size):
    # Infinite generator: fit_generator keeps requesting batches across
    # epochs, so the loop must never run dry.
    number_of_batches = np.shape(y)[0] // batch_size
    counter = 0
    shuffle_index = np.arange(np.shape(y)[0])
    np.random.shuffle(shuffle_index)
    X = X[shuffle_index, :]
    y = y[shuffle_index]
    while 1:
        index_batch = shuffle_index[batch_size*counter:batch_size*(counter+1)]
        X_batch = X[index_batch, :].todense()  # densify only the current batch
        y_batch = y[index_batch]
        counter += 1
        yield (np.array(X_batch), y_batch)
        if counter == number_of_batches:
            # end of an epoch: reshuffle and start over
            np.random.shuffle(shuffle_index)
            counter = 0

In my case, X is a sparse matrix and y is an array.
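
For completeness, a minimal sketch of calling it, assuming the Keras 1.x API used in the question and the X_train_sparse, Y_train, batch_size and nb_epoch variables defined there:

model.fit_generator(
    generator=batch_generator(X_train_sparse, Y_train, batch_size),
    nb_epoch=nb_epoch,
    samples_per_epoch=X_train_sparse.shape[0])

With samples_per_epoch set to the number of training rows, each epoch draws roughly number_of_batches batches from the infinite generator.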

-1 votes

If you can use Lasagne instead of Keras, I've written a small MLP class with the following features:

Supports both dense and sparse matrices

Supports dropout and hidden layers

Supports complete probability distributions instead of one-hot labels, so supports multilabel training

Supports a scikit-learn-like API (fit, predict, accuracy, etc.)

Is very easy to configure and modify