39
votes

My data can be viewed as a matrix of 10B entries (100M x 100), which is very sparse (< 1/100 * 1/100 of entries are non-zero). I would like to feed the data into into a Keras Neural Network model which I have made, using a Tensorflow backend.

My first thought was to expand the data to be dense, that is, write out all 10B entries into a series of CSVs, with most entries zero. However, this is quickly overwhelming my resources (even doing the ETL overwhelmed pandas and is causing postgres to struggle). So I need to use true sparse matrices.

How can I do that with Keras (and Tensorflow)? While numpy doesn't support sparse matrices, scipy and tensorflow both do. There's lots of discussion (e.g. https://github.com/fchollet/keras/pull/1886 https://github.com/fchollet/keras/pull/3695/files https://github.com/pplonski/keras-sparse-check https://groups.google.com/forum/#!topic/keras-users/odsQBcNCdZg ) about this idea - either using scipy's sparse matrixcs or going directly to Tensorflow's sparse matrices. But I can't find a clear conclusion, and I haven't been able to get anything to work (or even know clearly which way to go!).

How can I do this?

I believe there are two possible approaches:

  1. Keep it as a scipy sparse matrix, then, when giving Keras a minibatch, make it dense
  2. Keep it sparse all the way through, and use Tensorflow Sparse Tensors

I also think #2 is preferred, because you'll get much better performance all the way through (I believe), but #1 is probably easier and will be adequate. I'll be happy with either.

How can either be implemented?

2

2 Answers

17
votes

Sorry, don't have the reputation to comment, but I think you should take a look at the answer here: Keras, sparse matrix issue. I have tried it and it works correctly, just one note though, at least in my case, the shuffling led to really bad results, so I used this slightly modified non-shuffled alternative:

def nn_batch_generator(X_data, y_data, batch_size):
    samples_per_epoch = X_data.shape[0]
    number_of_batches = samples_per_epoch/batch_size
    counter=0
    index = np.arange(np.shape(y_data)[0])
    while 1:
        index_batch = index[batch_size*counter:batch_size*(counter+1)]
        X_batch = X_data[index_batch,:].todense()
        y_batch = y_data[index_batch]
        counter += 1
        yield np.array(X_batch),y_batch
        if (counter > number_of_batches):
            counter=0

It produces comparable accuracies to the ones achieved by keras's shuffled implementation (setting shuffle=True in fit).

5
votes

This answer addresses the second approach mentioned in the question. It is possible to use sparse matrices as inputs to a Keras model with the Tensorflow backend if you write a custom training loop. In the example below, the model takes a sparse matrix as an input and outputs a dense matrix.

from keras.layers import Dense, Input
from keras.models import Model
import scipy
import numpy as np

trainX = scipy.sparse.random(1024, 1024)
trainY = np.random.rand(1024, 1024)

inputs = Input(shape=(trainX.shape[1],), sparse=True)
outputs = Dense(trainY.shape[1], activation='softmax')(inputs)
model = Model(inputs=inputs, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

steps = 10
for i in range(steps):
  # For simplicity, we directly use trainX and trainY in this example
  # Usually, this is where batches are prepared
  print(model.train_on_batch(trainX, trainY))
# [3549.2546, 0.0]
# ...
# [3545.6448, 0.0009765625]

However, the usefulness of this approach depends on whether your model needs to densify the sparse matrix. Indeed, the above model has one layer which transforms the sparse matrix into a dense one. This can be a problem if your sparse matrix doesn't fit in memory.