
I am working on a document classification problem using CNN/LSTM models and sentence embeddings generated with the Universal Sentence Encoder. I have 10,000 records, and each record contains roughly 100 to 600 sentences. Before feeding the documents into the neural network models, I save all of the document embedding matrices into a single JSON file. That file is about 20 GB, which takes far too much memory to load.

I'm not sure whether I should instead store the documents as plain text and convert them into sentence embeddings on the fly during training. What would be a good solution?
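
For reference, here is a rough sketch of the on-the-fly option I have in mind (the TensorFlow Hub URL and the helper name are just illustrative):

import tensorflow_hub as hub

# Load the Universal Sentence Encoder once at startup.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def document_to_matrix(sentences):
    # Encode one document's sentences into a (num_sentences, 512) float32 matrix.
    return embed(sentences).numpy()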

I have the feeling that JSON isn't well suited for this. Would a binary format such as NumPy's work? HDF5 is commonly used for images, so you could try that as well. You would probably store each document's embeddings in a separate file, but I think converting ahead of training will definitely help you a lot. - Jan
@Jan Thank you for your advice. I saved the embeddings in separate pickle files and that solved the problem. I also tried HDF5, but it doesn't seem to make much difference in my case. - Qian
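
A minimal sketch of the binary-format alternatives mentioned in the comments (the file names, dataset key, and stand-in matrix are illustrative only):

import numpy as np
import h5py

# Stand-in for one document's embedding matrix (num_sentences x 512).
embedding_matrix = np.random.rand(400, 512).astype(np.float32)

# Option 1: one .npy file per document (binary, memory-mappable).
np.save("doc_000123.npy", embedding_matrix)
mat = np.load("doc_000123.npy", mmap_mode="r")   # read lazily, without loading everything

# Option 2: a single HDF5 file with one dataset per document.
with h5py.File("embeddings.h5", "a") as f:
    f.create_dataset("doc_000123", data=embedding_matrix, compression="gzip")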

1 Answer


Posting the solution from the comments section here as an answer, for the benefit of the community.

Saving the embeddings in separate pickle files, rather than one large JSON file, resolved the problem.
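
A minimal sketch of that approach, assuming one pickle file per document and a plain Python generator that streams padded batches during training (the directory name, shapes, and helper names are assumptions, not the original poster's code):

import os
import pickle
import numpy as np

EMB_DIR = "doc_embeddings"          # hypothetical output directory
os.makedirs(EMB_DIR, exist_ok=True)

def save_document_embedding(doc_id, embedding_matrix):
    # Write one document's (num_sentences, 512) matrix to its own pickle file.
    path = os.path.join(EMB_DIR, f"{doc_id}.pkl")
    with open(path, "wb") as f:
        pickle.dump(embedding_matrix.astype(np.float32), f, protocol=pickle.HIGHEST_PROTOCOL)

def load_document_embedding(doc_id):
    # Load a single document's matrix back into memory only when it is needed.
    path = os.path.join(EMB_DIR, f"{doc_id}.pkl")
    with open(path, "rb") as f:
        return pickle.load(f)

def batch_generator(doc_ids, labels, batch_size, max_sentences=600):
    # Yield padded (batch, sentences, 512) arrays so the full 20 GB never sits in RAM.
    while True:
        for start in range(0, len(doc_ids), batch_size):
            ids = doc_ids[start:start + batch_size]
            mats = [load_document_embedding(i)[:max_sentences] for i in ids]
            longest = max(m.shape[0] for m in mats)
            batch = np.zeros((len(mats), longest, mats[0].shape[1]), dtype=np.float32)
            for j, m in enumerate(mats):
                batch[j, :m.shape[0], :] = m
            yield batch, np.asarray(labels[start:start + batch_size])

The generator keeps only one batch of document matrices in memory at a time, which is what removes the need to hold the whole 20 GB at once.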