I'm currently using a Naive Bayes algorithm to do my text classification.
My end goal is to be able to highlight parts of a big text document if the algorithm has decided the sentence belonged to a category.
Naive Bayes results are good, but I would like to train a NN for this problem, so I've followed this tutorial: http://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/ to build my LSTM network on Keras.
All these notions are quite difficult for me to understand right now, so excuse me if you see some really stupid things in my code.
1/ Preparation of the training data
I have 155 sentences of different sizes that have been tagged to a label.
All these tagged sentences are in a training.csv file:
8,9,1,2,3,4,5,6,7
16,15,4,6,10,11,12,13,14
17,18
22,19,20,21
24,20,21,23
(each integer representing a word)
And all the results are in another label.csv file:
6,7,17,15,16,18,4,27,30,30,29,14,16,20,21 ...
I have 155 lines in trainings.csv, and of course 155 integers in label.csv
My dictionnary has 1038 words.
2/ The code
Here is my current code:
total_words = 1039
## fix random seed for reproducibility
numpy.random.seed(7)
datafile = open('training.csv', 'r')
datareader = csv.reader(datafile)
data = []
for row in datareader:
data.append(row)
X = data;
Y = numpy.genfromtxt("labels.csv", dtype="int", delimiter=",")
max_sentence_length = 500
X_train = sequence.pad_sequences(X, maxlen=max_sentence_length)
X_test = sequence.pad_sequences(X, maxlen=max_sentence_length)
# create the model
embedding_vecor_length = 32
model = Sequential()
model.add(Embedding(total_words, embedding_vecor_length, input_length=max_sentence_length))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, Y, epochs=3, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(X_train, Y, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
This model is never converging:
155/155 [==============================] - 4s - loss: 0.5694 - acc: 0.0000e+00
Epoch 2/3
155/155 [==============================] - 3s - loss: -0.2561 - acc: 0.0000e+00
Epoch 3/3
155/155 [==============================] - 3s - loss: -1.7268 - acc: 0.0000e+00
I would like to have one of the 24 labels as a result, or a list of probabilities for each label.
What am I doing wrong here?
Thanks for your help!