
I'm currently using a Naive Bayes algorithm for my text classification.

My end goal is to highlight parts of a large text document when the algorithm decides a sentence belongs to a given category.

The Naive Bayes results are good, but I would like to train a neural network for this problem, so I followed this tutorial to build an LSTM network in Keras: http://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

All these notions are quite difficult for me to understand right now, so excuse me if you see some really stupid things in my code.

1/ Preparation of the training data

I have 155 sentences of varying length that have each been tagged with a label.

All these tagged sentences are in a training.csv file:

8,9,1,2,3,4,5,6,7
16,15,4,6,10,11,12,13,14
17,18
22,19,20,21
24,20,21,23

(each integer representing a word)

And all the labels are in a separate labels.csv file:

6,7,17,15,16,18,4,27,30,30,29,14,16,20,21 ...

I have 155 lines in training.csv and, of course, 155 integers in labels.csv.

My dictionary has 1038 words.
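
For context, here is a minimal sketch of one common way to produce such an encoding, using Keras' Tokenizer (the sample sentences below are made up, and my actual preprocessing is not shown here):

from keras.preprocessing.text import Tokenizer

# hypothetical raw sentences; the real ones are tagged by hand
sentences = ["the contract starts in january",
             "payment is due within thirty days"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)            # builds the word -> integer dictionary
encoded = tokenizer.texts_to_sequences(sentences)
print(encoded)                               # one list of integers per sentence
print(len(tokenizer.word_index))             # dictionary size (1038 for my real corpus)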

2/ The code

Here is my current code:

import csv
import numpy
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.preprocessing import sequence

total_words = 1039  # 1038 dictionary words + 1 for the padding index

# fix random seed for reproducibility
numpy.random.seed(7)

# read the integer-encoded sentences, one sentence per row
with open('training.csv', 'r') as datafile:
    datareader = csv.reader(datafile)
    data = [[int(word) for word in row] for row in datareader]

X = data
Y = numpy.genfromtxt("labels.csv", dtype="int", delimiter=",")

max_sentence_length = 500

# pad every sentence to the same length
X_train = sequence.pad_sequences(X, maxlen=max_sentence_length)
X_test = sequence.pad_sequences(X, maxlen=max_sentence_length)

# create the model
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(total_words, embedding_vector_length, input_length=max_sentence_length))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, Y, epochs=3, batch_size=64)

# final evaluation of the model
scores = model.evaluate(X_train, Y, verbose=0)
print("Accuracy: %.2f%%" % (scores[1] * 100))

The model never converges:

155/155 [==============================] - 4s - loss: 0.5694 - acc: 0.0000e+00     
Epoch 2/3
155/155 [==============================] - 3s - loss: -0.2561 - acc: 0.0000e+00     
Epoch 3/3
155/155 [==============================] - 3s - loss: -1.7268 - acc: 0.0000e+00  

I would like to get one of the 24 labels as the result, or a list of probabilities, one for each label.

What am I doing wrong here?

Thanks for your help!

I can't comment, so I'll leave this as an answer: this may be helpful: stackoverflow.com/questions/37543132/… – Mancento
The problem is that your categories (Y) are not binary. Binary cross-entropy is used for two-category classification where the Y values are binary. – DJK

1 Answer


I've updated my code thanks to the great comments posted to my question.

import numpy
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.preprocessing import sequence
from keras.utils import np_utils

top_words = 1039  # vocabulary size, as in the question

# integer labels -> one-hot vectors, so categorical_crossentropy can be used
Y_train = numpy.genfromtxt("labels.csv", dtype="int", delimiter=",")
Y_test = numpy.genfromtxt("labels_test.csv", dtype="int", delimiter=",")
Y_train = np_utils.to_categorical(Y_train)
Y_test = np_utils.to_categorical(Y_test)

max_review_length = 50

# X_train / X_test hold the integer-encoded sentences, loaded as in the question
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_review_length))
model.add(LSTM(10, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(31, activation="softmax"))  # one output per label; probabilities sum to 1
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])

model.fit(X_train, Y_train, epochs=100, batch_size=30)
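
To get one of the labels back as the final answer (or the full probability list), the softmax output can be read directly. A minimal sketch (the variable names here are mine):

import numpy as np

# one row per sentence, one probability per label; each row sums to 1
probabilities = model.predict(X_test)
print(probabilities[0])

# the predicted label is the index of the highest probability
predicted_labels = np.argmax(probabilities, axis=1)
print(predicted_labels[0])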

I can still experiment with the LSTM size (10 vs. 100), the number of epochs, and the batch size.

The model's accuracy is still very poor (around 40%), but I currently think that's because I don't have enough data (155 sentences for 24 labels).
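
Also, with so few sentences a single train/test split gives a noisy accuracy figure, so a cross-validated estimate may be more reliable. Here is a sketch using scikit-learn's StratifiedKFold, assuming X holds all the padded sequences, y the integer labels (before one-hot encoding), and every label occurs at least n_splits times:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.utils import np_utils

def build_model():
    # rebuild the same architecture as above, fresh for each fold
    m = Sequential()
    m.add(Embedding(top_words, 32, input_length=max_review_length))
    m.add(LSTM(10, dropout=0.2, recurrent_dropout=0.2))
    m.add(Dense(31, activation="softmax"))
    m.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
    return m

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True).split(X, y):
    fold_model = build_model()
    fold_model.fit(X[train_idx], np_utils.to_categorical(y[train_idx], 31),
                   epochs=100, batch_size=30, verbose=0)
    _, acc = fold_model.evaluate(X[test_idx], np_utils.to_categorical(y[test_idx], 31),
                                 verbose=0)
    scores.append(acc)

print("mean accuracy over folds: %.2f%%" % (100 * np.mean(scores)))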

I will put this project on hold until I get more data.

If someone has some ideas to improve this code, feel free to comment!