3 votes

UPDATED:

I'm building a neural network for my final project and I need some help with it.

I'm trying to build an RNN to do sentiment analysis on Spanish text. I have about 200,000 labeled tweets, and I vectorized them using word2vec with a Spanish embedding.

Dataset & Vectorization:

  • I erased duplicates and split the dataset into training and testing sets.
  • Padding, unknown, and end-of-sentence tokens are applied when vectorizing (see the sketch after this list).
  • I mapped the @mentions to known names in the word2vec model. Example: @iamthebest => "John"
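
For reference, a minimal sketch of what this vectorization step could look like with gensim (the model filename and the vectorize_tweet helper are illustrative, not my exact code; the OOV and padding behavior follows what is described in the comments below):

import numpy as np
from gensim.models import KeyedVectors

# hypothetical path: any Spanish word2vec model in word2vec format works here
w2v = KeyedVectors.load_word2vec_format('spanish-word2vec-300.bin', binary=True)
MAX_LEN, VEC_SIZE = 20, 300

def vectorize_tweet(tokens):
    # map each token to its 300-d vector; unknown words get a random vector
    # in (-0.25, 0.25); the tweet is zero-padded up to MAX_LEN timesteps
    vecs = []
    for tok in tokens[:MAX_LEN]:
        if tok in w2v:
            vecs.append(w2v[tok])
        else:
            vecs.append(np.random.uniform(-0.25, 0.25, VEC_SIZE))
    while len(vecs) < MAX_LEN:
        vecs.append(np.zeros(VEC_SIZE))
    return np.array(vecs, dtype=np.float32)  # shape: (20, 300)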

My model:

  • My data tensor has shape = (batch_size, 20, 300).
  • I have 3 classes (neutral, positive, and negative), so my target tensor has shape = (batch_size, 3).
  • I use BasicLSTMCell cells and dynamic_rnn to build the net.
  • I use the Adam optimizer and softmax cross-entropy for the loss calculation.
  • I use a DropoutWrapper to reduce overfitting.

Last run:

  • I have tried different configurations, and none of them seems to work.
  • Last setup: 2 layers, batch size 512, 15 epochs, and a learning rate of 0.001.

[Accuracy and loss plots omitted]

Weak points for me:

I'm worried about the final layer and the handling of the final state in dynamic_rnn.

Code:

import datetime
import math

import numpy as np
import tensorflow as tf

# set variables
num_epochs = 15
tweet_size = 20
hidden_size = 200
vec_size = 300
batch_size = 512
number_of_layers = 1
number_of_classes = 3
learning_rate = 0.001

TRAIN_DIR="/checkpoints"

tf.reset_default_graph()

# Create a session 
session = tf.Session()

# Inputs placeholders
tweets = tf.placeholder(tf.float32, [None, tweet_size, vec_size], "tweets")
labels = tf.placeholder(tf.float32, [None, number_of_classes], "labels")

# Placeholder for dropout
keep_prob = tf.placeholder(tf.float32) 

# make the lstm cells, and wrap them in MultiRNNCell for multiple layers
def lstm_cell():
    cell = tf.contrib.rnn.BasicLSTMCell(hidden_size)
    return tf.contrib.rnn.DropoutWrapper(cell=cell, output_keep_prob=keep_prob)

multi_lstm_cells = tf.contrib.rnn.MultiRNNCell([lstm_cell() for _ in range(number_of_layers)], state_is_tuple=True)

# Creates a recurrent neural network
outputs, final_state = tf.nn.dynamic_rnn(multi_lstm_cells, tweets, dtype=tf.float32)

with tf.name_scope("final_layer"):
    # weight and bias to shape the final layer
    W = tf.get_variable("weight_matrix", [hidden_size, number_of_classes], tf.float32, tf.random_normal_initializer(stddev=1.0 / math.sqrt(hidden_size)))
    b = tf.get_variable("bias", [number_of_classes], initializer=tf.constant_initializer(1.0))

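    # final_state is a tuple with one LSTMStateTuple(c, h) per layer;
    # final_state[-1][-1] selects the hidden state h of the last layer,
    # shape (batch_size, hidden_size)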
    sentiments = tf.matmul(final_state[-1][-1], W) + b

prob = tf.nn.softmax(sentiments)
tf.summary.histogram('softmax', prob)

with tf.name_scope("loss"):
    # define cross entropy loss function
    losses = tf.nn.softmax_cross_entropy_with_logits(logits=sentiments, labels=labels)
    loss = tf.reduce_mean(losses)
    tf.summary.scalar("loss", loss)

with tf.name_scope("accuracy"):
    # compare the predicted class (argmax over the softmax) with the true class
    correct_prediction = tf.equal(tf.argmax(prob, 1), tf.argmax(labels, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    tf.summary.scalar("accuracy", accuracy)

# define our optimizer to minimize the loss
with tf.name_scope("train"):
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)

#tensorboard summaries
merged_summary = tf.summary.merge_all()
logdir = "tensorboard/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + "/"
writer = tf.summary.FileWriter(logdir, session.graph)

# initialize any variables
tf.global_variables_initializer().run(session=session)

# Create a saver for writing training checkpoints.
saver = tf.train.Saver()

# load our data and separate it into tweets and labels
train_tweets = np.load('data_es/train_vec_tweets.npy')
train_labels = np.load('data_es/train_vec_labels.npy')

test_tweets = np.load('data_es/test_vec_tweets.npy')
test_labels = np.load('data_es/test_vec_labels.npy')

**HERE I HAVE THE LOOP FOR TRAINING AND TESTING, I KNOW IT'S FINE**
I wonder how you formatted your data. You have 20 words per tweet. Does every tweet have exactly 20 words? Did you use padding? If so, your accuracy and loss must be masked for the padded words, and the LSTM must also be given a sequence length, for performance. Let us know. — Giuseppe Marra
The tweets are variable length. I took each tweet from the dataset, tokenized the words, and used the word2vec model to vectorize them; if a word was not in the model's vocabulary, I generated a random vector with the same shape as the model's vectors, with values in the interval (-0.25, 0.25). Then I padded each tweet with the zero vector to reach the max length (20). Is that OK? — SiM
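
Following up on that comment, a minimal sketch of how a sequence length could be passed to dynamic_rnn so the LSTM stops at the last real token instead of running over the zero padding (the seq_len computation assumes padded positions are all-zero vectors, as described above):

# padded positions are all-zero 300-d vectors, so a nonzero max marks a real token
used = tf.sign(tf.reduce_max(tf.abs(tweets), axis=2))     # (batch, 20): 1.0 for real tokens
seq_len = tf.cast(tf.reduce_sum(used, axis=1), tf.int32)  # (batch,)

# with sequence_length set, dynamic_rnn carries the state from the last real
# timestep forward and zeroes the outputs past it
outputs, final_state = tf.nn.dynamic_rnn(multi_lstm_cells, tweets,
                                         sequence_length=seq_len,
                                         dtype=tf.float32)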

1 Answer

1 vote

I have already solved my problem. After reading some papers and more trial and error, I figured out what my mistakes were.

1) Dataset: I had a large dataset, but I didn't format it properly.

  • I checked the distribution of tweet labels (neutral, positive, and negative), realized the classes were imbalanced, and balanced them (see the sketch after this list).
  • I cleaned the data further by removing URLs, hashtags, and unnecessary punctuation.
  • I shuffled the dataset prior to vectorization.
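
A minimal sketch of one way to do the balancing and shuffling (downsampling every class to the size of the smallest one; the exact strategy and the array names are assumptions, since the post does not spell them out):

import numpy as np

def balance_and_shuffle(tweets, labels):
    # keep an equal number of tweets per class, then shuffle the result
    classes, counts = np.unique(labels, return_counts=True)
    n = counts.min()
    idx = np.concatenate([np.random.choice(np.where(labels == c)[0], n, replace=False)
                          for c in classes])
    np.random.shuffle(idx)
    return tweets[idx], labels[idx]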

2) Initialization:

  • I initialized the MultiRNNCell state with zeros and changed my custom final layer to tf.contrib.layers.fully_connected, also adding explicit initialization of the bias and weight matrix. (By fixing this, I started to see better loss and accuracy plots in TensorBoard.) A sketch of both changes follows.
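
A sketch of those two changes applied to the graph from the question (the initializer choices are illustrative):

# explicit zero initial state for the stacked LSTM cells
init_state = multi_lstm_cells.zero_state(tf.shape(tweets)[0], tf.float32)
outputs, final_state = tf.nn.dynamic_rnn(multi_lstm_cells, tweets,
                                         initial_state=init_state, dtype=tf.float32)

# replace the hand-rolled final layer with tf.contrib.layers.fully_connected,
# with explicit weight and bias initializers
sentiments = tf.contrib.layers.fully_connected(
    final_state[-1].h, number_of_classes, activation_fn=None,
    weights_initializer=tf.truncated_normal_initializer(stddev=0.1),
    biases_initializer=tf.zeros_initializer())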

3) Dropout:

4) Decaying the learning rate:

  • I added an exponentially decaying learning rate, with a 10,000-step decay interval, to control overfitting (see the sketch below).
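
A sketch of that schedule with tf.train.exponential_decay; the 10,000-step interval and the 0.001 starting rate come from this post, while the 0.96 decay rate and staircase mode are illustrative:

global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(start_learning_rate, global_step,
                                           decay_steps=10000, decay_rate=0.96,
                                           staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)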

Final results:

After applying all of these changes, I achieved a test accuracy of 84%, which is acceptable because my dataset still sucks.

My final network config was:

  • num_epochs = 20
  • tweet_size = 20
  • hidden_size = 400
  • vec_size = 300
  • batch_size = 512
  • number_of_layers = 2
  • number_of_classes = 3
  • start_learning_rate = 0.001