1 vote

I am trying to use a CNN architecture to classify text sentences. The architecture of the network is as follows:

from tensorflow.keras.layers import Input, Conv1D, Dropout, MaxPooling1D, Dense, Flatten
from tensorflow.keras.models import Model

text_input = Input(shape=X_train_vec.shape[1:], name="Text_input")

conv2 = Conv1D(filters=128, kernel_size=5, activation='relu')(text_input)
drop21 = Dropout(0.5)(conv2)
pool1 = MaxPooling1D(pool_size=2)(drop21)
conv22 = Conv1D(filters=64, kernel_size=5, activation='relu')(pool1)
drop22 = Dropout(0.5)(conv22)
pool2 = MaxPooling1D(pool_size=2)(drop22)
dense = Dense(16, activation='relu')(pool2)

flat = Flatten()(dense)
dense = Dense(128, activation='relu')(flat)
out = Dense(32, activation='relu')(dense)

outputs = Dense(y_train.shape[1], activation='softmax')(out)

model = Model(inputs=text_input, outputs=outputs)
# compile
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

I also use some callbacks, such as EarlyStopping and ReduceLROnPlateau, to stop the training and to reduce the learning rate when the validation loss stops improving:

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

early_stopping = EarlyStopping(monitor='val_loss',
                               patience=5)
model_checkpoint = ModelCheckpoint(filepath=checkpoint_filepath,
                                   save_weights_only=False,
                                   monitor='val_loss',
                                   mode="auto",
                                   save_best_only=True)
learning_rate_decay = ReduceLROnPlateau(monitor='val_loss', 
                                        factor=0.1, 
                                        patience=2, 
                                        verbose=1, 
                                        mode='auto',
                                        min_delta=0.0001, 
                                        cooldown=0,
                                        min_lr=0)
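
For reference, these callbacks are passed to model.fit roughly like this (the batch size and validation split below are just illustrative values, not my exact settings):

history = model.fit(X_train_vec, y_train,
                    validation_split=0.2,  # or a dedicated validation set
                    epochs=50,
                    batch_size=64,
                    callbacks=[early_stopping, model_checkpoint, learning_rate_decay])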

Once the model is trained, the training history looks as follows:

[training history plot: training vs. validation loss/accuracy per epoch]

We can observe that the validation loss stops improving from epoch 5 onwards, while the training loss keeps decreasing, i.e. the model is overfitting the training set.

I would like to know if I am doing something wrong in the architecture of the CNN. Aren't the dropout layers enough to avoid overfitting? What are other ways to reduce overfitting?

Any suggestions?

Thanks in advance.


Edit:

I have also tried regularization, and the results were even worse:

kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01)
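
For example, adding them to the first convolutional layer looks like this (the Dense layers accept the same arguments; where exactly they are added is a choice):

from tensorflow.keras.regularizers import l2

conv2 = Conv1D(filters=128, kernel_size=5, activation='relu',
               kernel_regularizer=l2(0.01),
               bias_regularizer=l2(0.01))(text_input)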

[training history plot with L2 regularization]


Edit 2:

I have tried applying BatchNormalization layers after each convolution, and the result is the following:

norm = BatchNormalization()(conv2)
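
More completely, the wiring looks roughly like this, with the normalization between each convolution and its dropout (placing it before the dropout is simply the option shown here):

from tensorflow.keras.layers import BatchNormalization

conv2 = Conv1D(filters=128, kernel_size=5, activation='relu')(text_input)
norm1 = BatchNormalization()(conv2)
drop21 = Dropout(0.5)(norm1)
pool1 = MaxPooling1D(pool_size=2)(drop21)
conv22 = Conv1D(filters=64, kernel_size=5, activation='relu')(pool1)
norm2 = BatchNormalization()(conv22)
drop22 = Dropout(0.5)(norm2)
pool2 = MaxPooling1D(pool_size=2)(drop22)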

[training history plot with BatchNormalization]


Edit 3:

After applying the LSTM architecture:

from tensorflow.keras.layers import Bidirectional, LSTM

text_input = Input(shape=X_train_vec.shape[1:], name="Text_input")

conv2 = Conv1D(filters=128, kernel_size=5, activation='relu')(text_input)
drop21 = Dropout(0.5)(conv2)
conv22 = Conv1D(filters=64, kernel_size=5, activation='relu')(drop21)
drop22 = Dropout(0.5)(conv22)

lstm1 = Bidirectional(LSTM(128, return_sequences = True))(drop22)
lstm2 = Bidirectional(LSTM(64, return_sequences = True))(lstm1)

flat = Flatten()(lstm2)
dense = Dense(128, activation='relu')(flat)
out = Dense(32, activation='relu')(dense)

outputs = Dense(y_train.shape[1], activation='softmax')(out)

model = Model(inputs=text_input, outputs=outputs)
# compile
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

[training history plot with the CNN + BiLSTM model]

Comments:
The most important question is: "How large is your dataset?" This seems like a very small dataset. If that is the case, your primary reaction should be to gather more data. – Felix Kleine Bösing
It is not small. I have about 40,000 samples. – IMB
Then there must be some other problem. Since the validation loss is barely decreasing, it's not learning anything useful. How many classes do you have? Are your classes balanced? – Felix Kleine Bösing
There are two classes and the distribution is 58% / 42%. – IMB
Hmm. It's hard to guess without further information, but I still suppose that there might be something wrong with the data. Is this a custom project? What are you trying to predict? – Felix Kleine Bösing

4 Answers

2 votes

Overfitting can be caused by many factors; it happens when your model fits the training set too well.

To handle it, you can try the following:

  1. Add more data.
  2. Use data augmentation.
  3. Use architectures that generalize well.
  4. Add regularization (mostly dropout; L1/L2 regularization is also possible).
  5. Reduce the architecture's complexity.

For more details, you can read https://towardsdatascience.com/deep-learning-3-more-on-cnns-handling-overfitting-2bd5d99abe5d
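
As a rough illustration of points 4 and 5 applied to the model in the question (not a tuned model; the filter counts and the 0.01 L2 factor are only placeholders):

from tensorflow.keras.layers import Input, Conv1D, Dropout, MaxPooling1D, GlobalMaxPooling1D, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

text_input = Input(shape=X_train_vec.shape[1:], name="Text_input")
x = Conv1D(filters=64, kernel_size=5, activation='relu', kernel_regularizer=l2(0.01))(text_input)
x = Dropout(0.5)(x)
x = MaxPooling1D(pool_size=2)(x)
x = Conv1D(filters=32, kernel_size=5, activation='relu', kernel_regularizer=l2(0.01))(x)
x = Dropout(0.5)(x)
x = GlobalMaxPooling1D()(x)  # replaces Flatten + the large Dense stack
x = Dense(64, activation='relu')(x)
outputs = Dense(y_train.shape[1], activation='softmax')(x)
model = Model(inputs=text_input, outputs=outputs)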

1 vote

This is screaming transfer learning. The google-universal-sentence-encoder is perfect for this use case. Replace your model with:

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the ops required by the multilingual encoder
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

text_input = Input(shape=X_train_vec.shape[1:], name = "Text_input")

# this next layer might need some tweaking dimension wise, to correctly fit
# X_train in the model; keep `text_input` untouched so it can be passed to Model() below
squeezed = tf.keras.layers.Lambda(lambda x: tf.squeeze(x))(text_input)
# conv2 = Conv1D(filters=128, kernel_size=5, activation='relu')(text_input)
# drop21 = Dropout(0.5)(conv2)
# pool1 = MaxPooling1D(pool_size=2)(drop21)
# conv22 = Conv1D(filters=64, kernel_size=5, activation='relu')(pool1)
# drop22 = Dropout(0.5)(conv22)
# pool2 = MaxPooling1D(pool_size=2)(drop22)

# 1) you might need `squeezed = tf.expand_dims(squeezed, axis=0)` here
# 2) If you're classifying English only, you can use the link to the normal `google-universal-sentence-encoder`, not the multilingual one
# 3) both the English and multilingual have a `-large` version. More accurate but slower to train and infer. 
embedded = hub.KerasLayer('https://tfhub.dev/google/universal-sentence-encoder-multilingual/3')(squeezed)

# this layer seems out of place, 
# dense = Dense(16, activation='relu')(embedded) 

# you don't need to flatten after a dense layer (in your case) or a backbone (in my case (google-universal-sentence-encoder))
# flat = Flatten()(dense)

dense = Dense(128, activation='relu')(embedded)
out = Dense(32, activation='relu')(dense)

outputs = Dense(y_train.shape[1], activation='softmax')(out)

model = Model(inputs=text_input, outputs=outputs)
0 votes

I think that since you are doing text classification, adding one or two LSTM layers might help the network learn better, since it will be able to better associate with the context of the data. I suggest adding the following code before the Flatten layer.

lstm1 = Bidirectional(LSTM(128, return_sequences=True))(drop22)  # note: return_sequences (plural)
lstm2 = Bidirectional(LSTM(64))(lstm1)

LSTM layers can help a neural network learn associations between certain words and might improve the accuracy of your network.

I also suggest dropping the max-pooling layers, as max pooling, especially in text classification, can lead the network to drop some useful features. Just keep the convolutional layers and the dropout. Also remove the Dense layer before Flatten and add the aforementioned LSTMs.

0 votes

It is unclear how you feed the text into your model. I am assuming that you tokenize the text to represent it as a sequence of integers, but do you use any word embedding prior to feeding it into your model? If not, I suggest you put a trainable TensorFlow Embedding layer at the start of your model. There is a clever technique called embedding lookup to speed up its training, but you can save that for later. Try adding this layer to your model; then your Conv1D layers would have a much easier time working on a sequence of floats. Also, I suggest adding BatchNormalization after each Conv1D; it should help to speed up convergence and training.
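
A rough sketch of what I mean, assuming your inputs are padded integer token sequences (vocab_size, maxlen, and the embedding dimension are placeholders to take from your own tokenizer and padding step):

from tensorflow.keras.layers import (Input, Embedding, Conv1D, BatchNormalization,
                                     Dropout, GlobalMaxPooling1D, Dense)
from tensorflow.keras.models import Model

vocab_size, maxlen, embed_dim = 20000, 100, 128  # hypothetical values

text_input = Input(shape=(maxlen,), name="Text_input")
x = Embedding(input_dim=vocab_size, output_dim=embed_dim)(text_input)  # trainable by default
x = Conv1D(filters=128, kernel_size=5, activation='relu')(x)
x = BatchNormalization()(x)
x = Dropout(0.5)(x)
x = Conv1D(filters=64, kernel_size=5, activation='relu')(x)
x = BatchNormalization()(x)
x = Dropout(0.5)(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
outputs = Dense(y_train.shape[1], activation='softmax')(x)
model = Model(inputs=text_input, outputs=outputs)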