I am a biologist just finding my way into the world of deep learning, so I have read a number of books and online tutorials. In short, I am building a model in Keras for R that uses 522 variables from a dataset of 6,500 records to predict a binary class. The main code for the model is as follows:

model <- keras_model_sequential()
model %>%
  layer_dense(units = 256, activation = 'relu', input_shape = ncol(x_train),
              kernel_regularizer = regularizer_l2(0.001)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = 'relu',
              kernel_regularizer = regularizer_l2(0.001)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 2, activation = 'sigmoid')

model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = c('accuracy')
)

history <- model %>% fit(
  x_train, y_train,
  epochs = 50,
  batch_size = 150,
  validation_split = 0.20
)

acc <- model %>% evaluate(x_test, y_test)

I have tuned the number of hidden units in each layer, the batch size, and the number of epochs both lower and higher, but the accuracy I get is unsatisfactory. Specifically, the training accuracy quickly reaches 70-90% depending on the number of hidden units, but in every case the validation accuracy never goes above 30%. When I applied the model to the test set, I got an accuracy of 70%, but when I looked further at the confusion matrix, the model only seemed to predict class 1 well (sensitivity is 97%), while class 0 is poorly predicted (specificity is about 20%).

I actually ran the same data through PLS-DA with the mixOmics package and the results are rather good: on the test set I got sensitivity, specificity, and area under the curve all >= 70%.

So, I am not expecting deep learning to beat PLS-DA in this case, but I would hope it can get somewhere near it.

Can you give me some advice on how to move in the right direction and improve the deep learning model? The train and test data I am working with can be downloaded here: https://drive.google.com/file/d/1XFmTosHk5hZABFgJOHgQGLiP-DnbGHLv/view?usp=sharing https://drive.google.com/file/d/10viyKknQNolgCR45mEijF5RIxKqMK23a/view?usp=sharing

Many thanks, Ho


1 Answer


I played a little bit with your data...

Colab notebook: https://colab.research.google.com/drive/1SpkNIxCp56iATSybLZO9OhYDlK-uFXVl

A simple logistic regression (a single sigmoid output unit) gives me an AUC of 0.63 on the test data, measured with sklearn.metrics.roc_auc_score. That is with this simple model:

import tensorflow as tf
from tensorflow import keras

tf.set_random_seed(42)

model = keras.models.Sequential([
    keras.layers.Input(shape=(len(df_train.columns) - 1,)),
    keras.layers.Dense(1, activation='sigmoid',
                       kernel_regularizer=keras.regularizers.l2(0.01))
])

model.compile(keras.optimizers.Adam(lr=0.0001), 'binary_crossentropy', metrics=['acc'])
model.summary()
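
For reference, the training and scoring step looks roughly like this (a minimal sketch, not the exact notebook code: I'm assuming data_cols holds all feature columns, 'Class' is the label column, and the epoch/batch-size values here are just placeholders):

from sklearn.metrics import roc_auc_score

# Assumed split into features and label (the actual preprocessing is in the notebook)
data_cols = [c for c in df_train.columns if c != 'Class']
X_train, y_train = df_train[data_cols].values, df_train['Class'].values
X_test, y_test = df_test[data_cols].values, df_test['Class'].values

model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)

# Predicted probabilities for the positive class, then AUC on the held-out test set
y_pred = model.predict(X_test).ravel()
print('test AUC:', roc_auc_score(y_test, y_pred))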

That didn't seem to improve when I added additional hidden layers with 'relu' activations.

For instance, this model:

tf.set_random_seed(42)

model = keras.models.Sequential([
    keras.layers.Input(shape=(len(df_train.columns) - 1,)),
    keras.layers.Dense(512, activation='relu',
                       kernel_regularizer=keras.regularizers.l2(0.01)),
    keras.layers.Dense(256, activation='linear',
                       kernel_regularizer=keras.regularizers.l2(0.01)),
    keras.layers.Dense(128, activation='relu',
                       kernel_regularizer=keras.regularizers.l2(0.01)),
    keras.layers.Dense(64, activation='linear',
                       kernel_regularizer=keras.regularizers.l2(0.01)),
    keras.layers.Dense(1, activation='sigmoid',
                       kernel_regularizer=keras.regularizers.l2(0.01))
])

model.compile(keras.optimizers.Adam(lr=0.001), 'binary_crossentropy', metrics=['acc'])
model.summary()

gave me an AUC of 0.62 after 50 epochs, slightly worse than the plain logistic regression.

I understand that you asked specifically about Keras; however, I also tried a LightGBM model, and that yielded an AUC of 0.78, which is significantly better.

import lightgbm as lgb

# data_cols holds the feature columns (everything except the 'Class' label)
train_data = lgb.Dataset(df_train[data_cols].values, label=df_train['Class'])
eval_data = lgb.Dataset(df_test[data_cols].values, label=df_test['Class'],
                        reference=train_data)

param = {
    'num_leaves': 31,
    'objective': 'binary',
    'metric': 'auc',
}

# 500 boosting rounds, stopping early if the validation AUC does not improve for 5 rounds
bst = lgb.train(param, train_data, 500, valid_sets=[eval_data],
                early_stopping_rounds=5)
bst.save_model('model.txt', num_iteration=bst.best_iteration)
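
The test AUC was then computed roughly like this (again a sketch rather than the exact notebook code; for a binary objective, bst.predict returns the predicted probability of class 1):

from sklearn.metrics import roc_auc_score

# Score the test rows using the best iteration found by early stopping
y_pred = bst.predict(df_test[data_cols].values, num_iteration=bst.best_iteration)
print('test AUC:', roc_auc_score(df_test['Class'], y_pred))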

That leads me to believe that for this type of problem, gradient-boosted decision trees perform better; perhaps other members of this forum can suggest ways to achieve similar results with Keras...

Although this doesn't directly answer your Keras question, I thought it might be of interest to you.