5 votes

I'm working on training a neural network model using Python and the Keras library.

My model's test accuracy is very low (60.0%) and I have tried a lot to raise it, but I couldn't. I'm using the DEAP dataset (32 participants in total) to train the model. The splitting technique I'm using is a fixed one: 28 participants for training, 2 for validation, and 2 for testing.
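Roughly, the split is done like this (a minimal sketch; X, y, and participant_ids are placeholders for my feature matrix, labels, and the per-row participant index):

    import numpy as np

    # Fixed, participant-wise split: 28 participants for training,
    # 2 for validation, 2 for testing.
    train_ids = np.arange(0, 28)
    val_ids = np.arange(28, 30)
    test_ids = np.arange(30, 32)

    train_mask = np.isin(participant_ids, train_ids)
    val_mask = np.isin(participant_ids, val_ids)
    test_mask = np.isin(participant_ids, test_ids)

    X_train, y_train = X[train_mask], y[train_mask]
    X_val, y_val = X[val_mask], y[val_mask]
    X_test, y_test = X[test_mask], y[test_mask]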

The model I'm using is as follows (a rough sketch of the code is below the list).

  • Sequential model
  • Optimizer = Adam
  • With L2 regularization, Gaussian noise, dropout, and batch normalization
  • Number of hidden layers = 3
  • Activation = relu
  • Compile loss = categorical_crossentropy
  • Initializer = he_normal
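A stripped-down sketch of that setup looks roughly like this (the layer sizes, noise level, dropout rate, and regularization strength are placeholders, not my exact values; n_features and n_classes stand in for my input and output dimensions):

    from keras.models import Sequential
    from keras.layers import Dense, Dropout, BatchNormalization, GaussianNoise
    from keras.regularizers import l2

    model = Sequential()
    model.add(GaussianNoise(0.1, input_shape=(n_features,)))
    for units in (256, 128, 64):  # 3 hidden layers
        model.add(Dense(units, activation='relu',
                        kernel_initializer='he_normal',
                        kernel_regularizer=l2(0.01)))
        model.add(BatchNormalization())
        model.add(Dropout(0.5))
    model.add(Dense(n_classes, activation='softmax'))

    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])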

Now I'm using a train-test technique (also a fixed split) to divide the data, and I got better results. However, I found that some participants affect the training accuracy in a negative way. So I want to know: is there a way to study the effect of each piece of data (each participant) on the accuracy (performance) of the model?

Best Regards,

I don't know the details of your model, but a dataset with 32 entries seems really small for a neural network; maybe you should go simpler. Train-validate-test is the way to go for unbiased results, but if you are not doing hyper-parameter tuning a train-test split should be OK. It shouldn't change the accuracy much though (probably due to the small dataset); if you want, you can try something like k-fold cross-validation, which would use all of your data for training. You can use anomaly detection etc. to find and eliminate bad data, but since you already have a small dataset, maybe find a way to populate it? — umutto
Thanks for answering @umutto. I forgot to mention that for each participant there are 40 trials, so the total size of the dataset is (1280 x 503), where 503 is the number of features. I already tried k-fold; it also gives low accuracy, which is why I'm trying to find the bad data. Isn't adding noise or duplicating the data one of the solutions for a small dataset? — sakurami
Your question is too broad with very little info actually offered. In order for others to be able to help you, please see 'How to create a Minimal, Complete, and Verifiable example': stackoverflow.com/help/mcve — desertnaut
1280 is still small (especially with 503 features) but should work; your network should at least be overfitting, in which case you can get better results with a good regularization method. How is your training accuracy? Also, yes, adding noise and creating artificial data is helpful, but I'm not sure what kind of data augmentation method would be useful for your dataset; I guess you can start with some noise and see. Cross-validation and train-test splits relate to how you measure results; although a bad implementation could give misleading results, you should focus on your hyper-parameters. — umutto
@umutto Yes, it's still small, and when I used a higher number of features I did not get better results. Yes, the model is overfitting, and I tried to address it using dropout and L2 regularization. As I said, I'm now using a train-test split (80% training, 20% testing) and the accuracy increased to 68% for testing and 66% for training. I tried doing a grid search over the hyper-parameters with k-fold splitting, but the highest accuracy I got was 60%. — sakurami

2 Answers

0 votes

In my tutorial Starting deep learning hands-on: image classification on CIFAR-10, I insist on keeping track of both:

  • global metrics (log-loss, accuracy),
  • examples (correctly and incorrectly classified cases).

The latter may help us tell which kinds of patterns are problematic, and on numerous occasions it has helped me change the network (or supplement the training data, when that was the issue).

An example of how it works (here with Neptune, though you can do it manually in a Jupyter Notebook or using the TensorBoard image channel):

[Image: misclassified images flagged by the neural network, shown in Neptune]

Then you can look at particular examples, along with their predicted probabilities:

[Image: individual misclassified examples with their predicted probabilities]
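A minimal sketch of the same bookkeeping done manually in Keras, assuming a trained model, one-hot test labels, and an array of participant IDs for the test rows (all variable names below are placeholders); since your rows come from participants, the per-example breakdown can also be aggregated per participant:

    import numpy as np

    # Predicted probabilities and hard predictions on the held-out set.
    proba = model.predict(X_test)
    y_pred = proba.argmax(axis=1)
    y_true = y_test.argmax(axis=1)  # y_test is assumed to be one-hot

    # Inspect individual misclassified examples with their probabilities.
    misclassified = np.where(y_pred != y_true)[0]
    for i in misclassified[:10]:
        print("sample", i, "true:", y_true[i], "predicted:", y_pred[i],
              "probabilities:", np.round(proba[i], 3))

    # Error rate per participant: a participant with a much higher error
    # rate is a candidate for the "bad data" you are looking for.
    for pid in np.unique(test_participant_ids):
        mask = test_participant_ids == pid
        print("participant", pid, "error rate:",
              round(float(np.mean(y_pred[mask] != y_true[mask])), 3))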

Full disclaimer: I collaborate with deepsense.ai, the creators of Neptune - Machine Learning Lab.

0 votes

This is perhaps a broader answer than you'd like, but I hope it'll be useful nevertheless.

Neural networks are great. I like them. But the vast majority of top-performing, hyper-tuned models are ensembles; they use a combination of stats-on-crack techniques, with neural networks among them. One of the main reasons for this is that some techniques handle some situations better. In your case, you've run into a situation for which I'd recommend exploring alternative techniques.

In the case of outliers, rigorous value analyses are the first line of defense. You might also consider using principal component analysis or linear discriminant analysis. You could also try to chase them out with density estimation or nearest neighbors. There are many other techniques for handling outliers, and hopefully you'll find the tools I've pointed to easy to implement (with help from their docs); sklearn tends to readily accept data prepared for Keras.
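As a minimal sketch (not a definitive recipe), here is how you could flag suspicious trials with scikit-learn's LocalOutlierFactor, a nearest-neighbour density method, and then see which participants contribute most of them; X and participant_ids are placeholders for your feature matrix and per-row participant labels:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import LocalOutlierFactor

    # Scale features first; distance-based methods are sensitive to scale.
    X_scaled = StandardScaler().fit_transform(X)

    lof = LocalOutlierFactor(n_neighbors=20)
    flags = lof.fit_predict(X_scaled)  # -1 marks suspected outliers

    # Fraction of flagged trials per participant: participants with many
    # flagged trials are the ones worth inspecting or dropping.
    for pid in np.unique(participant_ids):
        mask = participant_ids == pid
        print(pid, round(float(np.mean(flags[mask] == -1)), 3))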