2
votes

I am new to machine learning.

I have a dataset with continuous values, and I am trying to model the target label from several features. I use the train_test_split function to separate the training and test data, and I train and test the model with the code below:

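from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = Sequential()
model.add(Dense(128, input_dim=X.shape[1], kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
# compile() is not shown in the original post; the loss and optimizer here are assumed
model.compile(loss='mean_squared_error', optimizer='adam')
hist = model.fit(X_train.values, y_train.values, validation_data=(X_test.values, y_test.values), epochs=200, batch_size=64, verbose=1)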

I can get good results when I use X_test and y_test for validation data:

https://drive.google.com/open?id=0B-9aw4q1sDcgNWt5TDhBNVZjWmc

However, when I evaluate the same model on another dataset (X_real, y_real), which is not very different from X_test and y_test except that it was not randomly chosen by train_test_split, I get bad results:

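X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = Sequential()
model.add(Dense(128, input_dim=X.shape[1], kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
# compile() is not shown in the original post; the loss and optimizer here are assumed
model.compile(loss='mean_squared_error', optimizer='adam')
# Same training data as before, but validated against the separately collected set
hist = model.fit(X_train.values, y_train.values, validation_data=(X_real.values, y_real.values), epochs=200, batch_size=64, verbose=1)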

https://drive.google.com/open?id=0B-9aw4q1sDcgYWFZRU9EYzVKRFk

Is this an overfitting issue? If so, why does my model work well on the X_test and y_test generated by train_test_split?

2 Answers

1
votes

It seems that your "real data" differs from your training and test data. Why do you have separate "real" and "training" data in the first place?

My approach would be:

1: Mix all the data you have together.

2: Divide your data randomly into 3 sets (train, test and validate).

3: Use train and test as you do now and optimize your model.

4: When it's good enough, check the model against your validation set to make sure it has not overfit.
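The steps above could be sketched roughly like this. It is only an illustration in plain Python: the function name three_way_split, the fractions, the seed and the toy rows are all made up for the example, and "test"/"validate" follow the naming used in this answer. With scikit-learn you could get the same effect by calling train_test_split twice.

```python
import random

def three_way_split(rows, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle rows and cut them into (train, test, validate) lists.

    Naming follows the steps above: "test" is used while tuning,
    "validate" is held back for the final check.
    """
    rows = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)    # step 1: mix all the data together
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]                 # step 2: three disjoint slices
    validate = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, test, validate

# Hypothetical stand-in for the pooled (X, y) and (X_real, y_real) rows.
pooled = [(i, 2 * i) for i in range(100)]
train, test, validate = three_way_split(pooled)
print(len(train), len(test), len(validate))
```

The important part is that the "real" rows are pooled in before shuffling, so all three sets are drawn the same way.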

1
votes

If you have little data, I would suggest trying a different algorithm. Neural networks generally need a lot of data to get the weights right. Also, your real data doesn't seem to come from the same distribution as the train and test data. Don't hold anything back: shuffle everything together and use train/validation/test splits.
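One quick way to check whether the real data comes from a different distribution is to compare each feature's mean in the two sets, measured in units of the training standard deviation. This is a rough sketch, not a formal test: the function name standardized_mean_shift and the numbers are invented for illustration, and what counts as a "large" shift is a judgment call.

```python
import statistics

def standardized_mean_shift(train_col, real_col):
    """Difference of means in units of the training standard deviation."""
    mu = statistics.mean(train_col)
    sigma = statistics.stdev(train_col)
    if sigma == 0:
        # Constant training feature: any difference in mean is an unbounded shift
        return 0.0 if statistics.mean(real_col) == mu else float("inf")
    return (statistics.mean(real_col) - mu) / sigma

# Hypothetical values of one feature in each set.
train_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
real_feature = [6.0, 7.0, 8.0, 9.0, 10.0]
shift = standardized_mean_shift(train_feature, real_feature)
print(f"shift = {shift:.2f} training standard deviations")
```

A shift of several standard deviations on some feature would suggest the model is being asked to extrapolate beyond what it saw during training.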