I am new to machine learning.
I have a dataset with a continuous target, and I am trying to model that target from several features. I use the train_test_split function to separate the training and test data. I train and validate the model with the code below:
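from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # random 80/20 split
model = Sequential()
model.add(Dense(128, input_dim=X.shape[1], kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))  # single linear output for regression
model.compile(loss='mean_squared_error', optimizer='adam')  # assumed loss/optimizer; fit() requires a compiled model
hist = model.fit(X_train.values, y_train.values, validation_data=(X_test.values, y_test.values), epochs=200, batch_size=64, verbose=1)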
I get good results when I use X_test and y_test as the validation data:
https://drive.google.com/open?id=0B-9aw4q1sDcgNWt5TDhBNVZjWmc
However, when I validate the same model setup against another dataset (X_real, y_real), which is not very different from X_test and y_test except that it was not randomly chosen by train_test_split, I get bad results:
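X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = Sequential()
model.add(Dense(128, input_dim=X.shape[1], kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
model.compile(loss='mean_squared_error', optimizer='adam')  # assumed loss/optimizer; fit() requires a compiled model
# Same training data as before; only the validation set changes to (X_real, y_real)
hist = model.fit(X_train.values, y_train.values, validation_data=(X_real.values, y_real.values), epochs=200, batch_size=64, verbose=1)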
https://drive.google.com/open?id=0B-9aw4q1sDcgYWFZRU9EYzVKRFk
Is this an overfitting problem? If so, why does my model work well on the X_test and y_test generated by train_test_split?
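To check my claim that X_real is "not so different" from the randomly split data, one rough check is to compare each feature's mean in the new data against the training data, measured in training standard deviations. Below is a minimal sketch of that idea, not a definitive test. The helper name `standardized_shift` and the synthetic Gaussian columns are hypothetical stand-ins, not my real dataset.

```python
import random
import statistics


def standardized_shift(train_col, other_col):
    """Difference in means, in units of the training column's standard deviation."""
    train_col = list(train_col)
    other_col = list(other_col)
    spread = statistics.stdev(train_col)
    diff = statistics.mean(other_col) - statistics.mean(train_col)
    if spread == 0:
        # Constant training column: any difference at all is an unbounded shift
        return 0.0 if diff == 0 else float("inf")
    return diff / spread


# Synthetic stand-ins (hypothetical data, not the dataset from this question)
rng = random.Random(0)
train_feature = [rng.gauss(0.0, 1.0) for _ in range(2000)]
test_feature = [rng.gauss(0.0, 1.0) for _ in range(500)]  # same distribution, like a random split
real_feature = [rng.gauss(1.5, 1.0) for _ in range(500)]  # shifted, like separately collected data

shift_test = standardized_shift(train_feature, test_feature)
shift_real = standardized_shift(train_feature, real_feature)
print(f"random split shift:  {shift_test:+.2f}")
print(f"separate data shift: {shift_real:+.2f}")
```

With DataFrames, the same helper could be called once per column, for example `standardized_shift(X_train[col], X_real[col])`. A shift near zero for X_test but a large one for X_real would suggest the two sets are not drawn from the same distribution, which is a different problem from overfitting.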