In this example I have a hypothetical balanced dataset containing several attributes about college students and one target attribute indicating whether they passed an exam (0 = fail, 1 = pass). I have created and fit a gradient boosting model (XGBoost via its scikit-learn-style API) on 75% of my original dataset (roughly 18,000 records) and am seeing 80% accuracy and 91.6% precision on my holdout set (4,700 records) for the students who failed the exam.
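For context, my current workflow looks roughly like this (the file name, the 'passed' target column, and the default XGBClassifier settings are placeholders for my actual setup):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score
from xgboost import XGBClassifier

# Dataset A: student attributes plus a binary 'passed' target (0 = fail, 1 = pass)
original_data = pd.read_csv('students.csv')
X = original_data.drop(columns=['passed'])
y = original_data['passed']

# 75%/25% split of dataset A
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

xg_class = XGBClassifier()
xg_class.fit(X_train, y_train)

preds = xg_class.predict(X_test)
print(accuracy_score(y_test, preds))                # ~0.80 on my holdout set
print(precision_score(y_test, preds, pos_label=0))  # ~0.916 precision for the failing class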
At this point, I would very much like to use 100% of this dataset as training data and use a new, balanced set of 2,000 student records as test data; in other words, I want to make predictions for dataset B based on a model trained on dataset A. Ultimately, I would like to present these predictions to my boss as a way to validate my work, and then begin feeding new data to the model in order to predict how future students might perform on the exam. I am currently stuck on how to go about using my entire original dataset as training material and the entire new dataset as test material.
I have attempted to use

# X = the original data minus the target feature, y = the target feature only (as above)
# Keep essentially all of dataset A for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.00001, random_state=0)

and

# N = the new data minus the target feature, z = the target feature only
N = new_data.drop(columns=['passed'])   # 'passed' is a placeholder for my target column
z = new_data['passed']
# Keep essentially all of dataset B for testing
N_train, N_test, z_train, z_test = train_test_split(N, z, test_size=0.999, random_state=0)
to create my train and test variables. I then attempt to fit the model and pass the new records to it using:
# Fit the model on the (nearly complete) original X and y data
xg_class.fit(X_train, y_train)
# Generate predictions for the new dataset's features (N_test)
new_preds = xg_class.predict(N_test)
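For what it's worth, a quick shape check suggests the splits at least do what I intended size-wise (nearly all of dataset A kept for training, nearly all of dataset B kept for testing):

# X_train should hold almost every row of dataset A; N_test almost every row of dataset B
print(X_train.shape, X_test.shape)
print(N_train.shape, N_test.shape)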
The code above runs without any errors, but my results are far lower than my initial results from splitting dataset A:
Accuracy (75%/25% split of dataset A): 79%
Precision (75%/25% split of dataset A): 91.1% TP / 71.5% TN
Accuracy (trained on ~100% of dataset A, tested on dataset B): 45%
Precision (trained on ~100% of dataset A, tested on dataset B): 18.7% TP / 62.4% TN
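To be clear about the end state I'm after, I assume it is equivalent to skipping the splits entirely, i.e. training on every row of dataset A and scoring every row of dataset B. A minimal sketch, assuming both datasets have identical feature columns in the same order:

from sklearn.metrics import classification_report

# Train on 100% of dataset A
xg_class.fit(X, y)

# Predict and evaluate on 100% of dataset B
new_preds = xg_class.predict(N)
print(classification_report(z, new_preds))   # per-class precision/recall for pass and fail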
Is this drop due to the disparity in size of one or both of my datasets, or is it to be expected? From what I'm reading, this could be a methodology issue arising from using two separate datasets for training and testing. However, if that were the case, I don't see what the point of building a model would be, since it could never be fed new data with any reasonable expectation of success. I obviously don't believe that to be true, but I haven't found any information in my searching about how to perform this part of model evaluation. If anyone could offer some general insight, it would be appreciated.