In this example I have a hypothetical balanced dataset containing several attributes about college students and one target attribute indicating whether they passed an exam (0 = fail, 1 = pass). I have created and fit a gradient boosting model (XGBoost via its scikit-learn-style API) on 75% of my original dataset (roughly 18,000 records) and am seeing 80% accuracy and 91.6% precision on my holdout set (4,700 records) for the students who failed the exam.
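For context, my current workflow looks roughly like this (the file name, the 'passed' target column, and the default XGBClassifier settings are placeholders for my actual setup):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score
from xgboost import XGBClassifier

# Dataset A: student attributes plus a binary 'passed' target (0 = fail, 1 = pass)
original_data = pd.read_csv('students.csv')
X = original_data.drop(columns=['passed'])
y = original_data['passed']

# 75%/25% split of dataset A
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

xg_class = XGBClassifier()
xg_class.fit(X_train, y_train)

preds = xg_class.predict(X_test)
print(accuracy_score(y_test, preds))                # ~0.80 on my holdout set
print(precision_score(y_test, preds, pos_label=0))  # ~0.916 precision for the failing class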
At this point, I would very much like to use 100% of this dataset as training data and use a new, balanced set of 2,000 student records as test data; in other words, I want to make predictions for dataset B based on a model trained on dataset A. Ultimately, I would like to present these predictions to my boss as a way to validate my work, and then begin feeding new data to the model in order to predict how future students might perform on the exam. I am currently stuck on how to go about using my entire original dataset as training material and the entire new dataset as test material.
I have attempted to use

# X = the original data minus the target feature, y = the target feature only (as above)
# Keep essentially all of dataset A for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.00001, random_state=0)

and

# N = the new data minus the target feature, z = the target feature only
N = new_data.drop(columns=['passed'])   # 'passed' is a placeholder for my target column
z = new_data['passed']
# Keep essentially all of dataset B for testing
N_train, N_test, z_train, z_test = train_test_split(N, z, test_size=0.999, random_state=0)
to create my train and test variables. I then attempt to fit the model and pass the new records to it using:
# Fit the model on the (nearly complete) original X and y data
xg_class.fit(X_train, y_train)
# Generate predictions for the new dataset's features (N_test)
new_preds = xg_class.predict(N_test)
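For what it's worth, a quick shape check suggests the splits at least do what I intended size-wise (nearly all of dataset A kept for training, nearly all of dataset B kept for testing):

# X_train should hold almost every row of dataset A; N_test almost every row of dataset B
print(X_train.shape, X_test.shape)
print(N_train.shape, N_test.shape)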
The code above runs without any errors, but my results are far lower than my initial results from splitting dataset A:
Accuracy (75%/25% split of dataset A): 79%
Precision (75%/25% split of dataset A): 91.1% TP / 71.5% TN
Accuracy (trained on ~100% of dataset A, tested on dataset B): 45%
Precision (trained on ~100% of dataset A, tested on dataset B): 18.7% TP / 62.4% TN
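To be clear about the end state I'm after, I assume it is equivalent to skipping the splits entirely, i.e. training on every row of dataset A and scoring every row of dataset B. A minimal sketch, assuming both datasets have identical feature columns in the same order:

from sklearn.metrics import classification_report

# Train on 100% of dataset A
xg_class.fit(X, y)

# Predict and evaluate on 100% of dataset B
new_preds = xg_class.predict(N)
print(classification_report(z, new_preds))   # per-class precision/recall for pass and fail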
Is this drop due to the disparity in size of one or both of my datasets, or is it to be expected? From what I'm reading, this could be a methodology issue arising from using two separate datasets for training and testing. However, if that were the case, I don't see what the point of building a model would be, since it could never be fed new data with any reasonable expectation of success. I obviously don't believe that to be true, but I haven't found any information in my searching about how to perform this part of model evaluation. If anyone could offer some general insight, it would be appreciated.