
In this example I have a hypothetical balanced dataset containing several attributes about college students and one target attribute indicating whether they passed their exam (0 = fail, 1 = pass). I have created and fit a gradient-boosting model (XGBoost via its scikit-learn API) on 75% of my original dataset (~18,000 records) and am seeing 80% accuracy and 91.6% precision on my holdout set (4,700 records) with respect to students who failed the exam.
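For context, the initial model and the 75/25 evaluation were set up roughly along these lines (the DataFrame and column names below are placeholders, not my actual data):

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score

# df is the original student DataFrame; "passed_exam" is the target (0 = fail, 1 = pass)
X = df.drop(columns=["passed_exam"])
y = df["passed_exam"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

xg_class = XGBClassifier()
xg_class.fit(X_train, y_train)

preds = xg_class.predict(X_test)
print(accuracy_score(y_test, preds))                 # overall accuracy
print(precision_score(y_test, preds, pos_label=0))   # precision for the "fail" class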

At this point, I would like to use 100% of this dataset as training data and use a new, balanced set of 2,000 student records as test data. In other words, I want to make predictions for dataset B based on training on dataset A. Ultimately, I would like to present these predictions to my boss as a way to validate my work, and then begin feeding new data to the model in order to predict how future students might perform on that exam. I am currently stuck on how to use my entire original dataset as training material and the entire new dataset as testing material.

I have attempted to use

from sklearn.model_selection import train_test_split

# X = original data minus the target feature
# y = original data target feature only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.00001, random_state=0)

and

# N = new data minus the target feature
# z = new data target feature only
N_train, N_test, z_train, z_test = train_test_split(N, z, test_size=0.999, random_state=0)

to create my train and test variables. I then attempt to fit the model and pass new records to it using:

# Fit model with original X and y data
xg_class.fit(X_train, y_train)

# Generate predictions for the new data (N_test)
new_preds = xg_class.predict(N_test)

I'm not getting any errors, but my results are far lower than my initial results from splitting dataset A:

Accuracy (75%/25% split of dataset A):  79%
Precision (75%/25% split of dataset A): 91.1% TP / 71.5% TN

Accuracy (99% trained dataset A, tested dataset B): 45%
Precision (99% trained dataset A, tested dataset B): 18.7% TP / 62.4% TN
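(For reference, the comparison numbers above were computed with something like the following, reusing the variables from the snippets above and scikit-learn's metric helpers:)

from sklearn.metrics import accuracy_score, precision_score

print(accuracy_score(z_test, new_preds))
print(precision_score(z_test, new_preds, pos_label=1))  # precision on the "pass" class
print(precision_score(z_test, new_preds, pos_label=0))  # precision on the "fail" class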

Is this due to the disparity in size between the two datasets, or is it to be expected? From what I'm reading, this could be a methodology issue with using two separate datasets for training and testing. However, if that is the case, I don't see what the point of building a model would be, since it couldn't be fed new data with any reasonable expectation of success. I obviously don't believe that to be true, but I haven't found any information through my searching about how to perform this part of model evaluation. If anyone could offer some general insight, that would be appreciated.


1 Answer


It turns out part one of my question has an easy answer: do not use train_test_split(). Assign your algorithm to a variable (e.g. model) and fit it with all of the data, in the same way you would fit the split data:

model.fit(X, y)

You then pass the new data to it (for example, N as the feature data and z as the labels):

new_predictions = model.predict(N)
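If you also have labels for the new data (z here), you can score those predictions the same way you scored your original holdout set; a minimal sketch, assuming scikit-learn's metric helpers:

from sklearn.metrics import accuracy_score, precision_score

new_predictions = model.predict(N)
print(accuracy_score(z, new_predictions))
print(precision_score(z, new_predictions, pos_label=0))  # precision on the "fail" class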

The second part of my question still eludes me.