I wasn't able to reproduce the issue, and I don't think there is enough information here to do so, but I do have a few first thoughts.
run dummy data processing before train/test split
I see you're running the dummy data processing solely on your training data. I've found that it is usually a better strategy to run it on the entire dataset as a whole, and then split into train / test.
Sometimes when you split first, you can run into issues with the levels of your factors.
Let's say I have a field called colors, which is a factor in my data that contains the levels red, blue, green. If I split my data into train and test, I could run into a scenario where my training data only has red and blue values and no green. Now if my test dataset has all three, there will be a difference between the number of columns in my train vs test data.
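Here is a minimal sketch of that scenario. It uses base R's model.matrix() rather than the dummies package so it runs without extra installs, and the data frame, its values, and the row indices are all made up for illustration:

```r
# Hypothetical toy data. `colors` is kept as character so that each subset
# only "knows" the values it actually contains.
df <- data.frame(
  colors = c("red", "blue", "red", "blue", "green", "red", "blue"),
  stringsAsFactors = FALSE
)

# Split first: rows 1-4 happen to contain no green
train <- df[1:4, , drop = FALSE]
test  <- df[5:7, , drop = FALSE]

# One-hot encode each piece separately
train_dummies <- model.matrix(~ colors - 1, data = train)
test_dummies  <- model.matrix(~ colors - 1, data = test)

# Columns present in test but missing from train
missing_in_train <- setdiff(colnames(test_dummies), colnames(train_dummies))
print(missing_in_train)

# Encode first, then split with the same row indices: the columns match
all_dummies   <- model.matrix(~ colors - 1, data = df)
train_encoded <- all_dummies[1:4, , drop = FALSE]
test_encoded  <- all_dummies[5:7, , drop = FALSE]
print(identical(colnames(train_encoded), colnames(test_encoded)))
```

Splitting after encoding guarantees both pieces share the same columns, whatever values end up in each piece.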
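I believe one way around that issue is the drop parameter, which defaults to TRUE and omits dummy columns for unused factor levels. As far as I can tell it belongs to dummy(), and dummy.data.frame passes it along, so setting drop = FALSE should keep a column for every level even when a split contains no rows with that value.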
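I haven't verified this against your data, so here is the same idea sketched in base R instead: if the factor carries the full set of levels, unused levels still get a (all-zero) column. The level list and the toy train / test frames are assumptions for illustration:

```r
# Hypothetical: the full set of levels is known up front
all_levels <- c("red", "blue", "green")

# train contains no green, but the factor still declares it as a level
train <- data.frame(
  colors = factor(c("red", "blue", "red"), levels = all_levels)
)
test <- data.frame(
  colors = factor(c("green", "red", "blue"), levels = all_levels)
)

# model.matrix() emits one column per declared level, used or not
train_dummies <- model.matrix(~ colors - 1, data = train)
test_dummies  <- model.matrix(~ colors - 1, data = test)

print(identical(colnames(train_dummies), colnames(test_dummies)))
print(colSums(train_dummies))  # the green column is all zeros in train
```

Note that calling droplevels() on either piece before encoding would bring the mismatch back.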
things to check
Run these before the dummy data processing, on both train and test, to see what characteristics the dropped fields have:
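# find the class of each column (first class only, so columns with
# several classes, e.g. ordered factors, still give a single value)
train_class <- sapply(train, function(x) class(x)[1])
test_class <- sapply(test, function(x) class(x)[1])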
# find the number of unique values within each column
unq_train_vals <- sapply(train, function(x) length(unique(x)))
unq_test_vals <- sapply(test, function(x) length(unique(x)))
# combine into data frame for easy comparison
mydf <- data.frame(
  train_class = train_class,
  test_class = test_class,
  unq_train_vals = unq_train_vals,
  unq_test_vals = unq_test_vals
)
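To narrow that table down to the columns worth a closer look, something like the following might help. It is only a sketch on made-up train / test frames (the column names and values are hypothetical), and it assumes both frames have the same columns in the same order:

```r
# Hypothetical toy split: `colors` loses a value in train, `size` changes class
train <- data.frame(
  colors = c("red", "blue", "red"),
  size   = c(1, 2, 3),
  weight = c(10, 20, 30),
  stringsAsFactors = TRUE
)
test <- data.frame(
  colors = c("green", "red", "blue"),
  size   = c("1", "2", "3"),
  weight = c(11, 21, 31),
  stringsAsFactors = TRUE
)

# Same comparison table as above, built in one step
mydf <- data.frame(
  train_class    = sapply(train, function(x) class(x)[1]),
  test_class     = sapply(test,  function(x) class(x)[1]),
  unq_train_vals = sapply(train, function(x) length(unique(x))),
  unq_test_vals  = sapply(test,  function(x) length(unique(x))),
  stringsAsFactors = FALSE
)

# Keep only rows where the class or the number of unique values differs
differs  <- mydf$train_class != mydf$test_class |
  mydf$unq_train_vals != mydf$unq_test_vals
suspects <- mydf[differs, ]
print(suspects)
```

A differing unique count is not a problem by itself for numeric columns, so the factor and character rows are the ones to check first.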
I know this isn't really an "answer", but I don't have enough rep to comment yet.