Dummy coding omits / removes select variables from the data frame R

Question

I have a fairly large dataset, 1460 rows (n) by 81 columns (p). About 38 variables are numeric and the rest are factors with between 2 and 30 levels. I am using dummy.data.frame from the dummies package to encode the factor variables for use in regression models.

However, when I run the following code:

train_dummy <- dummy.data.frame(train, sep = ".", verbose = TRUE, all = TRUE)

some of the columns from the original dataset are removed.

Has anyone encountered such issue before?

Link to original training dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

A number of columns from the original dataset, including the response variable SalePrice, are being dropped. Any ideas or suggestions on what to try?

TaylorV · Accepted Answer · 2017-03-15T14:47:45

I wasn't able to reproduce the issue, and I don't think there is enough info here to do so, but I do have a few first thoughts.

run dummy data processing before train/test split

I see you're running the dummy data solely on your training data. I've found that it is usually a better strategy to run dummy data processing on the entire dataset as a whole, and then split into train / test.

Sometimes when you split first, you can run into issues with the levels of your factors.

Let's say I have a field called colors which is a factor in my data that contains the levels red, blue, green. If I split my data into train and test, I could run into a scenario where my training data only has red and blue values and no green. Now if my test dataset has all three, there will be a difference between the number of columns in my train vs test data.
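To make the colors scenario concrete, here is a minimal sketch. It uses base R's model.matrix rather than dummy.data.frame so that it runs without the dummies package; the data values, the split indices, and the encode helper are all made up for illustration.

```r
# Toy data: the column name `colors` follows the example above, the values are made up.
full <- data.frame(
  colors = c("red", "blue", "red", "blue", "green", "red", "blue", "green"),
  y = 1:8,
  stringsAsFactors = FALSE
)

# Hypothetical split in which the training rows happen to contain no "green".
train_idx <- 1:4

# Illustrative helper: one indicator column per value present in the data passed in.
encode <- function(df) {
  df$colors <- factor(df$colors)
  model.matrix(~ colors - 1, data = df)
}

# Split first, then encode: the test set gets a column the training set lacks.
train_cols <- colnames(encode(full[train_idx, ]))
test_cols <- colnames(encode(full[-train_idx, ]))
print(setdiff(test_cols, train_cols))

# Encode first, then split: both pieces share the same columns.
encoded <- encode(full)
train_enc <- encoded[train_idx, , drop = FALSE]
test_enc <- encoded[-train_idx, , drop = FALSE]
print(identical(colnames(train_enc), colnames(test_enc)))
```

In the second variant the training matrix simply carries an all-zero column for green, which is harmless for keeping train and test aligned.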

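I believe one way around that issue is the drop parameter, which dummy.data.frame passes on to dummy and which defaults to TRUE. As I understand it, TRUE drops unused factor levels, so setting drop = FALSE should keep a column for every level of the factor.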
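The idea of keeping unused levels can be sketched in base R as well. This is not the dummies API, only an analogue: it assumes the full set of levels is known up front (the all_levels vector and the toy frames are made up), and relies on model.matrix building one column per declared level.

```r
# Assumed to be known from the full dataset before splitting.
all_levels <- c("blue", "green", "red")

# Toy train/test frames; the training data contains no "green".
train_toy <- data.frame(colors = factor(c("red", "blue", "red"), levels = all_levels))
test_toy <- data.frame(colors = factor(c("green", "red", "blue"), levels = all_levels))

# model.matrix() creates a column for every declared level, observed or not.
train_mm <- model.matrix(~ colors - 1, data = train_toy)
test_mm <- model.matrix(~ colors - 1, data = test_toy)
print(colnames(train_mm))

# droplevels() discards the unused level, and the "green" column goes with it.
train_dropped <- model.matrix(~ colors - 1, data = droplevels(train_toy))
print(colnames(train_dropped))
```

Declaring the levels explicitly means the split order no longer matters for the column layout.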

things to check

Run these before running dummy data processing for train and test to see what characteristics these fields have that are being dropped:

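# compare only columns present in both sets (test may lack the response)
common <- intersect(names(train), names(test))

# find the class of each column (first class only, so the result is a vector)
train_class <- sapply(train[common], function(x) class(x)[1])
test_class <- sapply(test[common], function(x) class(x)[1])

# find the number of unique values within each column
unq_train_vals <- sapply(train[common], function(x) length(unique(x)))
unq_test_vals <- sapply(test[common], function(x) length(unique(x)))

# combine into data frame for easy comparison
mydf <- data.frame(
    train_class = train_class,
    test_class = test_class,
    unq_train_vals = unq_train_vals,
    unq_test_vals = unq_test_vals
)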
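It may also help to list exactly which original columns have no counterpart after encoding. The sketch below is one way to do that; find_dropped is a hypothetical helper, not part of any package, and the two small frames are made-up stand-ins for train and train_dummy. It assumes encoded columns are named with the original column name followed by the separator.

```r
# Hypothetical helper: columns of `original` with no counterpart in `encoded`.
# A counterpart is either the same name or a name starting with "<column><sep>".
find_dropped <- function(original, encoded, sep = ".") {
  has_match <- vapply(names(original), function(col) {
    col %in% names(encoded) ||
      any(startsWith(names(encoded), paste0(col, sep)))
  }, logical(1))
  names(original)[!has_match]
}

# Made-up frames standing in for `train` and `train_dummy`.
before <- data.frame(
  SalePrice = c(100, 200),
  Street = c("Pave", "Grvl"),
  Utilities = c("AllPub", "AllPub")
)
after <- data.frame(Street.Grvl = c(0, 1), Street.Pave = c(1, 0))

dropped <- find_dropped(before, after)
print(dropped)
```

The prefix match is deliberately crude: a column whose name is a prefix of another column's name (plus the separator) would be reported as present, so treat the result as a starting point.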

I know this isn't really an "answer", but I don't have enough rep to comment yet.
