I have a question about R code...

I am having a problem when I try to add features to my model. Our professor gave us some code to do lasso regression on Magic: The Gathering card prices. His code works as is, but whenever I add another column as a feature, I get an error.

Here is the error: "Error in cbind2(1, newx) %*% nbeta : Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90"

[Screenshot: my command line]

If I drop some of the columns in the larger training dataset, I still get the same error.

[Screenshot: after making the data frames the same number of columns]

Stepping through the code and checking the dimensions of the 'test' and 'train' data frames, I figured out which lines change them.

These lines:

dummies <- dummyVars(future_price ~ ., data = train)
train<-predict(dummies, newdata = train)
test<-predict(dummies, newdata = test)

So, before running these lines, both the train and test data sets have exactly 23 variables (columns). After running these three lines, the test data set has 41 columns and the training data set has 47. I don't understand how a different number of columns gets added to each data frame when the lines of code are the same apart from substituting 'train' and 'test'.

Please help! Thanks.

Welcome to SO; please see why an image of your code is not helpful, and the same holds true for an exception (error). Please copy & paste here as text. - desertnaut

1 Answer

The problem is that the new feature you added needs to be converted to a factor with as.factor.

Let's reproduce the column mismatch behind your error:

library(caret)  # provides dummyVars

# Toy data: cat is a character column, and "C" only occurs in the test rows
df <- data.frame(cat = c('A', 'B', 'C', 'B', 'A'), target = c(0, 0, 1, 1, 0))
df$cat <- as.character(df$cat)
train <- df[1:2, ]
test <- df[3:5, ]
dv_train <- dummyVars(target ~ ., train)
predict(dv_train, train)
# no column catC is created because in train there is no row where cat == "C"
#      catA     catB
#1        1        0
#2        0        1
predict(dv_train, test)
#  catA catB catC
#3    0    0    1
#4    0    1    0
#5    1    0    0

You can see that the two results have different numbers of columns, because train and test contain different sets of categories.
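To find out which columns are behind a mismatch like this, you can compare the column names of the two encoded matrices. Here is a minimal sketch in base R; train_mat and test_mat are hypothetical stand-ins for whatever predict(dummies, newdata = ...) returns in your code, built by hand here so the example runs without your data.

```r
# Hypothetical stand-ins for the matrices returned by predict(dummies, newdata = ...)
train_mat <- matrix(0, nrow = 2, ncol = 2,
                    dimnames = list(NULL, c("catA", "catB")))
test_mat <- matrix(0, nrow = 3, ncol = 3,
                   dimnames = list(NULL, c("catA", "catB", "catC")))

# Columns that exist in one matrix but not in the other
only_in_test <- setdiff(colnames(test_mat), colnames(train_mat))
only_in_train <- setdiff(colnames(train_mat), colnames(test_mat))

print(only_in_test)
print(only_in_train)
```

The prefix of each mismatched column name tells you which original variable has categories that appear in only one of the two data sets.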

To solve this problem, convert all your character variables to factors before splitting your data frame into train and test. That way, when dummyVars runs, every level gets its own column, whether or not it appears in a given subset.

# Convert cat column to factor
df$cat <- as.factor(df$cat)
train <- df[1:2, ]
test <- df[3:5, ]
dv_train <- dummyVars(target ~ ., train)
predict(dv_train, train)
#   cat.A cat.B cat.C
# 1     1     0     0
# 2     0     1     0

Now there is a cat.C column even though C still does not appear in train.
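With 23 columns you probably do not want to call as.factor on each one by hand. The sketch below shows one way to convert every character column at once in base R before splitting. The data frame and the colour column are made up for illustration, and the row indices used for the split are just an assumption.

```r
# Made-up data: two character columns and a numeric target
df <- data.frame(
  cat = c("A", "B", "C", "B", "A"),
  colour = c("red", "blue", "red", "green", "blue"),
  target = c(0, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

# Split AFTER the conversion; row subsetting keeps the full set of levels
train <- df[1:2, ]
test <- df[3:5, ]

print(levels(train$cat))
print(levels(test$cat))
```

Because both subsets carry the same levels, dummyVars should produce the same columns for each. If your train and test sets come from separate files, this only helps if you combine them (or otherwise give them identical levels) before encoding.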