What might be the error in this line of code in R?

Question

fit <- randomForest(class~. ,data = train_data)

Can anyone tell me what is wrong with this line of code?

Here, train_data is the training data for predicting whether income is >50k or <50k. The error I got from this line was:

Error in y - ymean : non-numeric argument to binary operator
In addition: Warning messages:
1: In randomForest.default(m, y, ...) :
  The response has five or fewer unique values.  Are you sure you want to do regression?
2: In mean.default(y) : argument is not numeric or logical: returning NA

Hi @FattyAcids, welcome to SO. It looks like an OK line of code. Are you encountering any errors with it? If so, please provide more information about the issue and show us what train_data is. - StupidWolf
train_data is the training data for predicting whether income is >50k or <50k, and the error I got was this: Error in y - ymean : non-numeric argument to binary operator In addition: Warning messages: 1: In randomForest.default(m, y, ...) : The response has five or fewer unique values. Are you sure you want to do regression? 2: In mean.default(y) : argument is not numeric or logical: returning NA - Fatty Acids
Thanks for providing the info. Can you update your post? As it is right now, I think it's unlikely to get high-quality answers and might be closed. - StupidWolf
Please edit. There's an edit link under your question. - StupidWolf

StupidWolf · Accepted Answer · 2020-07-08T12:59:45

It seems you are trying to do classification with a dependent variable stored as character. Let's say we use this fantastic dataset from Kaggle:

library(randomForest)
train_data = read.csv("credit_train.csv",stringsAsFactors=FALSE)

str(train_data)
'data.frame':   808 obs. of  17 variables:
 $ Class                         : chr  "Good" "Bad" "Good" "Good" ...
 $ Duration                      : int  6 48 12 36 24 12 30 48 12 24 ...
 $ Amount                        : int  1169 5951 2096 9055 2835 3059 5234 4308 1567 1199 ...
 $ InstallmentRatePercentage     : int  4 2 2 2 3 2 4 3 1 4 ...
 $ ResidenceDuration             : int  4 2 3 4 4 4 2 4 1 4 ...
 $ Age                           : int  67 22 49 35 53 61 28 24 22 60 ...
 $ NumberExistingCredits         : int  2 1 1 1 1 1 2 1 1 2 ...
 $ NumberPeopleMaintenance       : int  1 1 2 2 1 1 1 1 1 1 ...
 $ Telephone                     : int  0 1 1 0 1 1 1 1 0 1 ...
 $ ForeignWorker                 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ CheckingAccountStatus.lt.0    : int  1 0 0 0 0 0 0 1 0 1 ...
 $ CheckingAccountStatus.0.to.200: int  0 1 0 0 0 0 1 0 1 0 ...
 $ CheckingAccountStatus.gt.200  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CreditHistory.ThisBank.AllPaid: int  0 0 0 0 0 0 0 0 0 0 ...
 $ CreditHistory.PaidDuly        : int  0 1 0 1 1 1 0 1 1 0 ...
 $ CreditHistory.Delay           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CreditHistory.Critical        : int  1 0 1 0 0 0 1 0 0 1 ...

fit <- randomForest(Class~. ,data = train_data)

Error in y - ymean : non-numeric argument to binary operator
In addition: Warning messages:
1: In randomForest.default(m, y, ...) :
  The response has five or fewer unique values.  Are you sure you want to do regression?
2: In mean.default(y) : argument is not numeric or logical: returning NA

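You can see I get the same error. Your dependent variable is a character vector, and randomForest() only does classification when the response is a factor; otherwise it attempts regression, which is what the first warning is hinting at. Convert it to a factor and it works: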

train_data$Class = factor(train_data$Class)

fit <- randomForest(Class~. ,data = train_data)
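If you don't have the CSV at hand, the same behaviour can be reproduced with a built-in dataset. The following is only a minimal sketch: it assumes the randomForest package is installed, and it uses iris with its Species column coerced to character as a stand-in for train_data (the names df and res are made up for illustration).

```r
library(randomForest)

# iris stands in for train_data here (an assumption for illustration only)
df <- iris

# Mimic a response column that was read in as character
df$Species <- as.character(df$Species)

# A character response is not a factor, so randomForest() attempts
# regression; try() captures the resulting error instead of stopping
res <- try(suppressWarnings(randomForest(Species ~ ., data = df)), silent = TRUE)
inherits(res, "try-error")

# Convert the response to a factor so classification is used
df$Species <- factor(df$Species)
is.factor(df$Species)

set.seed(1)  # only for reproducibility of the fitted forest
fit <- randomForest(Species ~ ., data = df)

# The fitted object records which mode was used
fit$type
```

Checking is.factor() (or str()) on the response column before fitting is a quick way to catch this, since read.csv() with stringsAsFactors=FALSE leaves text columns as character.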
