
I've fitted a multiple linear regression model with lm(), using all predictors from my training set except 'lastname', and now I want to make predictions on my test set. However, when I call predict(cap.fit, test), I get an error about the variable 'lastname'.

I've also tried passing in a test set with the 'lastname' column removed, but that didn't work either.

Code:

cf_df <- read.csv(file="cap_friendly_data.csv", header=TRUE, sep=",")

new_cols <- c('lastname', 'Position', 'Age.Years', 'Original.Cap.Hit', 'New.Signing.Status', 'PPG.Prior.Signing', 'PPG.Contract.Year', 'New.Cap.Hit')

new_stats <- cf_df[, new_cols]

#create training and testing datasets
set.seed(2430)
num_training_samples <- 2000
train_indices <- sample(seq_len(nrow(new_stats)), num_training_samples, replace = FALSE)
train <- new_stats[train_indices, ]
test <- new_stats[-train_indices, ]
test_results <- test$New.Cap.Hit

#fit model
cap.fit <- lm(New.Cap.Hit ~ . - lastname, data = train)
summary(cap.fit)

predictions <- predict(cap.fit, test)

I expected to get a vector of predictions from the model, but instead I got this error message:

predictions <- predict(cap.fit, test)

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : factor lastname has new levels Åberg, Acciari, Acolatse, Alfredsson, Anderson, Angelidis, Arnold, Backes, Balisy, Baptiste, Barch...


1 Answer


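Can you try this? Drop the 'lastname' column from the data before splitting and fitting, instead of excluding it in the formula: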

str(new_stats)

# remove the 'lastname' column before splitting
new_stats <- subset(new_stats, select = -c(lastname))

#create training and testing datasets
set.seed(2430)
num_training_samples <- 2000
train_indices <- sample(seq_len(nrow(new_stats)), num_training_samples, replace = FALSE)
train <- new_stats[train_indices, ]
test <- new_stats[-train_indices, ]
test_results <- test$New.Cap.Hit

#fit model
cap.fit <- lm(New.Cap.Hit ~ ., data = train)
summary(cap.fit)

# do predictions
predictions <- predict(cap.fit, test)
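
If it helps, here is a minimal sketch of why dropping the column first should make a difference. It uses made-up data: the `toy` data frame and its `id` column are hypothetical stand-ins for your data and 'lastname'. As far as I can tell, `. - id` removes the term from the fitted model, but the column is still recorded in the model frame, so predict() checks its values against the ones seen during training.

```r
# Toy data (hypothetical): 'id' plays the role of 'lastname',
# a text column with a different value on every row.
set.seed(1)
toy <- data.frame(
  id = paste0("p", 1:20),
  x  = rnorm(20),
  stringsAsFactors = FALSE
)
toy$y <- 2 * toy$x + rnorm(20, sd = 0.1)

toy_train <- toy[1:15, ]
toy_test  <- toy[16:20, ]

# Approach from the question: exclude 'id' inside the formula.
# The fit works, but predict() is expected to fail on the unseen ids.
fit_formula <- lm(y ~ . - id, data = toy_train)
res <- try(predict(fit_formula, toy_test), silent = TRUE)
failed <- inherits(res, "try-error")

# Approach from this answer: drop 'id' from the data before fitting.
# The extra 'id' column still present in toy_test is simply not used.
fit_dropped <- lm(y ~ ., data = subset(toy_train, select = -id))
preds <- predict(fit_dropped, toy_test)

print(failed)
print(preds)
```

With the second fit the test set does not need to be changed at all, because the model never saw the 'id' column in the first place.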