library(caret)
library(rpart.plot)
car_df <- read.csv("TrainingDataSet.csv", sep = ',', header = TRUE)
str(car_df)

set.seed(3033)
intrain <- createDataPartition(y = car_df$Result, p= 0.7, list = FALSE)
training <- car_df[intrain,]
testing <- car_df[-intrain,]
dim(training)
dim(testing)
anyNA(car_df)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
dtree_fit <- train(Result ~., data = training, method = "rpart",
               parms = list(split = "infromation"),
               trControl=trctrl,
               tuneLength = 10)

I get this warning:

Warning message: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.

I am trying to classify whether a movie is a hit or a flop using the percentage of positive and negative sentiment. Here is my data:

  dput(car_df) 

structure(list(MovieName = structure(c(20L, 5L, 31L, 26L, 27L, 
12L, 36L, 29L, 38L, 4L, 6L, 8L, 10L, 15L, 18L, 21L, 24L, 34L, 
35L, 7L, 37L, 25L, 23L, 2L, 11L, 40L, 33L, 28L, 14L, 3L, 17L, 
16L, 32L, 22L, 30L, 1L, 19L, 39L, 9L, 13L), .Label = c("#96Movie", 
"#alphamovie", "#APrivateWar", "#AStarIsBorn", "#BlackPanther", 
"#BohemianRhapsody", "#CCV", "#Creed2", "#CrimesOfGrindelwald", 
"#Deadpool2", "#firstman", "#GameNight", "#GreenBookMovie", "#grinchmovie", 
"#Incredibles2", "#indivisiblemovie", "#InstantFamily", "#JurassicWorld", 
"#KolamaavuKokila", "#Oceans8", "#Overlord", "#PariyerumPerumal", 
"#RalphBreaksTheInternet", "#Rampage", "#Ratchasan", "#ReadyPlayerOne", 
"#RedSparrow", "#RobinHoodMovie", "#Sarkar", "#Seemaraja", "#Skyscraper", 
"#Suspiria", "#TheLastKey", "#TheNun", "#ThugsOfHindostan", "#TombRaider", 
"#VadaChennai", "#Venom", "#Vishwaroopam2", "#WidowsMovie"), class = "factor"), 
    PositivePercent = c(40.10554, 67.65609, 80.46796, 71.34831, 
    45.36082, 68.82591, 46.78068, 63.85787, 47.20497, 32.11753, 
    63.7, 39.2, 82.76553, 88.78613, 72.18274, 72.43187, 31.0089, 
    38.50932, 38.9, 19.9, 84.26854, 29.4382, 58.13953, 86.9281, 
    64.54965, 56, 0, 56.61914, 58.82353, 54.98891, 78.21682, 
    90, 64.3002, 85.8, 51.625, 67.71894, 92.21557, 53.84615, 
    40.12158, 68.08081), NegativePercent = c(11.34565, 21.28966, 
    6.408952, 13.10861, 26.80412, 17.10526, 18.61167, 10.55838, 
    46.48033, 56.231, 9.9, 12.1, 9.018036, 6.473988, 13.90863, 
    16.77149, 63.20475, 42.54658, 40.9, 5.4, 3.907816, 2.022472, 
    10.51567, 3.267974, 15.12702, 15.3, 100, 18.12627, 11.76471, 
    13.41463, 5.775076, 10, 20.08114, 2.1, 5.5, 7.739308, 0, 
    34.61538, 12.86727, 10.70707), Result = structure(c(2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 
    1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Flop", "Hit"
    ), class = "factor")), class = "data.frame", row.names = c(NA, 
-40L))
Can you include a sample of your data using the dput(car_df) command? (morgan121)
I have edited the question; please find the data above, @user10626943. (Rakesh)
This is a warning, not an error. Does your tree get fitted or not? (desertnaut)
No, it does not fit, @desertnaut. Not all of my features are factors, though: of my 4 columns, two are factors and two are numeric. Could that be the reason? (Rakesh)
I converted them to factors but still get the same message. (Rakesh)

1 Answer

> str(car_df)
'data.frame':   40 obs. of  4 variables:
 $ MovieName      : Factor w/ 40 levels "#96Movie","#alphamovie",..: 20 5 31 26 27 12 36 29 38 4 ...
 $ PositivePercent: num  40.1 67.7 80.5 71.3 45.4 ...
 $ NegativePercent: num  11.35 21.29 6.41 13.11 26.8 ...
 $ Result         : Factor w/ 2 levels "Flop","Hit": 2 2 2 2 2 2 2 2 2 1 ...

> with(car_df, table( Result))
Result
Flop  Hit 
   5   35 

 > dtree_fit
CART 

29 samples
 3 predictor
 2 classes: 'Flop', 'Hit' 

So you have an outcome with only 5 flops out of 40 cases, and one of the predictors (MovieName) is a factor with 40 different levels, one per row. Given that every case is unique and the outcome is severely unbalanced, the warning is not surprising. The presence of data does not guarantee that substantial conclusions can be drawn from it. If there's any error here, it's the lack of code in the fitter that would say something along the lines of "Really? You think statistical packages should be able to solve a severe lack of data?"
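The two problems described above can be checked directly before any model is fitted. The following is a minimal sketch in base R on a made-up stand-in for car_df (the name toy_df and its random percentages are assumptions for illustration; only the shape, 40 rows with 5 flops and one unique name per row, follows the question). Dropping the name column is a suggestion of this sketch, not something the original code does.

```r
# Stand-in data with the same shape as car_df (values are made up for illustration)
set.seed(1)
toy_df <- data.frame(
  MovieName       = factor(paste0("#Movie", 1:40)),
  PositivePercent = runif(40, 0, 100),
  NegativePercent = runif(40, 0, 100),
  Result          = factor(rep(c("Flop", "Hit"), times = c(5, 35)))
)

# Class balance: how many cases fall in each outcome level
class_counts <- table(toy_df$Result)
print(class_counts)

# A factor with one level per row only identifies the case
is_id_column <- nlevels(toy_df$MovieName) == nrow(toy_df)
print(is_id_column)

# Keep only the numeric predictors and the outcome for modelling
model_df <- toy_df[, c("PositivePercent", "NegativePercent", "Result")]
str(model_df)
```

With the identifier column removed, a formula such as Result ~ . would use only the two sentiment percentages, although that does nothing about the shortage of flops.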

BTW: the split argument is misspelled in your call. It should be as follows (although, unsurprisingly, fixing it doesn't clear the warning):

parms = list(split = "information")
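A misspelling like this is easy to overlook, so it can help to validate the string before passing it on. The sketch below assumes the two classification criteria named in the rpart documentation, "gini" and "information"; check_split is a hypothetical helper written for this example and is not part of rpart or caret.

```r
# Splitting criteria that rpart documents for classification trees
valid_splits <- c("gini", "information")

# Hypothetical helper: return the criterion if it is valid, otherwise raise an error
check_split <- function(split) {
  match.arg(split, valid_splits)
}

# The correct spelling passes through unchanged
ok <- check_split("information")

# The misspelled value is rejected; the error is caught here and mapped to NA
bad <- tryCatch(check_split("infromation"), error = function(e) NA_character_)

print(c(ok = ok, bad = bad))
```

The checked value could then be used as parms = list(split = check_split("information")).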

If you reduce the number of cross-validation folds so that the flops can be distributed among all of them, you can get a result without the warning. Whether it will have much validity remains questionable, given the small sample size:

trctrl <- trainControl(method = "repeatedcv", number = 3, repeats = 3)
set.seed(3333)
dtree_fit <- train(Result ~ ., data = training, method = "rpart",
                   parms = list(split = "information"),
                   trControl = trctrl,
                   tuneLength = 10)
# no warning on one of my runs
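
Why the fold count matters can be seen with simple counting. The sketch below assumes the 29-row training set holds 4 flops and 25 hits (roughly 70% of each class; the exact rows depend on createDataPartition). The functions assign_folds and folds_without_flop are hypothetical helpers that deal each class out across the folds; this is a simplified stand-in for stratified fold creation, not caret's actual implementation.

```r
# Outcome vector for the assumed training set: 4 flops and 25 hits
y <- factor(rep(c("Flop", "Hit"), times = c(4, 25)))

# Simplified stratified assignment: deal each class out across k folds
assign_folds <- function(y, k) {
  folds <- integer(length(y))
  for (cls in levels(y)) {
    idx <- which(y == cls)
    folds[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
  }
  folds
}

# Count the held-out folds that contain no flop at all
folds_without_flop <- function(y, k) {
  folds <- assign_folds(y, k)
  flops_per_fold <- tabulate(folds[y == "Flop"], nbins = k)
  sum(flops_per_fold == 0)
}

set.seed(3333)
empty_10 <- folds_without_flop(y, 10)  # 4 flops cannot cover 10 folds
empty_3  <- folds_without_flop(y, 3)   # 4 flops can cover 3 folds
print(c(k10 = empty_10, k3 = empty_3))
```

Under these assumptions, 10 folds always leave 6 held-out sets with no flops, while 3 folds leave none, which is consistent with the warning disappearing on some runs once number = 3 is used.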