My apologies in advance for such a long and detailed explanation of my problem. I have produced 10 nested dataframes from classification tree class probabilities (grouping factor: G8 and V4) in a master-list using three functions Shuffle100
my_List
and Final_lists
(below). I am sorry that I am asking this simple question but I wasn't able to figure it out. Many thanks is advance if anyone find a solution.
Aim 1
(1) I would like to insert the functionconfusionMatrix()
from the caret package
into the function shuffle100
to produce 10 confusion matrices for each subset
Functions shuffle100
, my_list
and Final_lists
library(plyr)
library(caret)
library(e1071)
library(rpart)
set.seed(1235)
shuffle100 <-lapply(seq(10), function(n){ #Select the production of 10 dataframes
subset <- normalised_scores[sample(nrow(normalised_scores), 80),] #Shuffle rows
subset_idx <- sample(1:nrow(subset), replace = FALSE)
subset <- subset[subset_idx, ] #training subset
subset1<-subset[-subset_idx, ] #test subset
subset_resampled_idx <- createDataPartition(subset_idx, times = 1, p = 0.7, list = FALSE) #70 % training set
subset_resampled <- subset[subset_resampled_idx, ]
ct_mod<-rpart(Matriline~., data=subset_resampled, method="class", control=rpart.control(cp=0.005)) #10 ct
ct_pred<-predict(ct_mod, newdata=subset[, 2:13])
ct_dataframe=as.data.frame(ct_pred)#create new data frame
confusionMatrix(ct_dataframe, normalised_scores$Family)
}
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
1: lapply(seq(10), function(n) {
subset <- normalised_scores[sample(nrow(normalised_scores
2: FUN(X[[i]], ...)
3: confusionMatrix(ct_dataframe, normalised_scores$Family)
4: confusionMatrix.default(ct_dataframe, normalised_scores$Family)
5: factor(data)
6: sort.list(y)
#Produce three columns: Predicted, Actual and Binary
my_list <- lapply(shuffle100, function(df){#Create two new columns Predicted and Actual
if (nrow(df) > 0)
cbind(df, Predicted = c(""), Actual = c(""), Binary = c(""))
else
bind(df, Predicted = character(), Actual = c(""), Binary = c (""))
})
#Fill the empty columns with NA's
Final_lists <- lapply(my_list, function(x) mutate(x, Predicted = NA, Actual = NA, Binary = NA))
#Create a dataframe from the column normalised_scores$Family to fill the Actual column
Actual_scores<-Final_normalised3$Family
Final_scores<-as.data.frame(Actual_scores)
#Fill in the Predicted, Actual and Binary columns
Predicted_Lists <- Final_lists %>%
mutate(Predicted=ifelse(G8 > V4, G8, V4)) %>% # assuming if G8 > V4 then Predicted=G8
mutate(Actual=Final_scores) %>% # your definition of Actual is not clear
mutate(Binary=ifelse(Predicted==Actual, 1, 0))
#Error messages
Error in ifelse(G8 > V4, G8, V4) : object 'G8' not found
Aim 2
To write a function or for loop to fill the Predicted
, Actual
and Binary
columns for each subset based on the condition that probabilities in rows for columns V4 or G8 may be greater than or lesser than each other. However, I am confused with the correct syntax for functions and loops
A for loop
which does not work
for(i in 1:length(Final_lists)){ #i loops through each dataframe in the list
for(j in 2:nrow(Final_lists[[i]])){ #j loops through each row of each dataframe in the list
if(Final_lists[[i]][j, "G8"] > Final_lists[[i]][j, "V4"]) { #if the probability of G8 > V4 in each row of each dataframe in each list
Final_lists[[i]][j, [j["Predicted" == "NA"]] ="G8" #G8 will be filled into the same row in the `Predicted' column
}
else {
Final_lists[[i]][j, [Predicted == "NA"]] ="V4" #V4 will be filled into the same row in the `Predicted' column
}
print(i)
}
}
Each subset should have this format when columns are filled:
G8 V4 Predicted Actual Binary
0.1764706 0.8235294 V4 V4 1
0.7692308 0.2307692 G8 V4 0
0.7692308 0.2307692 G8 V4 0
0.7692308 0.2307692 G8 V4 0
0.7692308 0.2307692 G8 V4 0
0.1764706 0.8235294 V4 V4 1
Filling the Predicted
column
If the probability of G8 > V4, then the empty Predicted
row is assigned G8. However, if V4 > G8, then the empty `Predicted' row will be assigned V4.
Filling the Actual
column
These are the actual predicted class probability predictions from the classification tree models for each subset, which are contained in the data_frame `normalised_scores
Filling in the Binary
column
If the Predicted
and Actual
rows have the same result (e.g. G8 and G8), then the Binary
row is assigned the value of 1. However, if rows of the Predicted
and Actual
columns are different (e.g. G8 and V4), then the Binary
row is assigned the value 0.
I achieved these aims using this working code, however, I am unsure how to apply this code to the subsets in the master-list.
Working code for a single subset
set.seed(1235)
# Randomly permute the data before subsetting
mydat_idx <- sample(1:nrow(Final_normalised_scores), replace = FALSE)
mydat <- Final_normalised3[mydat_idx, ]
mydat_resampled_idx <- createDataPartition(mydat_idx, times = 1, p = 0.7, list = FALSE)
mydat_resampled <- mydat[mydat_resampled_idx, ] # Training portion of the data
mydat_resampled1 <- mydat[-mydat_resampled_idx, ]
#Classification tree
ct_mod <- train(x = mydat_resampled[, 2:13], y = as.factor(mydat_resampled[, 1]),
method = "rpart", trControl = trainControl(method = "repeatedcv", number=10, repeats=100, classProbs = TRUE))
#Model predictions
ct_pred <- predict(ct_mod, newdata = mydat[ , 2:13], type = "prob")
Final_Predicted<-as.data.frame(ct_pred)
#Produce three empty columns: Predicted, Actual and Binary
Final_Predicted$Predicted<-NA
Final_Predicted$Actual<-NA
Final_Predicted$Binary<-NA
#Fill in the Predicted column
for (i in 1:length(Final_Predicted$G8)){
if(Final_Predicted$G8[i]>Final_Predicted$V4[i]) {
Final_Predicted$Predicted[i]<-"G8"
}
else {
Final_Predicted$Predicted[i]<-"V4"
}
print(i)
}
#Fill in the Actual column using the actual predictions from the dataframe normalised_scores
Final_Predicted$Actual<-normalised_scores$Family
#Fill in the Binary column
for (i in 1:length(Final_Predicted$Binary)){
if(Final_Predicted$Predicted[i]==Final_Predicted$Actual[i]) {
Final_Predicted$Binary[i]<-1
}
else {
Final_Predicted$Binary[i]<-0
}
print(i)
}
Subset from the master-list
G8 V4 Predicted Actual Binary
0.1764706 0.8235294 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.1764706 0.8235294 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.1764706 0.8235294 NA NA NA
0.1764706 0.8235294 NA NA NA
0.1764706 0.8235294 NA NA NA
0.7692308 0.2307692 NA NA NA
0.1764706 0.8235294 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.1764706 0.8235294 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.1764706 0.8235294 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.1764706 0.8235294 NA NA NA
0.1764706 0.8235294 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.1764706 0.8235294 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.1764706 0.8235294 NA NA NA
0.7692308 0.2307692 NA NA NA
0.1764706 0.8235294 NA NA NA
0.1764706 0.8235294 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.1764706 0.8235294 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.1764706 0.8235294 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.1764706 0.8235294 NA NA NA
0.1764706 0.8235294 NA NA NA
0.7692308 0.2307692 NA NA NA
0.1764706 0.8235294 NA NA NA
0.1764706 0.8235294 NA NA NA
0.1764706 0.8235294 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.1764706 0.8235294 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.1764706 0.8235294 NA NA NA
0.1764706 0.8235294 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.1764706 0.8235294 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.7692308 0.2307692 NA NA NA
0.1764706 0.8235294 NA NA NA
Reproducible dummy data
SummarySE (Rmisc package) to produce a barplot with error bars (ggplot2)
Final_normalised
. – mikeck