Applying a function to generate confusion matrices from nested lists of classification tree class probabilities within a list

Question

My apologies in advance for such a long and detailed explanation of my problem. I have produced 10 nested dataframes from classification tree class probabilities (grouping factor: G8 and V4) in a master-list using three functions Shuffle100 my_List and Final_lists (below). I am sorry that I am asking this simple question but I wasn't able to figure it out. Many thanks is advance if anyone find a solution.

Aim 1

(1) I would like to insert the functionconfusionMatrix() from the caret package into the function shuffle100 to produce 10 confusion matrices for each subset

Functions `shuffle100`, `my_list` and `Final_lists`

library(plyr)
library(caret)
library(e1071)
library(rpart)

set.seed(1235)

 shuffle100 <-lapply(seq(10), function(n){ #Select the production of 10 dataframes
 subset <- normalised_scores[sample(nrow(normalised_scores), 80),] #Shuffle rows
 subset_idx <- sample(1:nrow(subset), replace = FALSE)
 subset <- subset[subset_idx, ] #training subset
 subset1<-subset[-subset_idx, ] #test subset
 subset_resampled_idx <- createDataPartition(subset_idx, times = 1, p = 0.7, list = FALSE) #70 % training set    
 subset_resampled <- subset[subset_resampled_idx, ]
 ct_mod<-rpart(Matriline~., data=subset_resampled, method="class", control=rpart.control(cp=0.005)) #10 ct
 ct_pred<-predict(ct_mod, newdata=subset[, 2:13]) 
 ct_dataframe=as.data.frame(ct_pred)#create new data frame
 confusionMatrix(ct_dataframe, normalised_scores$Family)
 }

  Error in sort.list(y) : 'x' must be atomic for 'sort.list'
  Have you called 'sort' on a list?

 1: lapply(seq(10), function(n) {
subset <- normalised_scores[sample(nrow(normalised_scores
 2: FUN(X[[i]], ...)
 3: confusionMatrix(ct_dataframe, normalised_scores$Family)
 4: confusionMatrix.default(ct_dataframe, normalised_scores$Family)
 5: factor(data)
 6: sort.list(y)

 #Produce three columns: Predicted, Actual and Binary
 my_list <- lapply(shuffle100, function(df){#Create two new columns Predicted and Actual
                  if (nrow(df) > 0)
                cbind(df, Predicted = c(""), Actual = c(""), Binary = c(""))
         else
                 bind(df, Predicted = character(), Actual = c(""), Binary = c (""))
                 })

#Fill the empty columns with NA's
Final_lists <- lapply(my_list, function(x) mutate(x, Predicted = NA, Actual = NA, Binary = NA)) 

#Create a dataframe from the column normalised_scores$Family to fill the Actual column

Actual_scores<-Final_normalised3$Family
Final_scores<-as.data.frame(Actual_scores)

#Fill in the Predicted, Actual and Binary columns

 Predicted_Lists <- Final_lists %>%
 mutate(Predicted=ifelse(G8 > V4, G8, V4)) %>% # assuming if G8 > V4 then Predicted=G8
 mutate(Actual=Final_scores) %>% # your definition of Actual is not clear
 mutate(Binary=ifelse(Predicted==Actual, 1, 0))

#Error messages

Error in ifelse(G8 > V4, G8, V4) : object 'G8' not found

Aim 2

To write a function or for loop to fill the Predicted, Actual and Binary columns for each subset based on the condition that probabilities in rows for columns V4 or G8 may be greater than or lesser than each other. However, I am confused with the correct syntax for functions and loops

A `for loop` which does not work

  for(i in 1:length(Final_lists)){ #i loops through each dataframe in the list 
   for(j in 2:nrow(Final_lists[[i]])){ #j loops through each row of each dataframe in the list
   if(Final_lists[[i]][j, "G8"] > Final_lists[[i]][j, "V4"]) { #if the probability of G8 > V4 in each row of each dataframe in each list
      Final_lists[[i]][j, [j["Predicted" == "NA"]] ="G8" #G8 will be filled into the same row in the `Predicted' column
      }
    else {
   Final_lists[[i]][j, [Predicted == "NA"]] ="V4" #V4 will be filled into the same row in the `Predicted' column
    }
print(i)
    }
    }

Each subset should have this format when columns are filled:

               G8        V4 Predicted Actual Binary
        0.1764706 0.8235294        V4     V4      1
        0.7692308 0.2307692        G8     V4      0
        0.7692308 0.2307692        G8     V4      0
        0.7692308 0.2307692        G8     V4      0
        0.7692308 0.2307692        G8     V4      0
        0.1764706 0.8235294        V4     V4      1

Filling the `Predicted` column

If the probability of G8 > V4, then the empty Predicted row is assigned G8. However, if V4 > G8, then the empty `Predicted' row will be assigned V4.

Filling the `Actual` column

These are the actual predicted class probability predictions from the classification tree models for each subset, which are contained in the data_frame `normalised_scores

Filling in the `Binary` column

If the Predicted and Actual rows have the same result (e.g. G8 and G8), then the Binary row is assigned the value of 1. However, if rows of the Predicted and Actual columns are different (e.g. G8 and V4), then the Binary row is assigned the value 0.

I achieved these aims using this working code, however, I am unsure how to apply this code to the subsets in the master-list.

Working code for a single subset

      set.seed(1235)

    # Randomly permute the data before subsetting
      mydat_idx <- sample(1:nrow(Final_normalised_scores), replace = FALSE)
      mydat <- Final_normalised3[mydat_idx, ]

      mydat_resampled_idx <- createDataPartition(mydat_idx, times = 1, p = 0.7, list = FALSE)
      mydat_resampled <- mydat[mydat_resampled_idx, ] # Training portion of the data
      mydat_resampled1 <- mydat[-mydat_resampled_idx, ]

      #Classification tree

      ct_mod <- train(x = mydat_resampled[, 2:13], y = as.factor(mydat_resampled[, 1]), 
            method = "rpart", trControl = trainControl(method = "repeatedcv", number=10, repeats=100, classProbs = TRUE))

       #Model predictions
       ct_pred <- predict(ct_mod, newdata = mydat[ , 2:13], type = "prob")
       Final_Predicted<-as.data.frame(ct_pred)

       #Produce three empty columns: Predicted, Actual and Binary

       Final_Predicted$Predicted<-NA
       Final_Predicted$Actual<-NA
       Final_Predicted$Binary<-NA

       #Fill in the Predicted column

      for (i in 1:length(Final_Predicted$G8)){
        if(Final_Predicted$G8[i]>Final_Predicted$V4[i]) {
           Final_Predicted$Predicted[i]<-"G8"
           }
      else {
           Final_Predicted$Predicted[i]<-"V4"
           }
           print(i)
           }

        #Fill in the Actual column using the actual predictions from the dataframe normalised_scores

        Final_Predicted$Actual<-normalised_scores$Family

        #Fill in the Binary column

        for (i in 1:length(Final_Predicted$Binary)){
           if(Final_Predicted$Predicted[i]==Final_Predicted$Actual[i]) {
              Final_Predicted$Binary[i]<-1
              }
         else {
              Final_Predicted$Binary[i]<-0
              }
              print(i)
              }

Subset from the master-list

                  G8        V4 Predicted Actual Binary
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.7692308 0.2307692        NA     NA     NA
           0.1764706 0.8235294        NA     NA     NA

Reproducible dummy data

SummarySE (Rmisc package) to produce a barplot with error bars (ggplot2)

I think you might have better luck getting help if you trim down this question. From what I understand, you essentially want to walk through a list of dataframes (or a list of lists of dataframes) and fill in columns using either data from other columns within the dataframe or from some companion dataset (which I presume has a one-to-one correspondence with the nested list). Can you instead post a subset of the nested list structure, and some sample code for how you would accomplish your task for one element of the list? Also it is not clear to me why you have included Final_normalised. — mikeck
Hi mikeck, Thank you for your advice. I trimmed the question into two subsections as well as adding working code for one subset. I hope this format is better. Thank you. All the best and take care — Alice Hobbs

jfaleiro jfaleiro · Accepted Answer · 2016-04-07T18:11:17

Your description of the problem is a bit long, but a possible dplyr solution would look like this:

Final_Predicted$Actual <- ... # fill actual values
Final_Predicted <- Final_Predicted %>%
              mutate(Predicted=ifelse(G8 > V4, G8, V4)) %>% # assuming if G8==V4 then Predicted=V4
              mutate(Binary=ifelse(Predicted==Actual, 1, 0))

I didn't actually run this solution, but it should be something short and simple along these lines. Hope this helps.

Applying a function to generate confusion matrices from nested lists of classification tree class probabilities within a list

Aim 1

Functions shuffle100, my_list and Final_lists

Aim 2

A for loop which does not work

Filling the Predicted column

Filling the Actual column

Filling in the Binary column