1
votes

I am very new to R, so please bear with my wording. I have a df from a csv that has 106 obs of 11 variables. I only care about 2 of those variables, so I made a new df called "df."

bc <- read.csv("---.csv")
# keep only the two columns of interest
df <- bc[, c("A", "B")]

#Example of the new df:

A       B
mass    0.1
mass    0.2
height  0.5
height  0.3
color   0.9
color   0.1

Then I made (4) vectors, each holding the rows that satisfy (2) conditions at once: B is greater than OET or less than OET, AND type is "mass" or type is not "mass."

TP= df[df$B>=i & df$A=="mass",]     
TN= df[df$B<=i & df$A!="mass",]
FP= df[df$B>=i & df$A!="mass",]
FN= df[df$B<=i & df$A=="mass",]

I think I want to use a for loop so that I get a vector for every B condition, i.e. for every i. If I set "i" to a single value, the vectors give me all rows that fit, and nrow("vector") tells me how many rows that is, but I cannot type all 106 df$B values into i by hand.

I first used print to check that my i would work, and it showed that I could get every value from df$B. Then I tried half of the TP vector, the df$A part. That worked. Then I tried the df$B part alone. It ran, but it gave me all 106 obs, which I know is wrong because the non-looped TP gave me 21 obs.

The end goal of the code is to give me the number of TP and TN for every df$B value that meets my (2) conditions, so that I can plug them into another function to ggplot. [like Y=TP/TP-TN]

N=c(df$B)
for(i in N){
print(paste(i))
}   
# worked
          
for(i in N){
TPA=df[df$A=="mass",]
TP=nrow(TPA)
}
# worked

for(i in N){
TPB=df[df$B>=i,]
TP=nrow(TPB)
}
#ran but did not do what I wanted

I guess my question is how do I run all rows of df$B against each df$B, all 106 of them, and store them?

When i = df$B[1], how many rows of df$B are >i

When i= df$B[2], how many rows of df$B are >i

From a formula like this, I would like an output like below:

results=data.frame(matrix(nrow=,ncol=4))
colnames(results)=c("A","B","TP","TN")
B=rep(c("mass","not mass"),each=106)
N=c(df$B)
for(i in N){
    TPC=df[df$A=='mass' & df$B>=i,]
    TP=nrow(TPC)
    TNC=df[df$A!='mass' & df$B<=i,]
    TN=nrow(TNC)
    } 
results=cbind.data.frame(B,A,results)

B       A        TP   TN
mass    df$B[1]  21   0
mass    df$B[2]  18   12
...
notmass df$B[1]  1    11
notmass df$B[2]  3    10
...

If you read this far, thank you! Any direction or answer would be most appreciated!

1
I'm not entirely sure, but I think there might be a typo in your FN= df[df$B<=i & df$A=="mass",] line. Did you mean df$B >= i perhaps? - Jon Spring
Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. - Community♦

1 Answer

2
votes

I'm not sure I'm understanding the terms of your confusion matrix properly, but here's a suggestion for a general approach that seems to me more idiomatic to R, using in this case dplyr and tidyr.

Starting with your data:

df1 <- data.frame(
  stringsAsFactors = FALSE,
                 A = c("mass", "mass", "height", "height", "color", "color"),
                 B = c(0.1, 0.2, 0.5, 0.3, 0.9, 0.1)
)

We can add a logical mass variable to capture if A is or isn't equal to "mass". We can also make a list of the values of B to use later.

df1$mass = df1$A == "mass" 
B_val = sort(unique(df1$B))

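Below, I make a copy of the data for each value of B_val and use dplyr::case_when to define the values of the confusion matrix. case_when uses the first condition that matches, so a row where B equals B_val is labeled by whichever rule comes first. (I suspect I don't have these definitions right, but they should be simple to fix.)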

Finally, at the bottom I count how many combinations arise, and then reshape the data into wider format with columns named for each conclusion.
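As a standalone illustration of the "copy of the data for each value of B_val" step, here is a minimal sketch using the same toy data. The name expanded is only for this illustration. One caveat, as I understand the tidyr documentation: crossing() de-duplicates and sorts its inputs, so if the real 106-row data contains repeated (A, B) rows, tidyr::expand_grid() may be the safer choice because it keeps every row.

```r
library(tidyr)

df1 <- data.frame(
  A = c("mass", "mass", "height", "height", "color", "color"),
  B = c(0.1, 0.2, 0.5, 0.3, 0.9, 0.1)
)
B_val <- sort(unique(df1$B))   # the 5 distinct thresholds

# crossing() pairs every (unique) row of df1 with every value of B_val;
# "expanded" is a hypothetical name used only in this sketch
expanded <- crossing(df1, B_val)

nrow(expanded)   # 6 rows x 5 thresholds
head(expanded)
```

Each original row now appears once per threshold, which is what lets a single vectorised comparison such as B >= B_val replace the loop over i.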

library(dplyr)
library(tidyr)

df1 %>%
  crossing(B_val) %>%                # one copy of each row per threshold
  mutate(type = case_when(           # first matching rule wins
    B >= B_val & mass  ~ "TP",
    B <= B_val & !mass ~ "TN",
    B <= B_val & mass  ~ "FP",
    B >= B_val & !mass ~ "FN",
    TRUE ~ "undefined"
  )) %>%
  count(mass, B_val, type) %>%       # rows per mass / threshold / label
  # group_by(mass, B_val) %>%   # uncomment these lines for proportions
  # mutate(n = n / sum(n)) %>%
  pivot_wider(names_from = type, values_from = n)

This produces the output below:

# A tibble: 10 x 6
   mass  B_val    FN    TN    TP    FP
   <lgl> <dbl> <int> <int> <int> <int>
 1 FALSE   0.1     3     1    NA    NA
 2 FALSE   0.2     3     1    NA    NA
 3 FALSE   0.3     2     2    NA    NA
 4 FALSE   0.5     1     3    NA    NA
 5 FALSE   0.9    NA     4    NA    NA
 6 TRUE    0.1    NA    NA     2    NA
 7 TRUE    0.2    NA    NA     1     1
 8 TRUE    0.3    NA    NA    NA     2
 9 TRUE    0.5    NA    NA    NA     2
10 TRUE    0.9    NA    NA    NA     2

Or if looking at proportions:

   mass  B_val    FN    TN    TP    FP
   <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 FALSE   0.1  0.75  0.25  NA    NA  
 2 FALSE   0.2  0.75  0.25  NA    NA  
 3 FALSE   0.3  0.5   0.5   NA    NA  
 4 FALSE   0.5  0.25  0.75  NA    NA  
 5 FALSE   0.9 NA     1     NA    NA  
 6 TRUE    0.1 NA    NA      1    NA  
 7 TRUE    0.2 NA    NA      0.5   0.5
 8 TRUE    0.3 NA    NA     NA     1  
 9 TRUE    0.5 NA    NA     NA     1  
10 TRUE    0.9 NA    NA     NA     1
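For comparison, here is a base R sketch that stays closer to the loop in the question. It assumes the usual confusion-matrix convention, which the question does not confirm: a row counts as "predicted mass" when B >= threshold, and the four cells follow from comparing that prediction with whether A is actually "mass". The names is_mass, thresholds and count_cells are mine, not from the question.

```r
df1 <- data.frame(
  A = c("mass", "mass", "height", "height", "color", "color"),
  B = c(0.1, 0.2, 0.5, 0.3, 0.9, 0.1)
)

is_mass    <- df1$A == "mass"
thresholds <- sort(unique(df1$B))

# Assumed rule (not confirmed by the question):
# a row is "predicted mass" when B >= threshold
count_cells <- function(threshold) {
  predicted <- df1$B >= threshold
  c(
    threshold = threshold,
    TP = sum(predicted  &  is_mass),
    FP = sum(predicted  & !is_mass),
    FN = sum(!predicted &  is_mass),
    TN = sum(!predicted & !is_mass)
  )
}

# sapply() returns one column per threshold; t() turns those into rows
results <- as.data.frame(t(sapply(thresholds, count_cells)))
results
```

A for loop that assigns to the same variable on every pass keeps only the value from the last iteration, which is why the looped TP in the question did not give one count per i. sapply() collects one result per threshold instead, and the resulting data frame can be passed to ggplot directly.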