Add a column based on the values of two other columns in the same data frame in R

1

votes

Suppose I have a data frame with three variables, like the one below. I want to add a fourth variable whose values depend on the second and third variables, e.g. if var2 = var3 then var4 = 3, if var2 = Y and var3 = NA then var4 = 1, and if var2 = NA and var3 = Y then var4 = 2.

var1 var2 var3
m01  Y    NA    
m02  Y    NA
m03  NA   Y
m04  NA   Y
m05  Y    Y
m06  Y    NA
m07  Y    Y

I would like to get a data frame like this:

var1 var2 var3 var4
m01  Y    NA   1
m02  Y    NA   1
m03  NA   Y    2
m04  NA   Y    2
m05  Y    Y    3
m06  Y    NA   1
m07  Y    Y    3

I am trying with ifelse but I haven't had success.

Any ideas?

r dataframe

df$var4 <- apply(df[-1], 1, function(x){sum(which(x == 'Y'))}) - alistaire

This post might help you understand how to use nested ifelse. - Ronak Shah

2

votes

Try this:

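library(dplyr)

# Rebuild the example data from the question
df <- data.frame(var1 = paste0("m0", 1:7),
                 var2 = c(rep("Y", 2), rep(NA, 2), rep("Y", 3)),
                 var3 = c(rep(NA, 2), rep("Y", 3), NA, "Y"))

# The fourth argument of if_else() is the value used when the condition is NA
mutate(df, var4 = if_else(var2 == "Y",
                          if_else(var3 == "Y", 3, 1, 1),
                          2, 2))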

if_else from the dplyr package takes a fourth argument (missing) that sets the result when the condition is NA, so the missing values are handled as well.
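To see why the NA handling matters, here is a minimal base R sketch (not the answer's dplyr code). It assumes real NA values and only "Y" as a non-missing value, and uses %in% as a stand-in because it never returns NA. The variable names plain and safe are just for illustration.

```r
var2 <- c("Y", NA, "Y")
var3 <- c(NA, "Y", "Y")

# Comparing with == propagates NA, so a plain nested ifelse loses rows
plain <- ifelse(var2 == "Y", ifelse(var3 == "Y", 3, 1), 2)
plain
# [1] NA NA  3

# %in% returns FALSE instead of NA for missing values
safe <- ifelse(var2 %in% "Y", ifelse(var3 %in% "Y", 3, 1), 2)
safe
# [1] 1 2 3
```

Note that with this sketch a row where both columns are NA would get 2, which the question does not define.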

6

votes

Everyone forgets about poor old interaction:

c(3,2,1,4)[interaction(lapply(dat[-1], is.na))]
#[1] 1 1 2 2 3 1 3
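In case the lookup vector c(3,2,1,4) looks like magic, here is a sketch that unpacks it. It assumes the data frame is called dat as in the one-liner above and holds real NA values; na_pattern is a name made up for illustration. The levels of the interaction come out with the first factor varying fastest, and the lookup vector is simply ordered to match.

```r
dat <- data.frame(
  var1 = sprintf("m%02d", 1:7),
  var2 = c("Y", "Y", NA, NA, "Y", "Y", "Y"),
  var3 = c(NA, NA, "Y", "Y", "Y", NA, "Y"),
  stringsAsFactors = FALSE
)

# One factor level per combination of is.na(var2) and is.na(var3)
na_pattern <- interaction(lapply(dat[-1], is.na))
levels(na_pattern)
# [1] "FALSE.FALSE" "TRUE.FALSE"  "FALSE.TRUE"  "TRUE.TRUE"

# Indexing by a factor uses its integer codes, so:
#   FALSE.FALSE (both Y)  -> 3
#   TRUE.FALSE  (var2 NA) -> 2
#   FALSE.TRUE  (var3 NA) -> 1
#   TRUE.TRUE   (both NA) -> 4, a case the question does not define
dat$var4 <- c(3, 2, 1, 4)[na_pattern]
dat$var4
# [1] 1 1 2 2 3 1 3
```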

1

votes

Using ifelse:

# The data uses upper-case "Y"; NA (rather than "?") as the fallback keeps var4 numeric
df$var4 <- ifelse(df$var2 == df$var3, 3, 
             ifelse(df$var3 == "NA" & df$var2 == "Y", 1, 
               ifelse(df$var2 == "NA" & df$var3 == "Y", 2, NA)))

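This works if the "NA" entries are literal strings or factor levels. If they are real missing values, replace df$var3 == "NA" with is.na(df$var3) and df$var2 == "NA" with is.na(df$var2), and test those conditions first, because df$var2 == df$var3 evaluates to NA whenever either side is missing.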
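For real missing values, a nested ifelse along these lines should work. This is a sketch that assumes the columns are character vectors with true NA entries; returning NA when both columns are missing is my own choice, since the question does not cover that case.

```r
df <- data.frame(
  var1 = sprintf("m%02d", 1:7),
  var2 = c("Y", "Y", NA, NA, "Y", "Y", "Y"),
  var3 = c(NA, NA, "Y", "Y", "Y", NA, "Y"),
  stringsAsFactors = FALSE
)

# Check for missing values before comparing the columns,
# because NA == "Y" is NA rather than FALSE
df$var4 <- ifelse(is.na(df$var2) & is.na(df$var3), NA,
             ifelse(is.na(df$var3), 1,
               ifelse(is.na(df$var2), 2,
                 ifelse(df$var2 == df$var3, 3, NA))))
df$var4
# [1] 1 1 2 2 3 1 3
```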

1

votes

A handful of options:

df <- read.table(text = 'var1 var2 var3
m01  Y    NA    
m02  Y    NA
m03  NA   Y
m04  NA   Y
m05  Y    Y
m06  Y    NA
m07  Y    Y', header = TRUE, stringsAsFactors = FALSE)

A typical base R approach would be to use apply to iterate row-wise across the requisite columns. This silently coerces the data frame to a matrix, which is why some avoid this approach.

apply(df[-1], 1, function(x){sum(which(x == 'Y'))})
#> [1] 1 1 2 2 3 1 3
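The reason sum(which(...)) produces 1, 2 or 3 is that which returns the positions of the matches (and skips NA), so the positions are being added up. A small sketch, where score_row is a hypothetical helper name and not part of the approach above:

```r
# score_row is a hypothetical helper wrapping the anonymous function
score_row <- function(x) sum(which(x == "Y"))

score_row(c("Y", NA))   # only position 1 matches -> 1
score_row(c(NA, "Y"))   # only position 2 matches -> 2
score_row(c("Y", "Y"))  # positions 1 and 2 match -> 1 + 2 = 3
score_row(c(NA, NA))    # nothing matches -> 0
```

This only works out so neatly because there are exactly two columns; with more columns different combinations could give the same sum.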

You could translate it to dplyr with rowwise, which does not coerce to a matrix, but is not usually the fastest possible approach:

library(dplyr)

df %>% 
    rowwise() %>% 
    mutate(var4 = sum(which(c(var2, var3) == 'Y')))
#> Source: local data frame [7 x 4]
#> Groups: <by row>
#> 
#> # A tibble: 7 x 4
#>    var1  var2  var3  var4
#>   <chr> <chr> <chr> <int>
#> 1   m01     Y  <NA>     1
#> 2   m02     Y  <NA>     1
#> 3   m03  <NA>     Y     2
#> 4   m04  <NA>     Y     2
#> 5   m05     Y     Y     3
#> 6   m06     Y  <NA>     1
#> 7   m07     Y     Y     3

This will also fail as-is for factors in R versions before 4.1.0 (where c converts them to integers), but they can be coerced beforehand or internally, or you could use is.na instead of checking equality.

More creative base options include pasting the columns together to create a factor that can be deliberately leveled for coercion to integer:

as.integer(factor(paste0(df$var2, df$var3), levels = c('YNA', 'NAY', 'YY')))
#> [1] 1 1 2 2 3 1 3
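To make the intermediate step visible: paste0 turns a missing value into the string "NA", and any key not listed in levels becomes NA after the factor conversion. A sketch with made-up vectors (keys and codes are illustrative names), including a both-missing row that the question does not define:

```r
# paste0 converts NA to the string "NA" before concatenating
keys <- paste0(c("Y", NA, "Y", NA), c(NA, "Y", "Y", NA))
keys
# [1] "YNA"  "NAY"  "YY"   "NANA"

# The order of levels decides the integer code; unlisted keys become NA
codes <- as.integer(factor(keys, levels = c("YNA", "NAY", "YY")))
codes
# [1]  1  2  3 NA
```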

Another option is to use do.call to pass a list of a function and each desired variable of df (flattened with c) to mapply:

do.call(mapply, 
        c(function(...){sum(which(!is.na(c(...))))}, 
          df[-1], 
          USE.NAMES = FALSE))
#> [1] 1 1 2 2 3 1 3
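If the do.call wrapper is hard to read, it should be equivalent to calling mapply directly with the two columns spelled out. A sketch assuming character columns with real NA values; the argument names a and b and the variable res are arbitrary:

```r
df <- data.frame(
  var1 = sprintf("m%02d", 1:7),
  var2 = c("Y", "Y", NA, NA, "Y", "Y", "Y"),
  var3 = c(NA, NA, "Y", "Y", "Y", NA, "Y"),
  stringsAsFactors = FALSE
)

# mapply walks var2 and var3 in parallel, one pair of values per call
res <- mapply(function(a, b) sum(which(!is.na(c(a, b)))),
              df$var2, df$var3,
              USE.NAMES = FALSE)
res
# [1] 1 1 2 2 3 1 3
```

The do.call form mainly pays off when the number of columns is not known in advance.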

If you really want the ifelse logic, dplyr::case_when lets you use cascading conditionals without the messy syntax:

df %>% mutate(var4 = case_when(var2 == 'Y' & var3 == 'Y' ~ 3,
                               var2 == 'Y' ~ 1, 
                               var3 == 'Y' ~ 2))
#>   var1 var2 var3 var4
#> 1  m01    Y <NA>    1
#> 2  m02    Y <NA>    1
#> 3  m03 <NA>    Y    2
#> 4  m04 <NA>    Y    2
#> 5  m05    Y    Y    3
#> 6  m06    Y <NA>    1
#> 7  m07    Y    Y    3
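For anyone without dplyr, the same first-match-wins cascade can be approximated in base R by assigning the lowest-priority rule first and letting later assignments overwrite it. This is a sketch of a substitute technique, not what case_when does internally; it uses %in% so that missing values count as FALSE.

```r
df <- data.frame(
  var1 = sprintf("m%02d", 1:7),
  var2 = c("Y", "Y", NA, NA, "Y", "Y", "Y"),
  var3 = c(NA, NA, "Y", "Y", "Y", NA, "Y"),
  stringsAsFactors = FALSE
)

# Start with NA so rows matching no rule stay missing
var4 <- rep(NA_real_, nrow(df))

# Lowest priority first; each later rule overwrites the earlier ones
var4[df$var3 %in% "Y"] <- 2
var4[df$var2 %in% "Y"] <- 1
var4[df$var2 %in% "Y" & df$var3 %in% "Y"] <- 3

df$var4 <- var4
df$var4
# [1] 1 1 2 2 3 1 3
```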
