0
votes

I want to code a new variable in a dataframe based on a set of rules. I have a dataframe df1 with a subject variable, a time variable, and variables A, B and C, like this:

subject <- c(1,1,1,1,1,1,2,2,2,2,2,2)
time <- c(1,2,3,4,5,6,1,2,3,4,5,6)
A <- c(1,7,7,6,6,5,1,2,3,NA,NA,NA)
B <- c(2,1,1,1,1,1,6,5,4,NA,NA,NA)
C <-c(7,1,6,1,6,1,6,2,4,NA,NA,NA)

df1 <- data.frame(subject,time,A,B,C)

Values in A, B, and C range from 1 (lowest) to 7 (highest), there are also some NA. Now I want to code a new dichotomous variable, newvar. The first row for every subject should always be coded 0. 1 should be coded whenever the variable/s with the highest score (A,B or C) within one row change/s to one or more different variable/s in the next row. It doesn't matter if the value changes from one row to the next within one variable, only if there is a change in which of the three variables has the highest score within one row compared to the previous row.

The examples from df1 should make this clearer:

  • Row 1 is coded 0 because it is the first row for subject 1. C has the highest score among the three variables A, B, and C.

  • In row 2, A has the highest score. Therefore, newvar = 1.

  • In row 3, A still has the highest score, therefore, newvar = 0.

  • In row 4, A still has the highest score --> newvar = 0.

  • In row 5, now A and C both have the highest score, therefore, newvar = 1.

  • In row 6, only A has the highest score again, therefore, newvar = 1.

  • Row 7 is the first row for subject 2, therefore newvar is coded 0.

  • In row 8, newvar should be coded 1, because in the previous row, B and C equally had the highest score, now it is only B.

  • In row 9, newvar should again be coded 1, because now B and C have the highest score again within the row.

  • Rows 10 to 12 should be coded NA.

This is what it should look like:

newvar <-c(0,1,0,0,1,1,0,1,1,NA,NA,NA)
df2 <- data.frame(subject,time,A,B,C,newvar)

I would greatly appreciate any input in how to go about this!

1

1 Answers

1
votes

Here is one approach using tidyverse. First, pivot your data into long form. Then, for each subject time combination, collect the column names that are equal to the highest value for that combination. This is stored as highest_values.

Then, change the group to subject. For each subject check if the time is the lowest value of time - if it is, code as 0 (there are alternative options if you just want to code 0 for the first row independent of time value). If the highest_values does not have any column names, code as NA. If there is a difference between highest_values and the previous row (a change), code as 1. Otherwise, it assumes highest_values has not changed, and code as 0.

library(tidyverse)

df1 %>%
  pivot_longer(cols = -c(subject, time)) %>%
  group_by(subject, time) %>%
  summarise(highest_values = toString(name[which(value == max(value))])) %>%
  group_by(subject) %>%
  mutate(newvar = case_when(
    time == min(time) ~ 0,
    highest_values == "" ~ NA_real_,
    highest_values != lag(highest_values) ~ 1,
    TRUE ~ 0
  )) %>%
  right_join(df1)

Output

   subject  time highest_values newvar     A     B     C
     <dbl> <dbl> <chr>           <dbl> <dbl> <dbl> <dbl>
 1       1     1 "C"                 0     1     2     7
 2       1     2 "A"                 1     7     1     1
 3       1     3 "A"                 0     7     1     6
 4       1     4 "A"                 0     6     1     1
 5       1     5 "A, C"              1     6     1     6
 6       1     6 "A"                 1     5     1     1
 7       2     1 "B, C"              0     1     6     6
 8       2     2 "B"                 1     2     5     2
 9       2     3 "B, C"              1     3     4     4
10       2     4 ""                 NA    NA    NA    NA
11       2     5 ""                 NA    NA    NA    NA
12       2     6 ""                 NA    NA    NA    NA

Edit (2/11/21): Based on the comment below, sometimes there are rows with missing data. In these cases, newvar should reflect the last or most recent "highest_values" excluding those rows.

To do this, would filter out those rows without a "highest_values" value before the group_by. Then, the most recent "highest_values" will be the value that is not missing. Also, you won't need to set newvar as NA - this will happen automatically with the right_join.

Here is the revised code:

df1 %>%
  pivot_longer(cols = -c(subject, time)) %>%
  group_by(subject, time) %>%
  summarise(highest_values = toString(name[which(value == max(value))])) %>%
  filter(highest_values != "") %>%
  group_by(subject) %>%
  mutate(newvar = case_when(
    time == min(time) ~ 0,
    highest_values != lag(highest_values) ~ 1,
    TRUE ~ 0
  )) %>%
  right_join(df1) %>%
  arrange(subject, time)

I added a row of data to demonstrate with an example.

Output

   subject  time highest_values newvar     A     B     C
     <dbl> <dbl> <chr>           <dbl> <dbl> <dbl> <dbl>
 1       1     1 C                   0     1     2     7
 2       1     2 A                   1     7     1     1
 3       1     3 A                   0     7     1     6
 4       1     4 A                   0     6     1     1
 5       1     5 A, C                1     6     1     6
 6       1     6 A                   1     5     1     1
 7       2     1 B, C                0     1     6     6
 8       2     2 B                   1     2     5     2
 9       2     3 B, C                1     3     4     4
10       2     4 NA                 NA    NA    NA    NA
11       2     5 NA                 NA    NA    NA    NA
12       2     6 B, C                0     2     3     3