4
votes

I have a dataset with following columns :

    ID  Measure1    Measure2    XO  X1  x2  x3  x4  x5
    1   30  2   item1   item1   item23  NA      item6   item9
    2   23  2   item1   item323 item1   item4   item5   NA      
    3   2   2   item1   item78  item3   NA      item1   item5

and I want to create a flag variable with this this piece of SAS Code in R:

 data dt2;
 set dt1;
 array x {5} x1 - x5;
 do i=1 to 5;
 if x0=x{i} then do; 
 flag=i;
 leave;
 end;
 end;
 drop i;
 run;

The goal is to be able to browse thorugh the values of x1-x5 and see where xo is equal to any of them and return the position , for example if item1 is found at x1 then return to me value 1 if found in position x3 return 3.

The end product would look something like this:

    ID  Measure1    Measure2    XO  X1  x2  x3  x4  x5  Flag
    1   30  2   item1   item1   item23  NA          item6   item9   1
    2   23  2   item1   item323 item1   item4       item5   NA      2
    3   2   2   item1   item78  item3   NA          item1   item5   4

Keep in mind that there might be cases where all rows rom x1-x5 contain NA, in that case i would like to return blank, is this possible?

I haven’t been able to find in R something equivalent in the sense of being dynamic (without writing multiple if statements or case when with sqldf) because now the columns might be 5 but can alter in the future to up to 20.

Any ideas?

2
May I recommend: springer.com/us/book/9780387094182 - this is not an answer, but is a useful resource if you are going to R from SAS.TARehman

2 Answers

4
votes

We can use max.col

df1$Flag <- max.col(df1$XO[row(df1[-1])]==df1[-1], 'first')
df1
#    XO      X1     x2     x3    x4    x5 Flag
#1 item1   item1 item23  item5 item6 item9    1
#2 item1 item323  item1  item4 item5 itm87    2
#3 item1  item78  item3 item98 item1 item5    4

Update

Based on the updated dataset, we can replace the NA elements in the logical matrix with FALSE and then use max.col. If there are no TRUE values in a row, we can make it to NA by getting the rowSums, check whether it is 0, change values that are 0 to NA (NA^..) and multiply with max.col(..

df3 <- df2[5:ncol(df2)]
i1 <- df2$XO[row(df3)]==df3
i2 <- replace(i1, is.na(i1), FALSE)
df2$Flag <- max.col(i2, 'first') * NA^(rowSums(i2)==0)
df2
#  ID Measure1 Measure2    XO      X1     x2    x3    x4    x5 Flag
#1  1       30        2 item1   item1 item23  <NA> item6 item9    1
#2  2       23        2 item1 item323  item1 item4 item5  <NA>    2
#3  3        2        2 item1  item78  item3  <NA> item1 item5    4
1
votes

1) base R as.matrix(DF[5:9]) == XO produces a logical matrix with the same dimensions as DF[5:9]. Apply wm to each row. wm is the same as which.max except it returns an NA in those cases where there are no TRUE values, i.e. if there are only NA and FALSE values in a row. If there are no such cases then we could use which.max in place of wm reducing the solution to a single line. If there is more than one TRUE value in a row it will use the first.

wm <- function(x) if (isTRUE(any(x))) which.max(x) else NA
transform(DF, Flag = apply(as.matrix(DF[-(1:4)]) == XO, 1, wm))

giving:

  ID Measure1 Measure2    XO      x1     x2    x3    x4    x5 Flag
1  1       30        2 item1   item1 item23  <NA> item6 item9    1
2  2       23        2 item1 item323  item1 item4 item5  <NA>    2
3  3        2        2 item1  item78  item3  <NA> item1 item5    4

2) dplyr/tidyr The main complexity here is that the data is not normalized but instead the x1, ..., xn columns are written in wide form. To address this we can use tidyr's gather to convert the data frame to long form, checking whether the item equals XO and joining back to the original data frame:

library(dplyr)
library(tidyr)
DF %>% 
   left_join(DF %>% gather(Flag, item, -(1:4)) %>% filter(item == XO)) %>%
   select(-item) %>%
   mutate(Flag = match(Flag, names(DF)[-(1:4)]))

giving:

  ID Measure1 Measure2    XO      X1     x2    x3    x4    x5 Flag
1  1       30        2 item1   item1 item23  <NA> item6 item9    1
2  2       23        2 item1 item323  item1 item4 item5  <NA>    2
3  3        2        2 item1  item78  item3  <NA> item1 item5    4

3) Another base solution This base solution is inspired by the dplyr/tidyr solution. It uses reshape to convert the original data frame to long form, it picks out the equal rows and then merges the Flag variable to the original data frame:

r <- reshape(DF, list(names(DF)[-(1:4)]), "X", "Flag", direction = "long")
s <- subset(r, X == XO)[c("ID", "Flag")]
merge(DF, s, all.x = TRUE)

giving:

  ID Measure1 Measure2    XO      X1     x2    x3    x4    x5 Flag
1  1       30        2 item1   item1 item23  <NA> item6 item9    1
2  2       23        2 item1 item323  item1 item4 item5  <NA>    2
3  3        2        2 item1  item78  item3  <NA> item1 item5    4

Note: To make this reproducible we provide the code to create the input below. Next time please provide this yourself.

Lines <- "  ID  Measure1    Measure2    XO  x1  x2  x3  x4  x5
    1   30  2   item1   item1   item23  NA      item6   item9
    2   23  2   item1   item323 item1   item4   item5   NA      
    3   2   2   item1   item78  item3   NA      item1   item5"

DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)