ifelse formatting for groups of multiple columns R

Question

My df contains series of columns with similar names that are grouped every three columns, similar to this:

> df <- data.frame(c(0,1,4,5),c(0,1,3,3),c(0,1,1,1),c(0,1,1,1),c(0,1,1,1),c(0,1,1,1),c(0,8,1,9),c(6,1,1,1),c(5,1,3,4))
> names(df) <- c("AA1","AA2","AA3","BB1","BB2","BB3","CC1","CC2","CC3")
> df
  AA1 AA2 AA3 BB1 BB2 BB3 CC1 CC2 CC3
1   0   0   0   0   0   0   0   3   3
2   1   1   1   1   1   1   8   1   1
3   4   6   1   1   1   1   1   1   3
4   5   5   1   1   1   1   9   1   4

This essentially shows 3 different measurements (1, 2, 3) per examination type (AA, BB, CC) for 4 patients. In reality I have a huge dataset with 3 measurements for 10 different examinations on 2,000 patients. I would like to add a new column that classifies disease as follows: if the score for at least one measurement per examination (XX1, XX2, XX3, where XX = AA, BB or CC) is >4, then the patient has the disease. So the new dataset would look like this:

  AA1 AA2 AA3 BB1 BB2 BB3 CC1 CC2 CC3 DISEASE
1   0   0   0   0   0   0   0   3   3       0
2   1   1   1   1   1   1   8   1   1       1
3   4   6   1   1   1   1   1   1   3       1
4   5   5   1   1   1   1   9   1   4       1

Roland · Accepted Answer · 2014-08-04T15:14:24

df<-data.frame(c(0,1,4,5),c(0,1,3,3),c(0,1,1,1),c(0,1,1,1),c(0,1,1,1),c(0,1,1,1),c(0,8,1,9),c(6,1,1,1),c(5,1,3,4))

names(df)<-c("AA1","AA2","AA3","BB1","BB2","BB3","CC1","CC2","CC3")

A solution with your data format:

rowSums(df > 4) > 0
#[1]  TRUE  TRUE FALSE  TRUE

This uses the fact that logical values get coerced to 0 and 1 when calculating their sum.
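
To get the 0/1 DISEASE column the question asks for, the logical result can be converted to integer. This is only a minimal sketch using the data as constructed by the code above; the intermediate names `above` and `n_above` are introduced here for illustration and are not part of the original answer.

```r
# Same data as in the answer, with the column names set directly
df <- data.frame(
  AA1 = c(0, 1, 4, 5), AA2 = c(0, 1, 3, 3), AA3 = c(0, 1, 1, 1),
  BB1 = c(0, 1, 1, 1), BB2 = c(0, 1, 1, 1), BB3 = c(0, 1, 1, 1),
  CC1 = c(0, 8, 1, 9), CC2 = c(6, 1, 1, 1), CC3 = c(5, 1, 3, 4)
)

above   <- df > 4          # logical matrix: TRUE where a measurement exceeds 4
n_above <- rowSums(above)  # TRUE counts as 1, so this is the number of hits per patient

# Patient is flagged if at least one measurement exceeds 4
df$DISEASE <- as.integer(n_above > 0)
df$DISEASE
#[1] 1 1 0 1
```

If the data frame already holds non-measurement columns (an id, or DISEASE itself), restrict the comparison to the measurement columns first, for example `df[, 1:9] > 4`.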

But tidy data is usually preferable:

df$id <- rownames(df)
library(reshape2)
DF <- melt(df, id.vars = "id")
# strip the digits to get the exam type, strip the letters to get the measurement number
DF$exam <- gsub("[[:digit:]]+", "", DF$variable)
DF$meas <- as.numeric(gsub("[[:alpha:]]+", "", DF$variable))

head(DF)
#  id variable value exam meas
#1  1      AA1     0   AA    1
#2  2      AA1     1   AA    1
#3  3      AA1     4   AA    1
#4  4      AA1     5   AA    1
#5  1      AA2     0   AA    2
#6  2      AA2     1   AA    2


#Is patient diseased?
library(plyr)
ddply(DF, .(id), summarize, disease = any(value > 4))
#  id disease
#1  1    TRUE
#2  2    TRUE
#3  3   FALSE
#4  4    TRUE

#Which exam was positive?
ddply(DF, .(id, exam), summarize, disease = any(value > 4))
#   id exam disease
#1   1   AA   FALSE
#2   1   BB   FALSE
#3   1   CC    TRUE
#4   2   AA   FALSE
#5   2   BB   FALSE
#6   2   CC    TRUE
#7   3   AA   FALSE
#8   3   BB   FALSE
#9   3   CC   FALSE
#10  4   AA    TRUE
#11  4   BB   FALSE
#12  4   CC    TRUE
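
The per-exam result can probably also be obtained without reshaping, by grouping the wide columns on their name prefix. The following is a hedged base R sketch, not part of the original answer; it assumes every measurement column is named as an exam code followed by a number, and the names `exam_of` and `per_exam` are hypothetical.

```r
# Wide data as constructed in the answer (measurement columns only)
df <- data.frame(
  AA1 = c(0, 1, 4, 5), AA2 = c(0, 1, 3, 3), AA3 = c(0, 1, 1, 1),
  BB1 = c(0, 1, 1, 1), BB2 = c(0, 1, 1, 1), BB3 = c(0, 1, 1, 1),
  CC1 = c(0, 8, 1, 9), CC2 = c(6, 1, 1, 1), CC3 = c(5, 1, 3, 4)
)

# Exam code of each column: the name with its trailing digits removed
exam_of <- sub("[[:digit:]]+$", "", names(df))

# For each exam, TRUE if any of that exam's measurements exceeds 4
per_exam <- sapply(unique(exam_of), function(e) {
  rowSums(df[, exam_of == e, drop = FALSE] > 4) > 0
})
per_exam
#     AA    BB    CC
#1 FALSE FALSE  TRUE
#2 FALSE FALSE  TRUE
#3 FALSE FALSE FALSE
#4  TRUE FALSE  TRUE

# Overall flag: positive on at least one exam
df$DISEASE <- as.integer(rowSums(per_exam) > 0)
```

This keeps the original wide layout, which may be convenient when the only goal is to append a column; the melted form remains easier to extend to other summaries.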
