2
votes

I have a logical dataframe like:

> test
  apple apple apple  kiwi kiwi banana banana banana apple orange
1 FALSE  TRUE FALSE FALSE TRUE  FALSE   TRUE   TRUE  TRUE  FALSE
2  TRUE  TRUE  TRUE  TRUE TRUE   TRUE  FALSE   TRUE  TRUE  FALSE
3 FALSE FALSE FALSE  TRUE TRUE  FALSE  FALSE  FALSE FALSE  FALSE

My aim is to combine the column with same column name. That's to say the output should be a dataframe with 4 column (apple, kiwi, banana, orange).

I tried :

testmerge <- df[, !duplicated(colnames(df))]

But the output is not what I look for. For each row given same column name, the output should be True as long there is at least 1 TRUE. For each row given same column name, the output should be False if there is 0 TRUE.

For intance first row first column is should be TRUE instead of FALSE.

Undesired testmerge output:

  apple  kiwi banana orange
1 FALSE FALSE  FALSE  FALSE
2  TRUE  TRUE   TRUE  FALSE
3 FALSE  TRUE  FALSE  FALSE

Desired output:

  apple  kiwi banana orange
1 TRUE   TRUE TRUE   FALSE
2 TRUE   TRUE TRUE   FALSE
3 FALSE  TRUE FALSE  FALSE

Replicate dataframe:

   test <- structure(list(apple = c(FALSE, TRUE, FALSE), apple = c(TRUE, TRUE, 
FALSE), apple = c(FALSE, TRUE, FALSE), kiwi = c(FALSE, TRUE, TRUE
), kiwi = c(TRUE, TRUE, TRUE), banana = c(FALSE, TRUE, FALSE), banana = c(TRUE, 
FALSE, FALSE), banana = c(TRUE, TRUE, FALSE), apple = c(TRUE, TRUE, 
FALSE), orange = c(FALSE, FALSE, FALSE)), .Names = c("apple", "apple", 
"apple", "kiwi", "kiwi", "banana", "banana", "banana", "apple", "orange"), row.names = c(NA, 
-3L), class = "data.frame")
3

3 Answers

3
votes

Using sapply and rowSums:

as.data.frame(
  sapply(unique(colnames(test)),
         function(i){
           rowSums(test[, grepl(i, colnames(test)), drop = FALSE]) > 0})
  )

#output
#   apple kiwi banana orange
# 1  TRUE TRUE   TRUE  FALSE
# 2  TRUE TRUE   TRUE  FALSE
# 3 FALSE TRUE  FALSE  FALSE

We are subsetting datafame based on fruit names, then computing rowSums. TRUE is 1 and FALSE is 0, so rowSums of more than zero will have at least one TRUE value. I have drop = FALSE, so the subset will stay as a dataframe in cases like orange where there is only one column.

Note: If the data is long then Reduce() solution by @akrun works better, but if data is wide then rowSums() is more efficient.

2
votes

There maybe more efficient ways to achieve this, but here's a try

I would suggest to convert the column names to to unique ones using make.unique, then convert to long format, check your condition by a row id and the column names (made unique again) and then convert back to a wide format, something like

library(data.table)
setnames(setDT(test), make.unique(names(test))) # Make column names unique
res <- melt(test[, id := .I], id = "id" # Add a row index and melt by it
           )[, sum(value) > 0, # Check condition >>
               by = .(id, Names = sub("\\..*", "", variable))] # by row id and unique names
dcast(res, id ~ Names, value.var = "V1") # Convert back to wide format
#    id apple banana kiwi orange
# 1:  1  TRUE   TRUE TRUE  FALSE
# 2:  2  TRUE   TRUE TRUE  FALSE
# 3:  3 FALSE  FALSE TRUE  FALSE
2
votes

Another option would be to split the sequence of columns of the dataset by the names of it into a list, loop through the list, subset based on the numeric index, use Reduce to check whether there are any TRUE in each row.

sapply(split(seq_along(test), names(test)), function(i) Reduce(`|`, test[i]))
#     apple banana kiwi orange
#[1,]  TRUE   TRUE TRUE  FALSE
#[2,]  TRUE   TRUE TRUE  FALSE
#[3,] FALSE  FALSE TRUE  FALSE