1
votes

There are no recent answers to this question using the current tidyverse verbs (R 4.1 & tidyverse 1.3.1 in my case). I've tried using mutate with both case_when() and ifelse() with select_if() to conditionally fill a new variable with a value calculated from the number of TRUE values in specific other columns by row but neither seem to be filtering the correct columns to calculate from, as intended. I could probably pivot longer to replace my column groupings and avoid the need to filter which columns are used in the mutate calculation but I want to keep one response per row for merging later. Here's a reproducible example.

library(tidyverse)
set.seed(195)

# create dataframe
response_id <- rep(1:461)
questions <- c("overall","drought","domestic","livestock","distance")
answers <- c("a","b","c","d","e")
colnames <- apply(expand.grid(questions, answers), 1, paste, collapse="_")
df <- tibble(response_id)
# data is actually an unknown mix of TRUE and FALSE values in all columns but just doing that for one column for now for simplicity
df[,colnames] = FALSE
df$overall_a[sample(nrow(df),100)] <- TRUE

# using ifelse and select if to filter which columns to sum
df %>%
 mutate(positive = ifelse(select_if(isTRUE), sum(str_detect(colnames(df), "a|b")), NA)) %>%
 mutate(negative = ifelse(select_if(isTRUE), sum(str_detect(colnames(df), "c|d|e")), NA)) %>%
 select(response_id, positive, negative)

# using case_when
df %>%
 mutate(positive = case_when(TRUE ~ sum(str_detect(colnames(df), "a|b"))), NA) %>%
 mutate(negative = case_when(TRUE ~ sum(str_detect(colnames(df), "c|d|e"))), NA) %>%
 select(response_id, positive, negative)

The desired output should be as follows. Thanks for any help on this!

# A tibble: 461 × 3
   response_id positive negative
         <int>    <int>    <int>
 1           1       0       0
 2           2       0       0
 3           3       0       0
 4           4       0       0
 5           5       1       0
 6           6       1       0
 7           7       0       0
 8           8       1       0
 9           9       0       0
10          10       1       0
# … with 451 more rows

1

1 Answers

3
votes

Having data in column names is not considered "tidy" and the "tidyverse" works best with tidy data. Rather than hacking against the column names, the pivoting approach would be the most consistent with the tidy philosophy. Plus it will scale better for more categories. For example

df %>% 
  pivot_longer(-response_id) %>% 
  separate(name, into=c("category", "code")) %>% 
  mutate(sentiment=case_when(
    code %in% c("a", "b") ~ "positive", 
    code %in% c("c", "d", "e") ~ "negative")) %>% 
  group_by(response_id, sentiment) %>% 
  summarize(count=sum(value)) %>% 
  pivot_wider(response_id, names_from=sentiment, values_from=count)

It's not as concise but it more directly says what it's doing.

But if you really want to keep data in the row names, you can perform rowwise summaries use c_across() with the latest dplyr

df %>% 
  rowwise() %>% 
  mutate(
    positive=sum(c_across(ends_with(c("_a", "_b")))),
    negative=sum(c_across(ends_with(c("_c", "_d", "_e"))))) %>% 
  select(response_id, positive, negative)