splitting all the columns in a dataframe based on their value and delimiter

Question

I have a dataframe as follows:

df <- data.frame(s1=c("a","a/b","b","a","a/b"),s2=c("ab/bb","bb","ab","ab","bb"),s3=c("Doa","Doa","Dob/Doa","Dob/Doa","Dob"))

   s1    s2      s3
1   a ab/bb     Doa
2 a/b    bb     Doa
3   b    ab Dob/Doa
4   a    ab Dob/Doa
5 a/b    bb     Dob

Each column could take one of two values or both separated by a "/". I would like to break these down into binary sets of columns based on their values.

The desired data frame would be:

   a   b   ab   bb   Doa   Dob
1  1   0    1    1    1    0
2  1   1    0    1    1    0
3  0   1    1    0    1    1
4  1   0    1    0    1    1
5  1   1    0    1    0    1

I tried doing this with tidyr::separate and tapply, though it got fairly complicated as I had to specify column names for every pair. There were many columns.

IceCreamToucan IceCreamToucan · Accepted Answer · 2019-12-30T18:56:25

First make sure your data is character and not factor. Then split into one data.frame for each row and for each of those rows, take the str_split on '/', set the names equal to the values, and make it a list. Now you can bind these results together, and set all non-na values to 1 at the end.

library(tidyverse) # dplyr, + stringr for str_split, + purrr for map

df %>% 
  mutate_all(as.character) %>% 
  split(seq(nrow(.))) %>% 
  map(~ str_split(., '/') %>% unlist %>% setNames(., .) %>% as.list) %>% 
  bind_rows %>% 
  mutate_all(~as.numeric(!is.na(.)))

# # A tibble: 5 x 6
#       a    ab    bb   Doa     b   Dob
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1     1     1     1     0     0
# 2     1     0     1     1     1     0
# 3     0     1     0     1     1     1
# 4     1     1     0     1     0     1
# 5     1     0     1     0     1     1

Another similar option (same output)

df %>% 
  mutate_all(as.character) %>% 
  split(seq(nrow(.))) %>% 
  map(~ str_split(., '/') %>% unlist %>% table %>% as.list) %>% 
  bind_rows %>% 
  mutate_all(replace_na, 0)

Or you could convert to long first then back to wide, similar to akrun's answer

library(data.table)
setDT(df)
library(magrittr)

melt(df[, r := 1:.N], 'r') %>% 
  .[, .(value = strsplit(value, '/')[[1]]), .(r, variable)] %>% 
  dcast(r ~ value, fun.aggregate = length)

#    r Doa Dob a ab b bb
# 1: 1   1   0 1  1 0  1
# 2: 2   1   0 1  0 1  1
# 3: 3   1   1 0  1 1  0
# 4: 4   1   1 1  1 0  0
# 5: 5   0   1 1  0 1  1

splitting all the columns in a dataframe based on their value and delimiter

2 Answers