1
votes

I have a dataframe, in which one of the columns contains strings, separated by commas. I was wondering if there is an efficient way to make these comma-separated values into new column headers, and make these new column values binary if they were part of the original row. A sample of my data can be reproduced below:

data <- structure(list(id = c(6901257L, 6304928L, 7919400L), amenities = 
c("Wireless Internet,Air conditioning,Kitchen,Heating,Family/kid 
friendly,Essentials,Hair dryer,Iron,translation missing: 
en.hosting_amenity_50",  "Wireless Internet,Air 
conditioning,Kitchen,Heating,Family/kid friendly,Washer,Dryer,Smoke 
detector,Fire extinguisher,Essentials,Shampoo,Hangers,Hair 
dryer,Iron,translation missing: en.hosting_amenity_50",  "TV,Cable 
TV,Wireless Internet,Air 
conditioning,Kitchen,Breakfast,Buzzer/wireless 
intercom,Heating,Family/kid friendly,Smoke detector,Carbon monoxide 
detector,Fire extinguisher,Essentials,Shampoo,Hangers,Hair 
dryer,Iron,Laptop friendly workspace,translation missing: 
en.hosting_amenity_50" )), .Names = c("id", "amenities"), class = 
"data.frame", row.names = c(NA,  3L))

I have an inefficient way of producing my results, which is to make the data into a long format, and then use dcast in reshape2. This inefficient method can be reproduced by:

data.long <- data %>%
mutate(amenities = strsplit(as.character(amenities), ",")) %>%
unnest(amenities)

data.long$amenities.value <- 1

data.wide <- reshape2::dcast(data.long, id ~ amenities, value.var = 
"amenities.value") #desired result

Is there a more efficient way to get the desired result from the original data structure?

2

2 Answers

2
votes

Here is an approach using the library splitstackshape:

library(splitstackshape) 
library(tidyverse)

cSplit(df,  "amenities", sep = ",", direction = "long") %>%
  mutate(value = 1) %>%
  spread(amenities, value) -> df.wide

all.equal(df.wide, data.wide)
#TRUE

As per @A5C1D2H2I1M1N2O1R2T1 a more dense and faster solution is

cSplit_e(data, "amenities", ",", mode = "binary", type = "character", drop = TRUE)
0
votes

Using tidyverse only

library(tidyverse)
data %>% 
  separate_rows(amenities, sep = ",") %>% 
  table() %>% 
  data.frame() %>% 
  spread(amenities,Freq)