separate a column into multiple variables with unique column names in R

2

votes

Here is how I want my dataframe to look:

record    color    size    height    weight
1         blue     large             heavy
1         red                        
2         green    small   tall      thin

However, the data (df) appears as follows:

record    vars
1         color = "blue", size = "large"
2         color = "green", size = "small"
2         height = "tall", weight = "thin"
1         color = "red", weight = "heavy"

The code for df

structure(list(record = c(1L, 2L, 2L, 1L), vars = structure(c(1L, 
                                                              2L, 4L, 
3L), .Label = c("color = \"blue\", size = \"large\"", 

"color = \"green\", size = \"small\"", "color = \"red\", weight = 
\"heavy\"", 

"height = \"tall\", weight = \"thin\""), class = "factor")), class = 
"data.frame", row.names = c(NA, 

-4L))

For each record, I would like to separate the vars column by the "," delimiter, and create a new column with the indicated variable name...The record should be repeated if there are multiple values for a particular variable

I know that to do this with tidyverse I will need to use dplyr::group_by and dplyr::separate, however I'm not clear how to incorporate the new variable names in the "into" parameter for separate. Do I need some type of regular expression to identify any text prior to an equal sign "=" as the new variable name in "into"?? Any suggestions much welcome!

df %>%
  group_by(record) %>%
  separate(col = vars, into = c(regex expression?? / character vector?), sep = ",")

r regexdplyr

weight=heavy is not found in your structure – akrun

How come the output you want doesn't have weight = heavy for the second record 1? – camille

It was a typo, very sorry, I have changed this – mdb_ftl

6

votes

Since the columns are already almost written as R code defining a list, you could parse/eval them and then unnest_wider

library(tidyverse)

df %>% 
  mutate(vars = map(vars, ~ eval(parse_expr(paste('list(', .x, ')'))))) %>% 
  unnest_wider(vars)

# record color size  height weight
#    <int> <chr> <chr> <chr>  <chr> 
# 1      1 blue  large NA     NA    
# 2      2 green small NA     NA    
# 3      2 NA    NA    tall   thin

2

votes

Here is one option with tidyverse. Create a sequence column 'rn', then separate_rows of the 'vars' column based on the ,, remove the quotes with str_remove_all, separate the column into two, and reshape from 'long' to 'wide' with pivot_wider

library(dplyr)
library(tidyr)
library(stringr)

df %>%
    mutate(rn = row_number()) %>%
    separate_rows(vars, sep=",\\s*\\n*") %>%
    mutate(vars = str_remove_all(vars, '"')) %>%
    separate(vars, into = c("vars1", "vars2"), sep="\\s*=\\s*") %>% 
    pivot_wider(names_from = vars1, values_from = vars2, 
         values_fill = list(vars2 = '')) %>%
    select(-rn)
# A tibble: 3 x 5
#  record color size  height weight
#   <int> <chr> <chr> <chr>  <chr> 
#1      1 blue  large ""     ""    
#2      2 green small ""     ""    
#3      2 ""    ""    tall   thin

0

votes

Another way is to convert to 2-column-matrices and merge. We'll need a helper FUNction that converts a vector into a matrix with first row as the header.

FUN <- function(x) {m <- matrix(x, 2);as.data.frame(rbind(`colnames<-`(m, m[1, ])[-1, ]))}

Then just get rid of non-character stuff and merge.

l <- lapply(strsplit(trimws(gsub("\\W+", " ", as.character(dat$vars))), " "), FUN)       
l <- Map(`[<-`, l, 1, "record", dat$record)     # cbind record column
Reduce(function(...) merge(..., all=TRUE), l)   # merge
#   record color weight  size height
# 1      1  blue   <NA> large   <NA>
# 2      1   red  heavy  <NA>   <NA>
# 3      2 green   thin small   tall

0

votes

I just noticed that all answers posted so far (including the accepted answer) do not exactly reproduce OP's expected result:

record    color    size    height    weight
1         blue     large             heavy
1         red                        
2         green    small   tall      thin

which shows 3 rows although the input data has 4 rows.

If I understand correctly, the key-value-pairs for record 2 can be arranged in one row because there are no duplicate values for the same variable. For record 1, variable color has two values which appear in rows 1 and 2, resp., as the OP has requested

The record should be repeated if there are multiple values for a particular variable

All other variables of record 1 have only one value (or none) and are arranged in row 1.

So, for each record a sub-table with a ragged bottom is created where the columns are filled from top to bottom (separately for each column).

I have tried to reproduce this in two different ways: First with data.table which I am more fluent with and then with dplyr/tidyr. Finally, I will propose an alternative presentation of duplicate values using toString().

`data.table`

library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
  , V1 := str_remove_all(V1, '"')][
    , tstrsplit(V1, " = "), by = .(rn, record)][
      , dcast(.SD, record + rowid(record, V1) ~ fct_inorder(V1), value.var = "V2")][
        , record_1 := NULL][]

   record color  size height weight
1:      1  blue large   <NA>  heavy
2:      1   red  <NA>   <NA>   <NA>
3:      2 green small   tall   thin

This works in 5 steps:

Split multiple key-value-pairs in each row and arrange in separate rows.
Remove double quotes.
Split key-value-pairs and arrange in separate columns.
Reshape from long to wide format where the rows are given by record and by a count of each individual key within record using rowid() and the columns are given by the keys (variables). Using fct_inorder() ensures the columns are arranged in order of appearance of the variables (just to reproduce exactly OP's expected result).
Drop the helper column from the final result.

To be even more consistent with OP's expected result, the NAs can be turned into blanks by adding the parameter fill = "" to the dcast() call.

`dplyr` / `tidyr`

library(dplyr)
library(tidyr)
library(stringr)
df %>% 
  separate_rows(vars, sep = ", ") %>% 
  mutate(vars = str_remove_all(vars, '"')) %>% 
  separate(vars,c("key", "val")) %>% 
  group_by(record, key) %>% 
  mutate(keyid = row_number(key)) %>% 
  pivot_wider(id_cols = c(record, keyid), names_from = key, values_from = val) %>% 
  arrange(record, keyid) %>% 
  select(-keyid)

# A tibble: 3 x 5
# Groups:   record [2]
  record color size  height weight
   <int> <chr> <chr> <chr>  <chr> 
1      1 blue  large NA     heavy 
2      1 red   NA    NA     NA    
3      2 green small tall   thin

The steps are essentially the same as for the data.table approach. The statements

  group_by(record, key) %>% 
  mutate(keyid = row_number(key))

are a replacement for data.table::rowid().

Add the parameter values_fill = list(val = "") to repalce the NAs by blank.

Alternative representation

The following does not aim at reproducing OP'S expected result as close as possible but to propose an alternative, more concise representation of the result with one row per record.

During reshaping, a function can be used to aggregate the data in each cell. The toString() function concatenates character strings.

library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
  , V1 := str_remove_all(V1, '"')][
    , tstrsplit(V1, " = "), by = .(rn, record)][
      , dcast(.SD, record ~ fct_inorder(V1), toString, value.var = "V2")]

   record     color  size height weight
1:      1 blue, red large         heavy
2:      2     green small   tall   thin

or

library(dplyr)
library(tidyr)
library(stringr)
df %>% 
  separate_rows(vars, sep = ", ") %>% 
  mutate(vars = str_remove_all(vars, '"')) %>% 
  separate(vars,c("key", "val")) %>% 
  pivot_wider(names_from = key, values_from = val, values_fn = list(val = toString))

# A tibble: 2 x 5
  record color     size  height weight
   <int> <chr>     <chr> <chr>  <chr> 
1      1 blue, red large NA     heavy 
2      2 green     small tall   thin

separate a column into multiple variables with unique column names in R

4 Answers

data.table

dplyr / tidyr

Alternative representation

`data.table`

`dplyr` / `tidyr`