I just noticed that all answers posted so far (including the accepted answer) do not exactly reproduce OP's expected result:
record color size height weight
1 blue large heavy
1 red
2 green small tall thin
which shows 3 rows although the input data has 4 rows.
If I understand correctly, the key-value-pairs for record 2 can be arranged in one row because there are no duplicate values for the same variable. For record 1, variable color
has two values which appear in rows 1 and 2, resp., as the OP has requested
The record should be repeated if there are multiple values for a
particular variable
All other variables of record 1 have only one value (or none) and are arranged in row 1.
So, for each record a sub-table with a ragged bottom is created where the columns are filled from top to bottom (separately for each column).
I have tried to reproduce this in two different ways: First with data.table
which I am more fluent with and then with dplyr
/tidyr
. Finally, I will propose an alternative presentation of duplicate values using toString()
.
data.table
library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
, V1 := str_remove_all(V1, '"')][
, tstrsplit(V1, " = "), by = .(rn, record)][
, dcast(.SD, record + rowid(record, V1) ~ fct_inorder(V1), value.var = "V2")][
, record_1 := NULL][]
record color size height weight
1: 1 blue large <NA> heavy
2: 1 red <NA> <NA> <NA>
3: 2 green small tall thin
This works in 5 steps:
- Split multiple key-value-pairs in each row and arrange in separate rows.
- Remove double quotes.
- Split key-value-pairs and arrange in separate columns.
- Reshape from long to wide format where the rows are given by
record
and by a
count of each individual key within record
using rowid()
and the columns are given by the keys (variables). Using fct_inorder()
ensures the columns are arranged in order of appearance of the variables (just to reproduce exactly OP's expected result).
- Drop the helper column from the final result.
To be even more consistent with OP's expected result, the NA
s can be turned into blanks by adding the parameter fill = ""
to the dcast()
call.
dplyr
/ tidyr
library(dplyr)
library(tidyr)
library(stringr)
df %>%
separate_rows(vars, sep = ", ") %>%
mutate(vars = str_remove_all(vars, '"')) %>%
separate(vars,c("key", "val")) %>%
group_by(record, key) %>%
mutate(keyid = row_number(key)) %>%
pivot_wider(id_cols = c(record, keyid), names_from = key, values_from = val) %>%
arrange(record, keyid) %>%
select(-keyid)
# A tibble: 3 x 5
# Groups: record [2]
record color size height weight
<int> <chr> <chr> <chr> <chr>
1 1 blue large NA heavy
2 1 red NA NA NA
3 2 green small tall thin
The steps are essentially the same as for the data.table
approach. The statements
group_by(record, key) %>%
mutate(keyid = row_number(key))
are a replacement for data.table::rowid()
.
Add the parameter values_fill = list(val = "")
to repalce the NA
s by blank.
Alternative representation
The following does not aim at reproducing OP'S expected result as close as possible but to propose an alternative, more concise representation of the result with one row per record
.
During reshaping, a function can be used to aggregate the data in each cell. The toString()
function concatenates character strings.
library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
, V1 := str_remove_all(V1, '"')][
, tstrsplit(V1, " = "), by = .(rn, record)][
, dcast(.SD, record ~ fct_inorder(V1), toString, value.var = "V2")]
record color size height weight
1: 1 blue, red large heavy
2: 2 green small tall thin
or
library(dplyr)
library(tidyr)
library(stringr)
df %>%
separate_rows(vars, sep = ", ") %>%
mutate(vars = str_remove_all(vars, '"')) %>%
separate(vars,c("key", "val")) %>%
pivot_wider(names_from = key, values_from = val, values_fn = list(val = toString))
# A tibble: 2 x 5
record color size height weight
<int> <chr> <chr> <chr> <chr>
1 1 blue, red large NA heavy
2 2 green small tall thin
weight=heavy
is not found in your structure – akrun