I have a data frame with two columns and many rows.
The first column is a character vector where each element P is a string concatenating a number (K) of strings with a comma. K is unknown in advance and can vary across rows, such that K = 5 for the first row and K = 3 for the second. The values that are concatenated themselves may or may not be the same across rows, although they do not repeat within a row. We can call these "variable names."
The second column - we can call this "variable values" - is a character vector where each element is also a string concatenating K strings with commas. Importantly, the number of strings concatenated is identical to that of the variable names. Put another way, the variable names column contains a string containing the names of variables and the variable values column contains the values that correspond to the variable names for that row.
Here's a minimal example of my data. Note that the number of substrings in, e.g., var_names[i] equals the number of substrings in values[i], but need not equal the number in var_names[j]:
# Example data
data <-
  data.frame(
    var_names = c(
      paste("a", "b", "c", "e", "j", sep = ","),
      paste("d", "a", "f", sep = ","),
      paste("f", "k", "b", "a", sep = ",")
    ),
    values = c(
      paste("212", "12", "sfd", "3", "1", sep = ","),
      paste("fds", "23", "g", sep = ","),
      paste("df", "sdf", "w2", "w", sep = ",")
    ),
    stringsAsFactors = FALSE
  )
Given this data, I am trying to create a data frame where each of the unique values in var_names is a column name and the values for each column are based on the corresponding index in values for each row in the data. Specifically, I am looking to produce:
data.frame(a = c("212", "23", "w"),
           b = c("12", NA, "w2"),
           c = c("sfd", NA, NA),
           d = c(NA, "fds", NA),
           e = c("3", NA, NA),
           f = c(NA, "g", "df"),
           j = c("1", NA, NA),
           k = c(NA, NA, "sdf"))
I was able to produce what I wanted using the code below. However, I was wondering whether there might be some function/package that would let me skip some of these steps and accomplish this more quickly. Currently, I loop over the rows, generate an entire data frame for each row, and then combine them into a single data frame. My initial thought was to take the var_val object in my code and use tidyr::pivot_wider() to generate each row's data frame, but that did not work due to a spec error.
# Split variable names and values into a list
# where each element is a row's values/names
vars_name_l <- strsplit(data$var_names, split = ",")
values_l <- strsplit(data$values, split = ",")
# Initialize a list to store each row's
# data frame
combined <- list()
# Loop through each row's data and generate a
# list of data frames
for (i in seq_len(nrow(data))) {
  # Get a row's variable names and values into
  # a data frame.
  var_val <- data.frame(var_names = vars_name_l[[i]],
                        values = values_l[[i]],
                        stringsAsFactors = FALSE)
  # Create an empty data frame then add variable
  # names and the values for the variables, store in
  # our list
  df <- as.data.frame(matrix(numeric(), nrow = 0, ncol = length(var_val$var_names)))
  colnames(df) <- var_val$var_names
  df[1, ] <- var_val$values
  combined[[i]] <- df
}
# Collapse list to a single data frame, rearrange
# (bind_rows() comes from the dplyr package)
result <- dplyr::bind_rows(combined)
result[, order(colnames(result))]
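For comparison, here is a minimal base R sketch of the same reshaping that avoids building one data frame per row. It is only one possible approach, not necessarily the fastest or most idiomatic one. The names all_vars, wide, row_idx and col_idx are made up for illustration, and it assumes (as stated above) that names do not repeat within a row and that each row has the same number of names and values.

```r
# Example data (same as above, written out as literal strings)
data <- data.frame(
  var_names = c("a,b,c,e,j", "d,a,f", "f,k,b,a"),
  values = c("212,12,sfd,3,1", "fds,23,g", "df,sdf,w2,w"),
  stringsAsFactors = FALSE
)

# Split names and values into one character vector per row
names_l <- strsplit(data$var_names, split = ",", fixed = TRUE)
values_l <- strsplit(data$values, split = ",", fixed = TRUE)

# Assumption: every row has as many values as names
stopifnot(all(lengths(names_l) == lengths(values_l)))

# Sorted set of all variable names; these become the columns
all_vars <- sort(unique(unlist(names_l)))

# Start from an all-NA character matrix: one row per input row,
# one column per variable name
wide <- matrix(NA_character_,
               nrow = nrow(data),
               ncol = length(all_vars),
               dimnames = list(NULL, all_vars))

# Fill the matrix in one step using (row, column) index pairs
row_idx <- rep(seq_len(nrow(data)), times = lengths(names_l))
col_idx <- match(unlist(names_l), all_vars)
wide[cbind(row_idx, col_idx)] <- unlist(values_l)

result <- as.data.frame(wide, stringsAsFactors = FALSE)
result
```

The idea is that indexing a matrix with a two-column matrix of (row, column) pairs assigns all values at once, so no loop or per-row data frame is needed. Cells that never receive a value stay NA.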