
I have a survey data set like this:

df <- data.frame(
  employment = 0.45,
  income = 0.3,
  incomeFU1 = 0.4,
  married = 0.1,
  employmentFU1 = 0.7,
  employmentFU2 = 0.8,
  incomeFU2 = 0.8,
  smokingFU1 = 0.6,
  smokingFU3 = 0.1,
  ageFU3 = 0.9,
  marriedFU2 = 0.3
)

In this data set, individuals were asked about their employment status, income, etc. The data are aggregated: think of each value as the proportion of all people who are employed, the mean income, and so on. The data set therefore has only one row.

Individuals in this survey were asked at baseline and at 3 follow-ups. Baseline variables have no suffix; follow-up variables end in a suffix such as "FU1" for follow-up 1, and so on.
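
To make the naming convention concrete, here is a minimal base R sketch that splits a name into its stem and its follow-up number. The names `vars`, `stem`, `has_fu` and `wave` are illustrative only, and treating baseline as wave 0 is an assumption made for this example.

```r
# Illustrative names only; a small subset of the columns in df
vars <- c("employment", "incomeFU1", "marriedFU2", "ageFU3")

# Stem: the name with any trailing "FU<number>" removed
stem <- sub("FU[0-9]+$", "", vars)

# Wave: 0 for baseline (assumption), otherwise the number after "FU"
has_fu <- grepl("FU[0-9]+$", vars)
wave <- integer(length(vars))
wave[has_fu] <- as.integer(sub("^.*FU", "", vars[has_fu]))

print(data.frame(vars, stem, wave))
```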

I also have a list of these variable names:

l <- list()
l[[1]] <- c("employment", "income", "married")
l[[2]] <- c("employmentFU1", "incomeFU1", "smokingFU1")
l[[3]] <- c("employmentFU2", "incomeFU2", "marriedFU2")
l[[4]] <- c("smokingFU3", "ageFU3")

The first list item has the baseline variables, the second has the follow-up 1 variables, the third has the follow-up 2 variables, etc.

Note that some variables are available in two or three (sometimes even all) follow-ups, while some appear only once.

I now want to reshape this data frame based on the list variables to a matrix or data frame like this:

employment      income         married              NA          NA
employmentFU1   incomeFU1           NA      smokingFU1          NA
employmentFU2   incomeFU2   marriedFU2              NA          NA
           NA          NA           NA      smokingFU3      ageFU3

The number of rows in this matrix is the number of list elements, 4 in this case.

I tried something like this, but did not get very far:

m <- matrix()
m[1,1] <- df[, l[[1]][1]]
m[1,2] <- l[[2]][str_detect(l[[1]][1], l[[2]])]
Should smokingFU3 be in the fourth row (not third as in the example)? - storaged
@storaged you are right, sorry, I corrected that - spore234
I am just curious, does the solution below work for you? - storaged

1 Answer


This is how I would approach the problem using stringr. There is probably a more efficient way.

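library(stringr)

# Split each name into a stem (column 2) and a "FU<n>" suffix (column 3);
# column 1 holds the full match, i.e. the variable name itself
table <- str_match(unlist(l), "(.*?)($|FU[0-9]+?)")
table[table==""] <- "FU0" ## baseline names have an empty suffix, and "" is problematic as a row name

# Empty matrix with one row per suffix and one column per stem
m <- matrix(NA, length(unique(table[,3])), length(unique(table[,2])))
colnames(m) <- unique(table[,2])
rownames(m) <- unique(table[,3])

# Fill each cell with the variable name; <<- assigns into m outside the function
foo <- apply(table, 1, function(row) m[row[3],row[2]] <<- row[1])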

print(m)
#    employment      income      married      smoking      age
#FU0 "employment"    "income"    "married"    NA           NA
#FU1 "employmentFU1" "incomeFU1" NA           "smokingFU1" NA
#FU2 "employmentFU2" "incomeFU2" "marriedFU2" NA           NA
#FU3 NA              NA          NA           "smokingFU3" "ageFU3"
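
As a possible alternative, here is a minimal base R sketch that fills the matrix with a single matrix-index assignment instead of `apply` with `<<-`. It is not the approach above, only a variation under stated assumptions: the row is taken from the position of each name in `l` (matching the question's "number of rows is the number of list elements") rather than parsed from the suffix, and the names `vars`, `stem`, `wave`, `stems`, `m2`, `filled` and `vals` are introduced here for illustration. The last step, which looks up the values in `df`, assumes the values rather than the names are ultimately wanted.

```r
# Data from the question
df <- data.frame(
  employment = 0.45, income = 0.3, incomeFU1 = 0.4, married = 0.1,
  employmentFU1 = 0.7, employmentFU2 = 0.8, incomeFU2 = 0.8,
  smokingFU1 = 0.6, smokingFU3 = 0.1, ageFU3 = 0.9, marriedFU2 = 0.3
)
l <- list(
  c("employment", "income", "married"),
  c("employmentFU1", "incomeFU1", "smokingFU1"),
  c("employmentFU2", "incomeFU2", "marriedFU2"),
  c("smokingFU3", "ageFU3")
)

vars <- unlist(l)
stem <- sub("FU[0-9]+$", "", vars)      # strip a trailing "FU<number>"
wave <- rep(seq_along(l), lengths(l))   # row index = position of the list element
stems <- unique(stem)                   # column order = first appearance

# Matrix of variable names, filled via (row, column) index pairs
m2 <- matrix(NA_character_, nrow = length(l), ncol = length(stems),
             dimnames = list(NULL, stems))
m2[cbind(wave, match(stem, stems))] <- vars

# Same layout, but holding the values from the single row of df
vals <- matrix(NA_real_, nrow = nrow(m2), ncol = ncol(m2),
               dimnames = dimnames(m2))
filled <- !is.na(m2)
vals[filled] <- unlist(df[1, m2[filled]])

print(m2)
print(vals)
```

Because the suffix is matched with `FU[0-9]+$`, this sketch would also cope with two-digit follow-ups such as "FU10", should they ever occur.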