Using tidyr or similar to make a tall data set wide, while collapsing multiple values into a vector

Question

I have a set of data that comes out of Matlab, and I want to use it in R. I have a set of subjects, and a set of conditions within each subject. In each condition, each subject produced some data. I wrote this into a "tall" table, like so:

    subject   condition   data
#1  id1       cond1       0.12
#2  id1       cond1       0.43
#3  id1       cond2       1.26
#4  id2       cond1       1.96
#5  id2       cond2       0.24
#6  id2       cond2       0.62
...

As you can see, the number of values differs between conditions within a subject, and also between subjects for the same condition. I'm interested in the distributions of these variables between subjects, so I was hoping to retain the raw values as a list column in a "wide" data frame, like this:

    subject   condition   data
#1  id1       cond1       c(0.12, 0.43)
#2  id1       cond2       c(1.26)
#3  id2       cond1       c(1.96)
#4  id2       cond2       c(0.24, 0.62)
...

What is the best way of doing this? I have used tidyr::spread() in the past, but it does not work here without a unique identifying variable per row, and even if I added one I don't see how it would work.

I also tried using dplyr::group_by(data, subject, condition), but I'm not sure how to proceed from there. Would it be possible to summarise the grouped table by using c() as summary function...? This hasn't worked for me.

As always, thanks for any help!

When you have the same number of columns, your data isn't changing shape from tall to wide - the width (number of columns) is constant! You're just aggregating - summarizing existing rows into a single row. — Gregor Thomas
I'm not really sure how the representation you want would be useful. The common idiom would be dplyr::group_by() then dplyr::summarize() to do whatever you want to do to look at "the distributions of these variables between subjects". Please tell us your end goal, rather than just an intermediate step that you think is necessary (but which really might just be overcomplicating a simple problem). — Gregor Thomas
Hi Gregor, thanks very much for the comments! I see what you mean - I did ask how to achieve what I wanted, rather than state my problem. The end goal is to literally compare the distributions of these variables: I want to plot a histogram of cond. 1 and 2 within one subject, fit a curve, and compare parameters of these curves. After within-subject comparisons I would then also move on to make between-subject comparisons by fitting data to group aggregates, but I figured it would be easiest to first get it all into a tidy data frame. — janfreyberg
PS: I should add that my posted example was massively oversimplified, this is about 50 values or so per subject! — janfreyberg

mtoto · Accepted Answer · 2016-04-12T16:10:58

You could use aggregate() to create a list column, data, whose elements are numeric vectors.

aggregate(data ~ subject + condition, FUN = list, data = df)
#  subject condition       data
#1     id1     cond1 0.12, 0.43
#2     id2     cond1       1.96
#3     id1     cond2       1.26
#4     id2     cond2 0.24, 0.62
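
Since the question also asks about dplyr::group_by(), here is a minimal sketch of the same idea in dplyr, not part of the original answer. It assumes the data frame is called df and holds the six rows shown in the question, and that dplyr 1.0.0 or later is installed (for the .groups argument). A likely reason c() did not work as the summary function is that summarise() traditionally expected one value per group; wrapping the values in list() gives it a single list element per group instead.

```r
library(dplyr)

# Example data copied from the six rows shown in the question
df <- data.frame(
  subject   = c("id1", "id1", "id1", "id2", "id2", "id2"),
  condition = c("cond1", "cond1", "cond2", "cond1", "cond2", "cond2"),
  data      = c(0.12, 0.43, 1.26, 1.96, 0.24, 0.62)
)

# list() wraps each group's values, so every output row holds one numeric vector
nested <- df %>%
  group_by(subject, condition) %>%
  summarise(data = list(data), .groups = "drop")

# Each element of the list column is a plain numeric vector
nested$data[[1]]

# Number of raw values kept per subject/condition combination
lengths(nested$data)
```

Individual vectors can then be pulled out with [[ ]] for plotting or fitting, for example hist(nested$data[[1]]).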
