I have a data-wrangling problem.
Each observation in my data exhibits a number of "features" and results in an outcome, along with the duration of time spent.
So, for example, we have this data.
participantID observation treatment duration outcome      feature.A feature.B feature.C feature.D feature.other
1             1           A         3.1      Successful   TRUE      FALSE     TRUE      FALSE     FALSE
1             2           A         2.1      Successful   TRUE      TRUE      FALSE     TRUE      FALSE
1             3           A         1.0      Unsuccessful FALSE     FALSE     FALSE     FALSE     TRUE
1             4           A         5.1      Successful   TRUE      FALSE     TRUE      TRUE      FALSE
2             1           B         2.5      Unsuccessful FALSE     FALSE     FALSE     FALSE     TRUE
2             2           B         3.1      Unsuccessful TRUE      FALSE     TRUE      FALSE     FALSE
2             3           B         2.6      Successful   TRUE      TRUE      FALSE     FALSE     FALSE
2             4           B         2.8      Successful   TRUE      FALSE     TRUE      FALSE     FALSE
2             5           B         2.0      Unsuccessful TRUE      TRUE      TRUE      TRUE      FALSE
2             6           B         3.6      Successful   TRUE      FALSE     TRUE      FALSE     FALSE
3             1           A         3.4      Successful   FALSE     FALSE     FALSE     FALSE     TRUE
3             2           A         3.5      Unsuccessful TRUE      FALSE     TRUE      FALSE     FALSE
The idea is that participants have made several attempts (observations). Each participant was in treatment group A or B. For each attempt, we recorded whether it was successful or unsuccessful, as well as its duration. We also identified which features were used; the features are not mutually exclusive, with the exception of feature.other, which is TRUE only when every other feature is FALSE.
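As a sanity check on the feature.other constraint, here is a minimal base R sketch. It uses a three-row toy subset of the data; the names toy, named_cols and none_named are made up for illustration.

```r
# Toy subset of the data (first three observations of participant 1).
toy <- data.frame(
  participantID = c(1L, 1L, 1L),
  observation   = 1:3,
  feature.A     = c(TRUE, TRUE, FALSE),
  feature.B     = c(FALSE, TRUE, FALSE),
  feature.C     = c(TRUE, FALSE, FALSE),
  feature.D     = c(FALSE, TRUE, FALSE),
  feature.other = c(FALSE, FALSE, TRUE)
)

# The "named" features, i.e. everything except feature.other.
named_cols <- c("feature.A", "feature.B", "feature.C", "feature.D")

# TRUE for rows where none of the named features was used.
none_named <- rowSums(toy[, named_cols]) == 0

# feature.other should be TRUE exactly on those rows; this stops if not.
stopifnot(all(toy$feature.other == none_named))
```

If the check fails on the full data, the offending rows can be listed with toy[toy$feature.other != none_named, ].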
I would like to be able to plot this data in ggplot with a command similar to the following:
ggplot(test, aes(x=observation, y=duration, fill=outcome)) + geom_bar(position="dodge", stat="identity") + facet_grid(participantID ~ feature.A + feature.B + feature.C + feature.D + feature.other)
This command produces a grid of plots, with one row per participant and one column per combination of feature values.
I'm not sure, however, that this is the right "way" to use the data. One thing I want to be able to do is to somehow collapse the different "feature" variables into one, so I can easily use it (for example, if I want to build stacked bar charts I can just add it as a "feature" column in ggplot).
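To make the idea of a single collapsed column concrete, here is a rough base R sketch of one possible encoding: a label per observation that lists the features used. It runs on a toy subset, and the column name feature and the "+" separator are arbitrary choices for illustration, not something ggplot requires.

```r
# Toy subset of the data (first three observations of participant 1).
toy <- data.frame(
  participantID = c(1L, 1L, 1L),
  observation   = 1:3,
  duration      = c(3.1, 2.1, 1.0),
  feature.A     = c(TRUE, TRUE, FALSE),
  feature.B     = c(FALSE, TRUE, FALSE),
  feature.C     = c(TRUE, FALSE, FALSE),
  feature.D     = c(FALSE, TRUE, FALSE),
  feature.other = c(FALSE, FALSE, TRUE)
)

feature_cols <- grep("^feature\\.", names(toy), value = TRUE)
short_names  <- sub("^feature\\.", "", feature_cols)  # "A", "B", ..., "other"

# For each row, paste together the short names of the features that are TRUE.
toy$feature <- apply(
  toy[, feature_cols], 1,
  function(used) paste(short_names[used], collapse = "+")
)

toy[, c("participantID", "observation", "duration", "feature")]
```

This keeps one row per observation, so counts and sums of duration are unaffected, but each distinct combination of features becomes its own category rather than each feature being counted separately.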
From what I understand, I should be melting it down to long format, right? But if I do that incorrectly, the feature variables end up duplicating the number of observations, because each "TRUE" ends up appearing as an ID, and then there are problems counting the observations (especially if I use cumsum and similar functions to do additions).
Attempt: reshape(test, direction="long", varying=c("feature.A", "feature.B", "feature.C", "feature.D", "feature.other")) # This gives me some odd output, including "time" as the old feature variable A/B/C/D/other
But then when I plot it gives me output that doesn't make sense (for example, all of the bars end up being the same height).
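The duplication can be reproduced on a toy subset with base R's reshape(). This is only a sketch: the v.names, timevar, times and idvar arguments are my guess at a sensible call, and the names used, feature, used_only and per_obs are made up for illustration.

```r
# Toy subset: three observations, three of the feature columns.
toy <- data.frame(
  participantID = c(1L, 1L, 1L),
  observation   = 1:3,
  duration      = c(3.1, 2.1, 1.0),
  feature.A     = c(TRUE, TRUE, FALSE),
  feature.B     = c(FALSE, TRUE, FALSE),
  feature.other = c(FALSE, FALSE, TRUE)
)
feature_cols <- c("feature.A", "feature.B", "feature.other")

# Long format: one row per (observation, feature) pair.
toy_long <- reshape(
  toy,
  direction = "long",
  varying   = feature_cols,
  v.names   = "used",                                # TRUE/FALSE flag
  timevar   = "feature",                             # which feature the row is about
  times     = sub("^feature\\.", "", feature_cols),
  idvar     = c("participantID", "observation")
)

nrow(toy)                  # 3 observations
nrow(toy_long)             # 9 rows: every observation repeated once per feature
sum(toy_long$duration)     # inflated: each duration is counted three times

# Keeping only the features actually used still repeats observation 2.
used_only <- toy_long[toy_long$used, ]

# Anything computed on duration has to go back to one row per observation.
per_obs <- unique(toy_long[, c("participantID", "observation", "duration")])
sum(per_obs$duration)      # matches sum(toy$duration)
```

So the long table is fine for counting feature usage, but duration has to be summarized per (participantID, observation) key rather than per long-format row.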
So my question is:
How can I format related binary factors (like the features) in such a way that they play nicely with reshape and ggplot, without duplicating the variable output values?
I ask this because if I want to (later) plot a cumulative sum of the durations at each observation, using reshape gets in the way: for example, if I have a reshaped data frame:
ddply(test_long, .(participantID), summarize,cumsum(duration))
will give me incorrect results, because each duration is repeated once per feature in the long data. I suppose one way to get around this is to simply do the calculation before the melt, but doing things in an order-dependent way is somewhat less than ideal.
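For reference, the "calculate before the melt" workaround might look like the sketch below. It uses base R's ave() instead of ddply, runs on a toy subset, and assumes the rows are already sorted by observation within each participant; the column name cum_duration is made up.

```r
# Toy subset in wide format: one row per observation.
toy <- data.frame(
  participantID = c(1L, 1L, 1L, 2L, 2L),
  observation   = c(1L, 2L, 3L, 1L, 2L),
  duration      = c(3.1, 2.1, 1.0, 2.5, 3.1)
)

# Running total of duration within each participant, computed while
# there is still exactly one row per observation.
toy$cum_duration <- ave(toy$duration, toy$participantID, FUN = cumsum)

toy
```

The new column can then be treated as an ordinary ID variable when reshaping, so it is repeated alongside duration rather than recomputed on the duplicated rows.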
Maybe this also stems from a gap in my understanding of what exactly "long" and "wide" formats are and how they interact with ggplot. If someone can briefly point me to explanatory resources to further my understanding, I would appreciate it.
dput data below:
test <- structure(list(participantID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L), observation = c(1L, 2L, 3L, 4L, 1L, 2L,
3L, 4L, 5L, 6L, 1L, 2L), treatment = structure(c(1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("A", "B"), class = "factor"),
duration = c(3.1, 2.1, 1, 5.1, 2.5, 3.1, 2.6, 2.8, 2, 3.6,
3.4, 3.5), outcome = structure(c(1L, 1L, 2L, 1L, 2L, 2L,
1L, 1L, 2L, 1L, 1L, 2L), .Label = c("Successful", "Unsuccessful"
), class = "factor"), feature.A = c(TRUE, TRUE, FALSE, TRUE,
FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE), feature.B = c(FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE,
FALSE, FALSE), feature.C = c(TRUE, FALSE, FALSE, TRUE, FALSE,
TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE), feature.D = c(FALSE,
TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE), feature.other = c(FALSE, FALSE, TRUE, FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)), .Names = c("participantID",
"observation", "treatment", "duration", "outcome", "feature.A",
"feature.B", "feature.C", "feature.D", "feature.other"), class = "data.frame", row.names = c(NA,
-12L))
melt? – Andy Clifton
command(). Unfortunately, this is a very standard data wrangling approach: all too often things are order-dependent. A script to wrap around your functions will be helpful to keep things working nicely. So, what's your immediate goal? – Andy Clifton