I have a data-wrangling problem.

Each observation in my data records a number of exhibited "features", along with an outcome and the duration of time spent.

So, for example, we have this data.

participantID  observation  treatment  duration  outcome       feature.A  feature.B  feature.C  feature.D  feature.other
1              1            A          3.1       Successful    TRUE       FALSE      TRUE       FALSE      FALSE
1              2            A          2.1       Successful    TRUE       TRUE       FALSE      TRUE       FALSE
1              3            A          1.0       Unsuccessful  FALSE      FALSE      FALSE      FALSE      TRUE
1              4            A          5.1       Successful    TRUE       FALSE      TRUE       TRUE       FALSE
2              1            B          2.5       Unsuccessful  FALSE      FALSE      FALSE      FALSE      TRUE
2              2            B          3.1       Unsuccessful  TRUE       FALSE      TRUE       FALSE      FALSE
2              3            B          2.6       Successful    TRUE       TRUE       FALSE      FALSE      FALSE
2              4            B          2.8       Successful    TRUE       FALSE      TRUE       FALSE      FALSE
2              5            B          2.0       Unsuccessful  TRUE       TRUE       TRUE       TRUE       FALSE
2              6            B          3.6       Successful    TRUE       FALSE      TRUE       FALSE      FALSE
3              1            A          3.4       Successful    FALSE      FALSE      FALSE      FALSE      TRUE
3              2            A          3.5       Unsuccessful  TRUE       FALSE      TRUE       FALSE      FALSE

The idea is that we have participants who each made several attempts (observations). Each participant was in treatment group A or B. For each attempt, we recorded whether it was successful or unsuccessful, as well as its duration. We also recorded whether particular features were used. The features are not mutually exclusive, with the exception of feature.other, which is TRUE only when every other feature is FALSE.

I would like to be able to plot this data in ggplot with a command similar to the following:

ggplot(test, aes(x = observation, y = duration, fill = outcome)) +
  geom_bar(position = "dodge", stat = "identity") +
  facet_grid(participantID ~ feature.A + feature.B + feature.C + feature.D + feature.other)

This command prints a gridded plot of features and participants, which looks something like this:

[Image: a sample output from ggplot]

I'm not sure, however, that this is the right "way" to use the data. One thing I want to be able to do is to somehow collapse the different "feature" variables into one, so I can easily use it (for example, if I want to build stacked bar charts I can just add it as a "feature" column in ggplot).
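One way to picture the "collapse the features into one variable" idea is to build a single label per observation from whichever feature columns are TRUE. The following is only a minimal base-R sketch under stated assumptions: it uses a toy excerpt of the data (participant 1 only), and the column name `features` and the "+" separator are made up for illustration. Because it keeps one row per observation, durations are not duplicated.

```r
# Toy excerpt of the question's data (participant 1 only).
test <- data.frame(
  participantID = 1L,
  observation   = 1:4,
  duration      = c(3.1, 2.1, 1.0, 5.1),
  feature.A     = c(TRUE, TRUE, FALSE, TRUE),
  feature.B     = c(FALSE, TRUE, FALSE, FALSE),
  feature.C     = c(TRUE, FALSE, FALSE, TRUE),
  feature.D     = c(FALSE, TRUE, FALSE, TRUE),
  feature.other = c(FALSE, FALSE, TRUE, FALSE)
)

# Find the feature columns and strip the "feature." prefix for labelling.
feature_cols <- grep("^feature\\.", names(test), value = TRUE)
short_names  <- sub("^feature\\.", "", feature_cols)

# For each row, paste together the names of the features that are TRUE.
# "features" is a hypothetical column name, not part of the original data.
test$features <- apply(test[feature_cols], 1, function(flags) {
  paste(short_names[flags], collapse = "+")
})

print(test[c("observation", "duration", "features")])
```

The resulting column could then be mapped to an aesthetic such as `fill`. The trade-off is that every distinct combination becomes its own category, which may or may not be what a stacked bar chart should show.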

From what I understand, I should be melting it down to long format, right? But if I do that incorrectly, each observation is duplicated once per feature, because every feature value ends up in its own row. That causes problems when counting the observations (especially if I use cumsum or similar functions to add things up).

Attempt:

reshape(test, direction="long", varying=c("feature.A", "feature.B", "feature.C", "feature.D", "feature.other"))
# This gives me some odd output, including "time" as the old feature variable A/B/C/D/other

But then when I plot it gives me output that doesn't make sense (for example, all of the bars end up being the same height).

So my question is:

How can I format related binary factors (like the features) in such a way that it plays nicely with reshape and ggplot, without duplicating the variable output values?

I ask this because if I want to (later) plot a cumulative sum of the durations at each observation, using reshape gets in the way: for example, if I have a reshaped data frame:

ddply(test_long, .(participantID), summarize,cumsum(duration))

will give me incorrect results. I suppose one way to get around this is to simply do the calculation before the melt, but doing things in an order-dependent way is somewhat less than ideal.
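To make the "calculate before the melt" workaround concrete, here is a rough base-R sketch rather than a definitive recipe. It assumes a toy excerpt of the data with only three of the feature columns, and the column name `cum_duration` is invented for illustration. The cumulative sum is computed while there is still exactly one row per observation, so the later reshape only repeats an already-correct value.

```r
# Toy excerpt: participants 1 and 3, with three of the feature columns.
test <- data.frame(
  participantID = c(1L, 1L, 1L, 1L, 3L, 3L),
  observation   = c(1L, 2L, 3L, 4L, 1L, 2L),
  duration      = c(3.1, 2.1, 1.0, 5.1, 3.4, 3.5),
  feature.A     = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  feature.B     = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE),
  feature.other = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Cumulative duration per participant, computed on the wide data
# (one row per observation). "cum_duration" is a hypothetical name.
test <- test[order(test$participantID, test$observation), ]
test$cum_duration <- ave(test$duration, test$participantID, FUN = cumsum)

# Reshape afterwards: each observation now appears once per feature,
# but cum_duration is simply carried along unchanged.
test_long <- reshape(test, direction = "long",
  idvar   = c("participantID", "observation"),
  varying = c("feature.A", "feature.B", "feature.other"),
  sep     = ".")
rownames(test_long) <- NULL

print(test_long)
```

Running `cumsum` on `test_long` instead would add each duration once per feature row, which is the kind of inflated total described above.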

Maybe this also stems from my incomplete understanding of what exactly "long" format and "wide" format are and how they interact with ggplot. If someone can briefly point me to explanatory resources to further my understanding, I would appreciate it.

dput data below:

test <- structure(list(participantID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L), observation = c(1L, 2L, 3L, 4L, 1L, 2L, 
3L, 4L, 5L, 6L, 1L, 2L), treatment = structure(c(1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("A", "B"), class = "factor"), 
    duration = c(3.1, 2.1, 1, 5.1, 2.5, 3.1, 2.6, 2.8, 2, 3.6, 
    3.4, 3.5), outcome = structure(c(1L, 1L, 2L, 1L, 2L, 2L, 
    1L, 1L, 2L, 1L, 1L, 2L), .Label = c("Successful", "Unsuccessful"
    ), class = "factor"), feature.A = c(TRUE, TRUE, FALSE, TRUE, 
    FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE), feature.B = c(FALSE, 
    TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, 
    FALSE, FALSE), feature.C = c(TRUE, FALSE, FALSE, TRUE, FALSE, 
    TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE), feature.D = c(FALSE, 
    TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, 
    FALSE, FALSE), feature.other = c(FALSE, FALSE, TRUE, FALSE, 
    TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)), .Names = c("participantID", 
"observation", "treatment", "duration", "outcome", "feature.A", 
"feature.B", "feature.C", "feature.D", "feature.other"), class = "data.frame", row.names = c(NA, 
-12L))
Did you look at melt? - Andy Clifton
Yeah, if I melt, I get duplicated measures for the observations and it ends up making it really hard to do cumulative sums. - Irwin
I think you need to rephrase your question. You appear to have lots of different plots and goals in mind, and it's very hard to satisfy all of these with one magic command(). Unfortunately, this is a very standard data wrangling approach: all too often things are order-dependent. A script to wrap around your functions will be helpful to keep things working nicely. So - what's your immediate goal? - Andy Clifton
@AndyClifton I will ask a separate question as a breakdown. - Irwin

1 Answer

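# Reshape to long format: columns 1-5 identify each observation and
# columns 6-10 (feature.A ... feature.other) are stacked. With sep = ".",
# "feature.A" is split into a value column "feature" and a "time" value "A".
test_long <- reshape(test, direction = "long",
  idvar = 1:5, varying = 6:10, sep = "."
)

# Drop the long row names that reshape() generates
rownames(test_long) <- NULL

library(ggplot2)
# One facet row per participant, one facet column per feature ("time")
ggplot(test_long, aes(x = observation, y = duration, fill = outcome)) +
  geom_bar(position = "dodge", stat = "identity") +
  facet_grid(participantID ~ time)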
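
Note that the reshaped data contains one row per observation and feature regardless of whether the feature was TRUE or FALSE, so every facet column shows every observation. If the intent is for each facet to show only the observations in which that feature was used, one possible extra step (an assumption about the goal, not something stated in the answer) is to keep only the TRUE rows before plotting. A minimal sketch on a toy excerpt, where `test_used` is a made-up name:

```r
# Toy excerpt (participant 1 only); column names follow the question.
test <- data.frame(
  participantID = 1L,
  observation   = 1:4,
  duration      = c(3.1, 2.1, 1.0, 5.1),
  outcome       = c("Successful", "Successful", "Unsuccessful", "Successful"),
  feature.A     = c(TRUE, TRUE, FALSE, TRUE),
  feature.B     = c(FALSE, TRUE, FALSE, FALSE),
  feature.other = c(FALSE, FALSE, TRUE, FALSE)
)

# Same kind of reshape as above, but with columns selected by name.
test_long <- reshape(test, direction = "long",
  idvar   = c("participantID", "observation"),
  varying = grep("^feature\\.", names(test), value = TRUE),
  sep     = ".")
rownames(test_long) <- NULL

# Keep only the (observation, feature) pairs where the feature was used.
# "test_used" is a hypothetical name for the filtered data.
test_used <- test_long[test_long$feature, ]

print(test_used)
```

Passing `test_used` instead of `test_long` to the `ggplot()` call would then leave a facet empty wherever the feature was not used. An observation that used several features still appears in several facets, so per-observation totals should be computed before this step.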