3
votes

My question is similar to this question. I'm trying to tidyr::gather multiple columns. However, the solution provided in the link is less than ideal because the attributes are generally not identical across all columns and so they are dropped.

Note, I know how to do this with base R, but I'm trying to learn how to do the equivalent operation with tidyr and/or dplyr.

Below I've simulated some data (poorly, but quickly) that illustrate the situation I often find myself in (although I generally have far more columns that follow this same sort of a pattern). I've provided the base solution with stats::reshape so you can see my desired output.

Any help would be much appreciated.

set.seed(123)
male_g6 <- rbinom(100, 1, .5)
ell_g6 <- rbinom(100, 1, .1)
sped_g6 <- rbinom(100, 1, .15)
pullouts_g6 <- rbinom(100, 5, .1)
disability_g6 <- replicate(100, 
                sample(
                    c("asd", "cd", "ed", "hi", "id", "ohi", "ld", "none"),
                    1,
                    prob = c(rep(0.01, 6), 0.05, 0.89)
                    )
                 )
score_g6 <- rnorm(100, 200, 10)
score_g7 <- score_g6 + 5 + rnorm(100, 0, 2)
score_g8 <- score_g7 + 5 + rnorm(100, 0, 2)

d <- data.frame(
        SID = 1:100,
        male_g6 = male_g6,
        male_g7 = male_g6,
        male_g8 = male_g6,
        ell_g6 = ell_g6,
        ell_g7 = ell_g6,
        ell_g8 = ell_g6,
        sped_g6 = sped_g6,
        sped_g7 = sped_g6,
        sped_g8 = sped_g6,
        pullouts_g6 = pullouts_g6,
        pullouts_g7 = pullouts_g6,
        pullouts_g8 = pullouts_g6,
        disability_g6 = disability_g6,
        disability_g7 = disability_g6,
        disability_g8 = disability_g6,
        score_g6 = score_g6,
        score_g7 = score_g7,
        score_g8 = score_g8
    )

With base reshape

ld <- stats::reshape(d,
        idvar = "SID",
        varying = list(
            c("male_g6", "male_g7", "male_g8"),
            c("ell_g6", "ell_g7", "ell_g8"),
            c("sped_g6", "sped_g7", "sped_g8"),
            c("pullouts_g6", "pullouts_g7", "pullouts_g8"),
            c("disability_g6", "disability_g7", "disability_g8"),
            c("score_g6", "score_g7", "score_g8")
            ),
        v.names = c("male", "ell", "sped", "pullouts", "disability", "score"),
        times = 6:8,
        timevar = "Grade",
        direction = "long"
    )
ld <- ld[order(ld$SID), ]
1
As an aside, the base R reshape can be hugely simplified - reshape(d, idvar="SID", direction="long", varying=2:19, sep="_g", timevar="Grade") - reshape has some magical guessing of variable names and times if you specify an appropriate sep=thelatemail
Sure. I was mostly just trying to provide an example of my desired output. I tend to name them all out of habit because the variables are generally all over the place and don't always have consistent names. But that is indeed simpler, and I like the use of sep in your suggestion.Daniel Anderson

1 Answers

1
votes

You'll need to gather beyond what you want to end with so you can separate the grade level from the headers, after which you can spread back to wide form:

ld2 <- d %>% gather(var, val, -SID) %>%     # gather to long form
    # separate grade from variable names
    separate(var, c('var', 'grade'), sep = '_g', convert = TRUE) %>% 
    spread(var, val, convert = TRUE)    # spread back to wide

head(ld2)

##   SID grade disability ell male pullouts    score sped
## 1   1     6         cd   0    0        1 196.2440    0
## 2   1     7         cd   0    0        1 203.2739    0
## 3   1     8         cd   0    0        1 211.1347    0
## 4   2     6       none   0    1        0 194.3812    1
## 5   2     7       none   0    1        0 195.3957    1
## 6   2     8       none   0    1        0 202.4890    1