5
votes

I'm a Stata user transitioning to R, and there's one Stata crutch that I find hard to give up because I don't know how to do the equivalent with R's "apply" functions.

In Stata, I often generate a local macro list of stubnames and then loop over that list, calling on variables whose names are built off of those stubnames.

For a simple example, imagine that I have the following dataset:

study_id year varX06 varX07 varX08 varY06 varY07 varY08
   1       6   50     40     30     20.5  19.8   17.4
   1       7   50     40     30     20.5  19.8   17.4
   1       8   50     40     30     20.5  19.8   17.4
   2       6   60     55     44     25.1  25.2   25.3
   2       7   60     55     44     25.1  25.2   25.3
   2       8   60     55     44     25.1  25.2   25.3 
   and so on...

I want to generate two new variables, varX and varY that take on the values of varX06 and varY06 respectively when year is 6, varX07 and varY07 respectively when year is 7, and varX08 and varY08 respectively when year is 8.

The final dataset should look like this:

study_id year varX06 varX07 varX08 varY06 varY07 varY08 varX varY
   1       6   50     40     30     20.5  19.8   17.4    50  20.5
   1       7   50     40     30     20.5  19.8   17.4    40  19.8
   1       8   50     40     30     20.5  19.8   17.4    30  17.4 
   2       6   60     55     44     25.1  25.2   25.3    60  25.1
   2       7   60     55     44     25.1  25.2   25.3    55  25.2
   2       8   60     55     44     25.1  25.2   25.3    44  25.3 
   and so on...

To clarify, I know that I can do this with melt and reshape commands - essentially converting this data from wide to long format, but I don't want to resort to that. That's not the intent of my question.

My question is about how to loop over a local macro list of stubnames in R and I'm just using this simple example to illustrate a more generic dilemma.

In Stata, I could generate a local macro list of stubnames:

local stub varX varY

And then loop over the macro list. I can generate a new variable varX or varY and replace the new variable value with the value of varX06 or varY06 (respectively) if year is 6 and so on.

foreach i of local stub {
    display "`i'"  
    gen `i'=.      
    replace `i'=`i'06 if year==6  
    replace `i'=`i'07 if year==7
    replace `i'=`i'08 if year==8
}

The last section is the part that I find hardest to replicate in R. When I write `i'06, Stata takes the string "varX", concatenates it with the string "06", and then returns the value of the variable varX06. Additionally, when I write `i', Stata substitutes the string "varX" and not the literal string "`i'".

How do I do these things with R?

I've searched through Muenchen's "R for Stata Users", googled the web, and searched through previous posts here at StackOverflow but haven't been able to find an R solution.

I apologize if this question is elementary. If it's been answered before, please direct me to the response.

Thanks in advance,
Tara

4 Answers

2
votes

Well, here's one way. Columns in R data frames can be accessed using their character names, so this will work:

# create sample dataset
set.seed(1)    # for reproducible example
df <- data.frame(year=as.factor(rep(6:8,each=100)),   #categorical variable
                 varX06 = rnorm(300), varX07=rnorm(300), varX08=rnorm(100),
                 varY06 = rnorm(300), varY07=rnorm(300), varY08=rnorm(100))

# you start here...
years   <- unique(df$year)
df$varX <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varX0",yr)]))
df$varY <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varY0",yr)]))

print(head(df),digits=4)
#   year  varX06  varX07  varX08   varY06  varY07  varY08    varX     varY
# 1    6 -0.6265  0.8937 -0.3411 -0.70757  1.1350  0.3412 -0.6265 -0.70757
# 2    6  0.1836 -1.0473  1.5024  1.97157  1.1119  1.3162  0.1836  1.97157
# 3    6 -0.8356  1.9713  0.5283 -0.09000 -0.8708 -0.9598 -0.8356 -0.09000
# 4    6  1.5953 -0.3836  0.5422 -0.01402  0.2107 -1.2056  1.5953 -0.01402
# 5    6  0.3295  1.6541 -0.1367 -1.12346  0.0694  1.5676  0.3295 -1.12346
# 6    6 -0.8205  1.5122 -1.1367 -1.34413 -1.6626  0.2253 -0.8205 -1.34413

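For a given yr, the anonymous function extracts the rows with that yr and the column named "varX0" + yr (the result of paste0(...)). Then lapply(...) "applies" this function for each year, and unlist(...) converts the returned list into a vector. Note that this relies on the rows already being grouped by year, in the order given by unique(df$year), as they are in the sample data; otherwise the concatenated values will not line up with the rows.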
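
To make the steps concrete, here is a small sketch on the six-row dataset from the question. The data frame name d, the helper names col_names and pieces, and the explicit sort are my additions for illustration; the sort is there because unlist() concatenates the pieces one year after another.

```r
# Six-row dataset from the question (the name `d` is an assumption)
d <- read.table(header = TRUE, text = "study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3")

# Group the rows by year so the concatenated pieces line up with the rows
d <- d[order(d$year), ]

years     <- unique(d$year)            # one entry per distinct year
col_names <- paste0("varX0", years)    # column names built from the stub and the year

# One vector per year: the rows for that year, taken from that year's column
pieces <- lapply(years, function(yr) d[d$year == yr, paste0("varX0", yr)])

# Flatten the list of per-year vectors into a single column
d$varX <- unlist(pieces)
```

Printing col_names and pieces at the console is a quick way to see what paste0 and lapply each contribute.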

0
votes

This method reorders your data, but involves a one-liner, which may or may not be better for you (assume d is your dataframe):

> do.call(rbind, by(d, d$year, function(x) { within(x, { varX <- x[, paste0('varX0',x$year[1])]; varY <- x[, paste0('varY0',x$year[1])] }) } ))
    study_id year varX06 varX07 varX08 varY06 varY07 varY08 varY varX
6.1        1    6     50     40     30   20.5   19.8   17.4 20.5   50
6.4        2    6     60     55     44   25.1   25.2   25.3 25.1   60
7.2        1    7     50     40     30   20.5   19.8   17.4 19.8   40
7.5        2    7     60     55     44   25.1   25.2   25.3 25.2   55
8.3        1    8     50     40     30   20.5   19.8   17.4 17.4   30
8.6        2    8     60     55     44   25.1   25.2   25.3 25.3   44

Essentially, it splits the data based on year, then uses within to create the varX and varY variables within each subset, and then rbind's the subsets back together.
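
If the reordering is a problem, the same split-apply-combine idea can be sketched with split() and unsplit(), which puts the rows back in their original positions. This is only a sketch under my own assumptions: the vector stubs and the helper add_for_year are hypothetical names, and the column suffix is assumed to be the two-digit year.

```r
# Six-row dataset from the question
d <- read.table(header = TRUE, text = "study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3")

stubs <- c("varX", "varY")   # hypothetical list of stub names

# For one year's subset, copy <stub><yy> into <stub> for every stub
add_for_year <- function(x) {
  suffix <- sprintf("%02d", x$year[1])     # e.g. 6 becomes "06"
  for (s in stubs) {
    x[[s]] <- x[[paste0(s, suffix)]]
  }
  x
}

# Split by year, process each subset, then restore the original row order
d2 <- unsplit(lapply(split(d, d$year), add_for_year), d$year)
```

Looping over stubs inside the helper means a new stub only has to be added to the vector, not to the function body.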

A direct translation of your Stata code, however, would be something like the following:

# no loop over the years is needed here: each nested ifelse() already covers every year
d$varX <- ifelse(d$year == 6, d$varX06, ifelse(d$year == 7, d$varX07, ifelse(d$year == 8, d$varX08, NA)))
d$varY <- ifelse(d$year == 6, d$varY06, ifelse(d$year == 7, d$varY07, ifelse(d$year == 8, d$varY08, NA)))
0
votes

Here's another option.

Create a 'column selection matrix' based on year, then use that to grab the values you want from any block of columns.

# indexing matrix based on the 'year' column
col_select_mat <- 
    t(sapply(your_df$year, function(x) unique(your_df$year) == x))

# make selections from col groups by stub name
sapply(c('varX', 'varY'), 
    function(x) your_df[, grep(x, names(your_df))][col_select_mat])

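This gives the desired values, shown below. Note that indexing with a logical matrix returns them in column-major order, i.e. grouped by year rather than in the row order of your_df, so sort your_df by year first if you want to cbind the result to it.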

    varX varY
[1,]   50 20.5
[2,]   60 25.1
[3,]   40 19.8
[4,]   55 25.2
[5,]   30 17.4
[6,]   44 25.3
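
A variant that keeps the values in the original row order is to index each block of columns with explicit (row, column) pairs instead of a logical matrix. The sketch below is one way to do that; the names stubs, year_levels, row_idx, col_idx, picked and result are mine, and it assumes each stub has one column per year with a two-digit suffix.

```r
# Six-row dataset from the question
your_df <- read.table(header = TRUE, text = "study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3")

stubs       <- c("varX", "varY")
year_levels <- sort(unique(your_df$year))        # distinct years, ascending
row_idx     <- seq_len(nrow(your_df))            # 1, 2, ..., number of rows
col_idx     <- match(your_df$year, year_levels)  # which column of the block each row needs

picked <- sapply(stubs, function(s) {
  # Numeric matrix holding this stub's columns, ordered like year_levels
  block <- as.matrix(your_df[paste0(s, sprintf("%02d", year_levels))])
  # A two-column index matrix selects one cell per row, in row order
  block[cbind(row_idx, col_idx)]
})

# The picked values align with the rows, so they can be bound on directly
result <- cbind(your_df, picked)
```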

OP's dataset:

your_df <- read.table(header=T, text=
'study_id year varX06 varX07 varX08 varY06 varY07 varY08
   1       6   50     40     30     20.5  19.8   17.4
   1       7   50     40     30     20.5  19.8   17.4
   1       8   50     40     30     20.5  19.8   17.4
   2       6   60     55     44     25.1  25.2   25.3
   2       7   60     55     44     25.1  25.2   25.3
   2       8   60     55     44     25.1  25.2   25.3')

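Benchmarking: I wrapped the three posted solutions in functions and timed them as shown below. Note that the microbenchmark call passes the function names without parentheses, so the functions are never actually run; the nanosecond timings reflect only looking up each name and say nothing about which solution is faster. For a meaningful comparison, time the calls themselves: microbenchmark(arvi1000(), jlhoward(), Thomas()).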

df <- your_df
d <- your_df

arvi1000 <- function() {
  col_select_mat <- t(sapply(your_df$year, function(x) unique(your_df$year) == x))
  # make selections from col groups by stub name
  cbind(your_df, 
        sapply(c('varX', 'varY'), 
               function(x) your_df[, grep(x, names(your_df))][col_select_mat]))
}

jlhoward <- function() {
  years   <- unique(df$year)
  df$varX <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varX0",yr)]))
  df$varY <- unlist(lapply(years,function(yr)df[df$year==yr,paste0("varY0",yr)]))
}

Thomas <- function() {
  do.call(rbind, by(d, d$year, function(x) { within(x, { varX <- x[, paste0('varX0',x$year[1])]; varY <- x[, paste0('varY0',x$year[1])] }) } ))
}

> microbenchmark(arvi1000, jlhoward, Thomas)
Unit: nanoseconds
     expr min lq  mean median uq  max neval
 arvi1000  37 39 43.73     40 42  380   100
 jlhoward  38 40 46.35     41 42  377   100
   Thomas  37 40 56.99     41 42 1590   100
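
For a timing to mean anything, the expression being timed has to call the function (note the parentheses). Below is a minimal base-R sketch of that idea using system.time(), so it runs without extra packages. The two candidate functions and the helper time_calls are hypothetical stand-ins written for this example, not the solutions benchmarked above, and no timings are quoted because they vary by machine.

```r
# Six-row dataset from the question
your_df <- read.table(header = TRUE, text = "study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3")

# Two hypothetical candidates that compute the same varX values
with_ifelse <- function() {
  ifelse(your_df$year == 6, your_df$varX06,
         ifelse(your_df$year == 7, your_df$varX07, your_df$varX08))
}
with_matrix_index <- function() {
  block <- as.matrix(your_df[c("varX06", "varX07", "varX08")])
  block[cbind(seq_len(nrow(block)), match(your_df$year, 6:8))]
}

# Time repeated *calls* of f; f() with parentheses is what actually runs the code
time_calls <- function(f, reps = 500) {
  system.time(for (k in seq_len(reps)) f())[["elapsed"]]
}

timings <- c(with_ifelse       = time_calls(with_ifelse),
             with_matrix_index = time_calls(with_matrix_index))
```

On a six-row data frame any difference is likely to be noise; a realistic comparison would use a dataset closer to the real size.
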
0
votes

Maybe a more transparent way:

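stubs <- c("varX", "varY")   # named stubs rather than sub to avoid confusion with base::sub()
for (i in stubs) {
 df[[i]] <- NA   # like Stata's: gen `i' = .
 df[[i]] <- ifelse(df[["year"]] == 6, df[[paste0(i, "06")]], df[[i]])   # replace `i' = `i'06 if year == 6
 df[[i]] <- ifelse(df[["year"]] == 7, df[[paste0(i, "07")]], df[[i]])
 df[[i]] <- ifelse(df[["year"]] == 8, df[[paste0(i, "08")]], df[[i]])
}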
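
If the list of years should not be hard-coded, the same loop can be sketched with an inner loop over the years found in the data. This is a sketch under my own assumptions: the data frame is called d here so the example is self-contained, stubs is the list of stub names, and the column suffix is assumed to be the two-digit year.

```r
# Six-row dataset from the question (the name `d` is an assumption)
d <- read.table(header = TRUE, text = "study_id year varX06 varX07 varX08 varY06 varY07 varY08
1 6 50 40 30 20.5 19.8 17.4
1 7 50 40 30 20.5 19.8 17.4
1 8 50 40 30 20.5 19.8 17.4
2 6 60 55 44 25.1 25.2 25.3
2 7 60 55 44 25.1 25.2 25.3
2 8 60 55 44 25.1 25.2 25.3")

stubs <- c("varX", "varY")

for (s in stubs) {
  d[[s]] <- NA_real_                     # like Stata's: gen `i' = .
  for (yr in unique(d$year)) {
    rows <- d$year == yr                 # like Stata's: if year == yr
    src  <- sprintf("%s%02d", s, yr)     # source column name, e.g. "varX06"
    d[[s]][rows] <- d[[src]][rows]       # like Stata's: replace `i' = `i'06 if ...
  }
}
```

Here d[[src]] plays the role of Stata's `i'06: the column name is built as a string and then used to look up the column.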