Maintain tapply indices through sapply

Question

Using tapply and sapply, i am trying sum the number of counts based on multiple (two) indices i give to tapply using sapply. The problem is the returned matrix loses the column name I give to tapply. I end up turning the matrix into a data.frame using melt() for input into ggplot and would have to add the variable names in a more manual fashion but i want them to just be retained through the two apply() functions. The metric/variable names are retained when i only use on index in tapply() so i am hung up on why they are lost with two indices.

    Fc_desc. <- rep(c(rep("Local",10),rep("Collector",10),rep("Arterial",10)),2)
Year. <- c(rep(seq(2000,2008,2),12))
df.. <- data.frame(Fc_desc = Fc_desc., Year = Year., Tot_ped_fatal_cnt = sample(length(Year.)),Tot_ped_inj_lvl_a_cnt = sample(length(Year.)))
#Define metrics(columns) of interest
Metrics. <- c("Tot_ped_fatal_cnt", "Tot_ped_inj_lvl_a_cnt")
#Summarize into long data frame
Ped_FcSv.. <- melt(sapply(Metrics., function(x){tapply(df..[,x],list(df..$Year, df..$Fc_desc), sum,na.rm=T)}),varnames = c("Fc_desc","Year","Injury_Severity"), value.name = "Count")

no need for tapply or sapply, try this: aggregate(.~Fc_desc + Year, data = df.., FUN = sum) — bouncyball
Hey bouncyball - but i need to specify columns in the data frame to sum? In the example i have only 2 columns other than my indixes (Fc_desc and Year) but in my actual data set i 149 columns that i am not interested in dealing with. I could subset of course but that seems messy. I also need to transpose the resulting data frame into a usable for easy ggplot() -ing. If i melt your result i get the Year column stacked on my Metrics columns. — Josh R.
How about you put things into long format prior to summarizing? Like df_long = reshape2::melt(df.., measure.vars = Metrics.). Then you can aggregate over just the singe variable you care about over the three grouping variables aggregate(value ~ Fc_desc + Year + variable, data = df_long, FUN = sum). — aosmith
@aosmith -This does work but i have to subset my data when i apply melt() since my actual data frame has many more columns of data that gum up that function. I added some code to the code section above utilizing your example with the necessary changes — Josh R.
You mean it slows things down too much or the column combinations don't have unique values? The code I gave above works with other columns... Maybe add an example more like your actual one, that has extra columns that throw a wrench into things. If the melt()/-aggregate() approach doesn't work I would probably switch to your favorite add-on package for this sort of data manipulation problem (I usually use dplyr) — aosmith

Josh R. Josh R. · Accepted Answer · 2018-08-01T18:03:12

The initial solution i had was using a loop and list"

Metrics. <- c("Tot_ped_fatal_cnt", "Tot_ped_inj_lvl_a_cnt")
TempList_ <- list()
for(metric in Metrics.){
    TempList_[[metric]] <- tapply(df..[,metric],list(df..$Year, df..$Fc_desc),      
       sum) 
}
TempList_YrSv <- melt(TempList_, varnames = c("Year","Fc_desc"), value.name = 
    "Count")
colnames(TempList_YrSv )[3] <- "Injury_Severity"

This uses 6 lines and takes 0.46 seconds on my 717,000 rows of actual data

I modified and applied Aosmith solution:

Cols. <- c(Metrics., "Year","Fc_desc")
#Transpose data to long form
df_long <- melt(df..[,Cols.], measure.vars = Metrics., variable.name = c("Injury_Severity"), value.name = "Count")
#Apply aggregate() to sum Count on 3 indices
Ped_YrSv.. <- aggregate(Count ~ Fc_desc + Year + Injury_Severity, data = df_long, FUN = sum,na.rm=T)

This solution takes 3.9 seconds but only 3 lines. Splitting hairs I realize but i am trying to be more elegant and get away from lists and loops so this is helpful. I suppose i can be happy with this. Thanks all.

Maintain tapply indices through sapply

1 Answers