0
votes

Using tapply and sapply, i am trying sum the number of counts based on multiple (two) indices i give to tapply using sapply. The problem is the returned matrix loses the column name I give to tapply. I end up turning the matrix into a data.frame using melt() for input into ggplot and would have to add the variable names in a more manual fashion but i want them to just be retained through the two apply() functions. The metric/variable names are retained when i only use on index in tapply() so i am hung up on why they are lost with two indices.

    Fc_desc. <- rep(c(rep("Local",10),rep("Collector",10),rep("Arterial",10)),2)
Year. <- c(rep(seq(2000,2008,2),12))
df.. <- data.frame(Fc_desc = Fc_desc., Year = Year., Tot_ped_fatal_cnt = sample(length(Year.)),Tot_ped_inj_lvl_a_cnt = sample(length(Year.)))
#Define metrics(columns) of interest
Metrics. <- c("Tot_ped_fatal_cnt", "Tot_ped_inj_lvl_a_cnt")
#Summarize into long data frame
Ped_FcSv.. <- melt(sapply(Metrics., function(x){tapply(df..[,x],list(df..$Year, df..$Fc_desc), sum,na.rm=T)}),varnames = c("Fc_desc","Year","Injury_Severity"), value.name = "Count")
1
no need for tapply or sapply, try this: aggregate(.~Fc_desc + Year, data = df.., FUN = sum)bouncyball
Hey bouncyball - but i need to specify columns in the data frame to sum? In the example i have only 2 columns other than my indixes (Fc_desc and Year) but in my actual data set i 149 columns that i am not interested in dealing with. I could subset of course but that seems messy. I also need to transpose the resulting data frame into a usable for easy ggplot() -ing. If i melt your result i get the Year column stacked on my Metrics columns.Josh R.
How about you put things into long format prior to summarizing? Like df_long = reshape2::melt(df.., measure.vars = Metrics.). Then you can aggregate over just the singe variable you care about over the three grouping variables aggregate(value ~ Fc_desc + Year + variable, data = df_long, FUN = sum).aosmith
@aosmith -This does work but i have to subset my data when i apply melt() since my actual data frame has many more columns of data that gum up that function. I added some code to the code section above utilizing your example with the necessary changesJosh R.
You mean it slows things down too much or the column combinations don't have unique values? The code I gave above works with other columns... Maybe add an example more like your actual one, that has extra columns that throw a wrench into things. If the melt()/-aggregate() approach doesn't work I would probably switch to your favorite add-on package for this sort of data manipulation problem (I usually use dplyr)aosmith

1 Answers

0
votes

The initial solution i had was using a loop and list"

Metrics. <- c("Tot_ped_fatal_cnt", "Tot_ped_inj_lvl_a_cnt")
TempList_ <- list()
for(metric in Metrics.){
    TempList_[[metric]] <- tapply(df..[,metric],list(df..$Year, df..$Fc_desc),      
       sum) 
}
TempList_YrSv <- melt(TempList_, varnames = c("Year","Fc_desc"), value.name = 
    "Count")
colnames(TempList_YrSv )[3] <- "Injury_Severity"

This uses 6 lines and takes 0.46 seconds on my 717,000 rows of actual data

I modified and applied Aosmith solution:

Cols. <- c(Metrics., "Year","Fc_desc")
#Transpose data to long form
df_long <- melt(df..[,Cols.], measure.vars = Metrics., variable.name = c("Injury_Severity"), value.name = "Count")
#Apply aggregate() to sum Count on 3 indices
Ped_YrSv.. <- aggregate(Count ~ Fc_desc + Year + Injury_Severity, data = df_long, FUN = sum,na.rm=T)

This solution takes 3.9 seconds but only 3 lines. Splitting hairs I realize but i am trying to be more elegant and get away from lists and loops so this is helpful. I suppose i can be happy with this. Thanks all.