0
votes

So I have a column with values, which fall within certain ranges of years (see below). I have gotten the average for each range using aggregate(). But, when I try to apply this average number for each value, I get an error. For example, for each value that is within the 1900-1910 range, I want my average for that range to appear in that row, under my "Avg" column

What I can get:

Range      Avg
1900-1910  15.33
1911-1920   6.67
....
1941-1950  22.00 

Want:

Value Year   Range       Avg
12    1906   1900-1910   15.33
15    1909   1900-1910   15.33
7     1911   1911-1920    6.67
22    1950   1941-1950   22.00
4     1917   1911-1920    6.67
9     1917   1911-1920    6.67
19    1902   1900-1910   15.33

I am able to get the averages for each range, but I cannot figure out how to apply the Avg for the range to each specific value. The only thing I can think of is a bunch of nested ifelse() statements, but that seems too tedious. For example:

d$Avg<-ifelse(Range=="1900-1910",15.33,
       ifelse(Range=="1911-1920",6.67,
       ...etc))

Is there a way that I can speed this process up instead of using a bunch of nested ifelse statements?

2

2 Answers

2
votes

The solution is to think of the aggregated data as a lookup table and then use merge to get the desired data set.

So if the aggregated data is lookupdf, then we can use merge like this:

final_df <- merge(d, lookupdf, by=c("Range"))

Sample code to demonstrate this:

d <- data.frame(Year=rep(1900+c(1:20), 20),
           Value=runif(400, 1, 20))

d$Range <- ifelse(d$Year <= 1910, "1900-1910", "1911-1920")

library(dplyr)
# generate the aggregation; should be same as what you have above.
lookupdf <- d %>% group_by(Range) %>% summarise(Avg=mean(Value))

# base R version
final_df <- merge(d, lookupdf, by=c("Range"))

Output:

> head(final_df[final_df$Year %in% c(1910, 1911),])
   Year     Value     Range      Avg
10 1910 18.643543 1900-1910 11.17740
11 1911  1.142544 1911-1920 10.18118
30 1910 11.187802 1900-1910 11.17740
31 1911  9.887889 1911-1920 10.18118
50 1910  5.316916 1900-1910 11.17740
51 1911 15.365103 1911-1920 10.18118
1
votes

I know you explicitly expressed a desire to avoid nested ifelse statements, so forgive me for using one here. But in my defense, we have a solution that essentially reuses the same ifelse to create both the Range and Avg columns, using an apply function. We also get to take averages quickly with dplyr.

This solution also assumes you only have Value and Year to begin with, since I'm not really sure how you got to the place that you are, e.g., creating the Range column. So I started from scratch.

First write a function to define the Range column

library(dplyr)

get_range <- function(number){   #<-- takes in percentile
  ans <-
    if ( (number >= 1900) & ( number <= 1910) ) {
      "1900-1910"
    } else if ( (number > 1910 ) & ( number <= 1920) ) {
      "1911-1920"
    } else if ( (number > 1920 ) & ( number <= 1930) ) {
      "1921-1930"
    } else if ( (number > 1930 ) & ( number <= 1940) ) {
      "1931-1940"
    } else if ( (number > 1940 ) & ( number <= 1950) ) {
      "1941-1950"
    } else if ( (number > 1950 ) & ( number <= 1960) ) {
      "1951-1960"
    } else if ( (number > 1960 ) & ( number <= 1970) ) {
      "1961-1970"
    } else if ( (number > 1970 ) & ( number <= 1980) ) {
      "1971-1980"
    } else if ( (number > 1980 ) & ( number <= 1990) ) {
      "1981-1990"
    } else if ( (number > 1990 ) & ( number <= 2000) ) {
      "1991-2000"
      } else {
      "NA"
    }
  return(ans)
}

Then, apply it:

df$Range <- sapply(df$Year, function(x) get_range(x))

Next, use dplyr to get your aggregate values and store in a separate dataframe.

df_Avg <- df %>%
  group_by(Range) %>%
  summarise(Avg =  mean(Value))

Modify the same function above to take care of the averages

get_avg <- function(number){
  ans <-
    if ( (number >= 1900) & ( number <= 1910) ) {
      df_Avg$Avg[1]
    } else if ( (number > 1910 ) & ( number <= 1920) ) {
      df_Avg$Avg[2]
    } else if ( (number > 1920 ) & ( number <= 1930) ) {
      df_Avg$Avg[3]
    } else if ( (number > 1930 ) & ( number <= 1940) ) {
      df_Avg$Avg[4]
    } else if ( (number > 1940 ) & ( number <= 1950) ) {
      df_Avg$Avg[5]
    } else if ( (number > 1950 ) & ( number <= 1960) ) {
      df_Avg$Avg[6]
    } else if ( (number > 1960 ) & ( number <= 1970) ) {
      df_Avg$Avg[7]
    } else if ( (number > 1970 ) & ( number <= 1980) ) {
      df_Avg$Avg[8]
    } else if ( (number > 1980 ) & ( number <= 1990) ) {
      df_Avg$Avg[9]
    } else if ( (number > 1990 ) & ( number <= 2000) ) {
      df_Avg$Avg[1]
    } else {
      "NA"
    }
  return(ans)
}

...and finally, apply the function.

df$Avg <- sapply(df$Year, function(x) get_avg(x))

Should give you this:

> df
   Value Year     Range       Avg
1     12 1906 1900-1910 15.333333
2     15 1909 1900-1910 15.333333
3      7 1911 1911-1920  6.666667
4     22 1950 1941-1950 12.500000
5      4 1917 1911-1920  6.666667
6      9 1917 1911-1920  6.666667
7     19 1902 1900-1910 15.333333
8      1 1921 1921-1930  1.000000
9      2 1931 1931-1940  2.000000
10     3 1941 1941-1950 12.500000
11     4 1951 1951-1960  4.000000
12     5 1961 1961-1970  5.000000
13     6 1971 1971-1980  6.000000
14     7 1981 1981-1990  7.000000
15     8 1991 1991-2000 15.333333

I'm sure you could do this quicker and with less code, but in exchange, this is fairly readable and you could hand it off to your coworker with having to do tons of commenting.