dplyr: group_by and using the group value at summarising when creating a new variable

Question

I am working on a dataframe where I am using group_by and summarise to get some results using dplyr. However, one of the variables I intend to generate in summarise needs to access a second dataframe value based on the value of the grouping variable, and I cannot guess how to do that. Here's an example.

These are my 2 df:

ExampleData <- structure(list(country = structure(c(5L, 5L, 5L, 1L, 1L, 1L, 
                                                    4L, 4L, 4L, 2L, 2L, 2L), .Label = c("Bolivia", "Colombia", "Ecuador", 
                                                                                        "Peru", "Venezuela"), class = "factor"), area = c(21962759.1957539, 
                                                                                                                                          6116515271.82745, 4420526.44962988, 950155731.837125, 3284949253.71748, 
                                                                                                                                          13008533744.7177, 181171.153229255, 724458.059924146, 545485754.118267, 
                                                                                                                                          646585511.365563, 5586512056.6131, 4025165194.1968)), .Names = c("country", 
                                                                                                                                                                                                           "area"), row.names = c(0L, 1L, 2L, 87L, 88L, 89L, 117L, 118L, 
country.areas <- structure(list(country = c("Bolivia", "Colombia", "Ecuador", 
                                            "Peru", "Venezuela"), area = c(1090353, 1141962, 256932, 1296912, 
                                                                           916560.5)), .Names = c("country", "area"), row.names = c(NA, 
                                                                                                                                    5L), class = "data.frame")
> head(ExampleData)
     country        area
0  Venezuela    21962759
1  Venezuela  6116515272
2  Venezuela     4420526
87   Bolivia   950155732
88   Bolivia  3284949254
89   Bolivia 13008533745
> head(country.areas)
    country      area
1   Bolivia 1090353.0
2  Colombia 1141962.0
3   Ecuador  256932.0
4      Peru 1296912.0
5 Venezuela  916560.5

Now, I want to work with ExampleData, by group_by the country field and summarise to generate a variable PercOfCountry which is the sum areas for each country divided by the total area of the country, as taken from country.areas. I am trying with:

by.country <- ExampleData %>% 
  group_by(country) %>% 
  summarise(km2.country = sum(area)/1000000,
            PercOfCountry = km2.country/country.ares$area[country.areas$country == country])

where the last country (the very last word) wants to refer to area of the country which is being considered in the group_by as taken from the df country.areas (e.g.: 1090353.0 for Bolivia). The km2.country part works as expected... I just want to divide that value by the area of the country, so I get a percentage. Of course, I could do that on a next step quite easily... but I am trying to learn dplyr, and it is still difficult to understand to me which are the capabilities of the group_by function, which seems powerful.

Thanks!

@MrFlick, yes, thanks, that would definitely do it. It is one of the things I was thinking that I could do afterwords, when I have computed the table. However, I am trying to learn how to bring data from an outsider df and use it inside summarise, specifically when I need to use the value of the group... Maybe it is not possible to do that? — Javier Fajardo

Andrew Gustar Andrew Gustar · Accepted Answer · 2017-05-04T18:27:33

This should do it...

by.country <- ExampleData %>% group_by(country) %>% 
                      summarise(km2.country=sum(area)/1000000) %>% 
                      left_join(country.areas) %>% #note this brings in a new variable also called area
                      mutate(PercOfCountry=km2.country/area)

by.country
# A tibble: 2 × 4
    country km2.country      area PercOfCountry
      <chr>       <dbl>     <dbl>         <dbl>
1   Bolivia   17243.639 1090353.0    0.01581473
2 Venezuela    6142.899  916560.5    0.00670212

dplyr: group_by and using the group value at summarising when creating a new variable

1 Answers