R Help! Calculate the proportion per subgroup

Question

I have the following dataset, called GrossExp3, covering the bilateral Exports (in 1000 USD) of 15 reporter Countries for all Years from (1998 – 2018) to all available partner countries It covers the following four variables: Year, ReporterName (= exporter) , PartnerName (= export destination), 'TradeValue in 1000 USD' ( = export value to the destination) The PartnerName column also includes an entry, called “All”, which is the total sum of all exports for each Year by reporter

Here is the head of my data

> head(GrossExp3, n = 20)
    Year ReporterName          PartnerName TradeValue in 1000 USD
 1: 2018       Angola          Afghanistan                 19.353
 2: 2018       Angola              Albania                  2.380
 3: 2018       Angola              Andorra                  0.326
 4: 2018       Angola United Arab Emirates             884725.078
 5: 2018       Angola            Argentina                 61.362
 6: 2018       Angola              Armenia                 60.105
 7: 2018       Angola       American Samoa                 12.007
 8: 2018       Angola  Antigua and Barbuda                422.006
 9: 2018       Angola            Australia              40220.092
10: 2018       Angola              Austria                433.699

Here is the summary of my data

> summary(GrossExp3)
      Year      ReporterName       PartnerName        TradeValue in 1000 USD
 Min.   :1998   Length:37398       Length:37398       Min.   :       0      
 1st Qu.:2004   Class :character   Class :character   1st Qu.:      39      
 Median :2009   Mode  :character   Mode  :character   Median :     596      
 Mean   :2009                                         Mean   :  135605      
 3rd Qu.:2014                                         3rd Qu.:   10209      
 Max.   :2019                                         Max.   :47471515

My goal is to filter the most important export destinations for each country by year by percentage of total exports, (all Percentage scores over 1%) and track how this changed over time. In particular, I would like to

add an additional column, called "Percentage", with the percentage of total exports by year (the sum of all entries in the 'TradeValue in 1000 USD' by Year)
drop all Percentage < 1
summarize the data for each ReporterName by Year
arrange by descending Percentage value

What I tried So far I tried it by filtering first for one ReporterName and one year

ONE_country <- GrossExp3 %>%
  group_by(Year, ReporterName) %>%
  filter(ReporterName == "Botswana", PartnerName != "All", Year == 2018) %>%
  arrange(desc(`TradeValue in 1000 USD`)) %>%
  summarize(Year, ReporterName, PartnerName, Percent = `TradeValue in 1000 USD`/sum(`TradeValue in 1000 USD`)*100)
head(ONE_country, n = 10)

I am not sure if the results I get here are right. In addition, I want the information for all countries and years to remain in the same dataset. Also I have not managed to drop all Percentage > 1, and would like the Percentage to be round without entries after the comma.

Another question is, why the summarize function does not return all columns, if I don't write them in the function?

As I have been stuck with these issues over the weekend, I would be very grateful for any recommendations on how to solve the issue!! All the best, Melike

semaphorism semaphorism · Accepted Answer · 2020-11-02T12:19:11

Without reproducible data this is difficult to answer, but this might help:

GrossExp3 %>%
  group_by(Year, ReporterName) %>%
  add_tally(wt = `TradeValue in 1000 USD`, name = "TotalValue") %>%
  mutate(Percentage = 100 * (`TradeValue in 1000 USD` / TotalValue)) %>%
  filter(Percentage >= 1) %>%
  arrange(ReporterName, Year, desc(Percentage))

We use add_tally() to calculate the total trade value by country by year and then use mutate() to calculate the percentage of this total for each row. We can then exclude rows with percentages < 1% and arrange by country, year and descending percentage.

Based on the very limited snippet you provided above, this is what is returned:

# A tibble: 2 x 6
# Groups:   Year, ReporterName [1]
   Year ReporterName PartnerName          `TradeValue in 1000 USD` TotalValue Percentage
  <dbl> <chr>        <chr>                                   <dbl>      <dbl>      <dbl>
1  2018 Angola       United Arab Emirates                  884725.    925956.      95.5 
2  2018 Angola       Australia                              40220.    925956.       4.34

R Help! Calculate the proportion per subgroup

1 Answers