0
votes

Is there a way in R to transfer values from multiple data frames to a second data frame if certain conditions are met? Specifically, I am concerned with the following problem:

I have values for different people in data frame 1 (columns: name, country in ISO-3, year1, year2), polity scores for all countries in data frame 2 (columns: country in ISO-3, year, score) and gdp for all countries in data frame 3 (columns: country in ISO-3, year, gdp)

I would now like to add the corresponding polity scores from data frame 2 and gdp value from data frame 3 to data frame 1 as a new columns depending on country and year for year1 and year 2. Can I automate this via R?

df1

name country year1 year2 polity_year1 gdp_year1 polity_year2 gdp_year2
Name 1 USA 1990 2018 polity score: year1 gdp: year1 polity score: year2 gdp: year2
Name 2 DNK 1995 1999 polity score: year1 gdp: year1 polity score: year2 gdp: year2

df2

country year score
USA 1990 10
USA 2018 10
DNK 1995 10
DNK 1999 10

df3

country year gdp
USA 1990 5325
USA 2018 7230
DNK 1995 2245
DNK 1999 3256

I tried dplyr join(),

df_joined <-  left_join(df1, df2, by = c("country" = "country", "year_1" = "year"), name="polity_year1") %>%

     left_join(df1, df2, by = c("country" = "country", "year_2" = "year"),name="polity_year2") %>%

     left_join(df1, df3, by = c("country" = "country", "year_1" = "year"), name="gdp_year1") %>%

     left_join(df1, df3, by = c("country" = "country", "year_2" = "year"), name="gdp_year2")

but that fails for several reasons

  • I can only run the function for one year of df1
  • I can't run the function using %>% for df2 (polity) and df3 (gdp) at the same time
  • I can't set the new column names in df1 depending on year in df1 (e.g. "polity_year1" or "gdp_year2")

Any help would be greatly appreciated.

1
You'll need to first do a pivot_longer() so that the years are in a separate row, then when you run the joins, they might work.Tech Commodities
Also, in your code, the first value after the pipe comes from the previous line, i.e. after the %>% the d1 in left_join(df1, df2, ... should be removed, otherwise R will see three values for a join, when only two are allowed.Tech Commodities

1 Answers

0
votes

The issue id that your df1 dataset isn't in a tidy data format, which means each row is one observation, each column is one variable. In contrast, in your df1 the variables year, gdp and polity_score are spread over two columns. Hence, to solve your task it's best to reshape your data first before joining. To this end I make use of pivot_longer and pivot_wider from the tidyr package:

library(dplyr)
library(tidyr)

df1_tidy <- df1 %>% 
  rename(year_year1 = year1, year_year2 = year2) %>% 
  mutate(across(starts_with("year"), as.character)) %>% 
  pivot_longer(-c(name, country), names_to = c("var", "year1"), names_pattern = "^(.*)_(.*)$") %>% 
  pivot_wider(names_from = var, values_from = value) %>% 
  select(-year1) %>% 
  mutate(year = as.integer(year))

df1_tidy
#> # A tibble: 4 × 5
#>   name   country  year polity              gdp       
#>   <chr>  <chr>   <int> <chr>               <chr>     
#> 1 Name 1 USA      1990 polity score: year1 gdp: year1
#> 2 Name 1 USA      2018 polity score: year2 gdp: year2
#> 3 Name 2 DNK      1995 polity score: year1 gdp: year1
#> 4 Name 2 DNK      1999 polity score: year2 gdp: year2
df1_tidy %>% 
  left_join(df2, by = c("country", "year"), suffix = c("", "_country")) %>% 
  left_join(df3, by = c("country", "year"), suffix = c("", "_country"))
#> # A tibble: 4 × 7
#>   name   country  year polity              gdp        score gdp_country
#>   <chr>  <chr>   <int> <chr>               <chr>      <int>       <int>
#> 1 Name 1 USA      1990 polity score: year1 gdp: year1    10        5325
#> 2 Name 1 USA      2018 polity score: year2 gdp: year2    10        7230
#> 3 Name 2 DNK      1995 polity score: year1 gdp: year1    10        2245
#> 4 Name 2 DNK      1999 polity score: year2 gdp: year2    10        3256