1
votes

I have a dataset filled with data for a bunch of countries, by year. I need to create data for certain regions (like Belgium-Luxembourg) by adding or otherwise manipulating data for a set of other countries (Belgium and Luxembourg, in this case) and filling the resulting value into the corresponding year for the region.

So for example, say I have data for Belgium (BEL) and Luxembourg (LUX) for the year 2001-2010. I need to be able to say, add, BEL-2001 and LUX-2001 to create BLX-2001. The data set has the columns iso (country code - BEL, LUX, BLX etc), year and VARIABLE. It has all the required rows for BEL, LUX and BLX already (although BLX is empty until we fill it, of course).

A sample DATAFRAME would be:

    iso year    colname
    BEL 1990    NA
    BEL 1991    10
    BEL 1992    20
    BEL 1993    30
    BEL 1994    10
a few rows of other countries we don't care for in this case
    LUX 1990    5
    LUX 1991    3
    LUX 1992    NA
    LUX 1993    7
    LUX 1994    3
a few rows of other countries we don't care for in this case
    BLX 1990    NA
    BLX 1991    NA
    BLX 1992    NA
    BLX 1993    NA
    BLX 1994    NA

In the above case, we would fill BLX data (adding BEL and LUX values) for 1991, 1992 and 1994 only - since those are the only years where both BEL and LUX have the required data. This would give us:

    iso year    colname
    BEL 1990    NA
    BEL 1991    10
    BEL 1992    20
    BEL 1993    30
    BEL 1994    10
a few rows of other countries we don't care for in this case
    LUX 1990    5
    LUX 1991    3
    LUX 1992    NA
    LUX 1993    7
    LUX 1994    3
a few rows of other countries we don't care for in this case
    BLX 1990    NA
    BLX 1991    13
    BLX 1992    NA
    BLX 1993    37
    BLX 1994    13


Currently I'm achieving this, using dplyr, by this function that takes the column name and simply adds the values for every available year. This is the simplest example, more complicated manipulation look even messier:

BLXCalc <- function(colname){

  LUXData <- filter(DATAFRAME, iso == "LUX" & !is.na(get(colname, envir=as.environment(DATAFRAME)))) # get only those LUX and BEL rows that have the reqd data

  BELData <- filter(DATAFRAME, iso == "BEL" & !is.na(get(colname, envir=as.environment(DATAFRAME))))

  BLXrange <- grep("BLX", DATAFRAME$iso) # get all BLX rows

  ifelse(length(LUXData$year)<length(BELData$year), BLXyears <- LUXData$year, BLXyears <- BELData$year) # use the shorter list for the for loop

  for(i in 1:length(BLXyears)){

    BLXcurrentyear <- filter(DATAFRAME, iso == "LUX" & year == BLXyears[i])[[colname]] + filter(DATAFRAME, iso == "BEL" & year == BLXyears[i])[[colname]]

    BLXrow <- match("BLX", DATAFRAME$iso) + match(BLXyears[i], DATAFRAME$year[BLXrange[1]:BLXrange[length(BLXrange)]]) - 1 # find the corresponding year in BLX

    DATAFRAME[[colname]][BLXrow] <<- BLXcurrentyear
  }
}

Even for such a simple operation (addition), this is messy code and not very easy to read. A basic breakdown of what I'm doing is:

  1. Get all years from the required countries that have data for the required variable/column
  2. Find the country with the shortest number of available years (since we need every required country to have data for a given year in order to calculate region data for that same year) Now we loop over the years that data is available for this country:
  3. For each year that this country has data available, get the value from the given column, for the required countries, for that year.
  4. If all other countries have data available for that year, sum it (or other operation - like averages/weighted averages etc)
  5. Fill this sum into the region's row for this year, in the same column

Steps 3, 4, 5 loop over the available years until we're done.

This works just fine for the data we're working with but I know for loops aren't the best way to use R. Is there any other more "R" way to achieve this same functionality? One that would, perhaps, be faster with larger datasets and preferably overall easier to read.

2
In my understanding you just need to merge/add Belgium and Luxembourg? - Also please use dput to make your examples reproduciblesSotos
Basically, I just need to add the values where available, yes. Only difference between merge is that instead of just filling in blank values, I need to actually perform an operation on two sets of data and then merge the result in. I'm not familiar with dput, how can I make it more reproducible?jackson5
I posted a solution. Give it a look and If it works for you I ll add some explanation. Otherwise I will delete.Sotos
@jackson5: run ?dput to see what it's about.AkselA

2 Answers

2
votes

Here is a possible solution. You first split on the country and create a list with each country as a different element. Using Reduce, you can merge as many elements as you like (function(...)) which you specify by name. Finally, you use rowSums without removing NAs to add the required variables. If you refer to the function (fill_countries), you can then assign the results to the interested subset of the data (again specifying the subset by name).

l1 <- split(df, df$iso)
d1 <- Reduce(function(...)merge(..., by = 'year'), l1[names(l1) %in% c('BEL', 'LUX')])
rowSums(d1[grepl('colname', names(d1))])
#[1] NA 13 NA 37 13

You can also turn it to a function,

fill_countries <- function(df, country_to_fill, countries_to_use){
  l1 <- split(df, df$iso)
  d1 <- Reduce(function(...)merge(..., by = 'year'), l1[names(l1) %in% countries_to_use])
  df$colname[df$iso == country_to_fill] <- rowSums(d1[grepl('colname', names(d1))])
  return(df)
}

fill_countries(df, 'BLX', c('BEL', 'LUX'))
#   iso year colname
#1  BEL 1990      NA
#2  BEL 1991      10
#3  BEL 1992      20
#4  BEL 1993      30
#5  BEL 1994      10
#6  LUX 1990       5
#7  LUX 1991       3
#8  LUX 1992      NA
#9  LUX 1993       7
#10 LUX 1994       3
#11 BLX 1990      NA
#12 BLX 1991      13
#13 BLX 1992      NA
#14 BLX 1993      37
#15 BLX 1994      13
0
votes

With data.table this can be solved in a "one-liner":

library(data.table) # CRAN version 1.10.4 used
# select countries, aggregate by year, 
# finally, append resulting rows to original data.frame 
rbind(DF, setDT(DF)[iso %in% c("BEL", "LUX"), 
                    .(iso = "BLX", colname = sum(colname)), by = year])

which returns:

    iso year colname
 1: BEL 1990      NA
 2: BEL 1991      10
 3: BEL 1992      20
 4: BEL 1993      30
 5: BEL 1994      10
 6: LUX 1990       5
 7: LUX 1991       3
 8: LUX 1992      NA
 9: LUX 1993       7
10: LUX 1994       3
11: BLX 1990      NA
12: BLX 1991      13
13: BLX 1992      NA
14: BLX 1993      37
15: BLX 1994      13

The OP has indicated that there are several regions which he needs to combine, not just Belgium and Luxemburg. The above code can be embedded in a call to lapply() to combine several regions at once:

# define countries and names of regions
map <- list(
  BLX = c("BEL", "LUX"),
  BNL = c("BEL", "NLD", "LUX"), # BeNeLux countries
  IBE = c("AND", "ESP", "GIB", "PRT") # Iberian peninsula
)
# aggregate regions and add to original data set
setDT(DF)
rbindlist(c(
  list(DF),
  lapply(seq_along(map), function(i) 
    DF[iso %in% map[[i]], .(iso = names(map)[i], colname = sum(colname)), by = year]
  )), use.names = TRUE)

Note that index numbers i are used to access the names within map. lapply() returns a list of data.table objects, so rbindlist() is used to append all together but we need to set use.names = TRUE explicitely.

    iso year colname
 1: BEL 1990      NA
 2: BEL 1991      10
 3: BEL 1992      20
 4: BEL 1993      30
 5: BEL 1994      10
 6: LUX 1990       5
 7: LUX 1991       3
 8: LUX 1992      NA
 9: LUX 1993       7
10: LUX 1994       3
11: BLX 1990      NA
12: BLX 1991      13
13: BLX 1992      NA
14: BLX 1993      37
15: BLX 1994      13
16: BNL 1990      NA
17: BNL 1991      13
18: BNL 1992      NA
19: BNL 1993      37
20: BNL 1994      13