I have a dataset filled with data for a bunch of countries, by year. I need to create data for certain regions (like Belgium-Luxembourg) by adding or otherwise manipulating data for a set of other countries (Belgium and Luxembourg, in this case) and filling the resulting value into the corresponding year for the region.
So for example, say I have data for Belgium (BEL) and Luxembourg (LUX) for the year 2001-2010. I need to be able to say, add, BEL-2001 and LUX-2001 to create BLX-2001. The data set has the columns iso (country code - BEL, LUX, BLX etc), year and VARIABLE. It has all the required rows for BEL, LUX and BLX already (although BLX is empty until we fill it, of course).
A sample DATAFRAME would be:
iso year colname
BEL 1990 NA
BEL 1991 10
BEL 1992 20
BEL 1993 30
BEL 1994 10
a few rows of other countries we don't care for in this case
LUX 1990 5
LUX 1991 3
LUX 1992 NA
LUX 1993 7
LUX 1994 3
a few rows of other countries we don't care for in this case
BLX 1990 NA
BLX 1991 NA
BLX 1992 NA
BLX 1993 NA
BLX 1994 NA
In the above case, we would fill BLX data (adding BEL and LUX values) for 1991, 1992 and 1994 only - since those are the only years where both BEL and LUX have the required data. This would give us:
iso year colname
BEL 1990 NA
BEL 1991 10
BEL 1992 20
BEL 1993 30
BEL 1994 10
a few rows of other countries we don't care for in this case
LUX 1990 5
LUX 1991 3
LUX 1992 NA
LUX 1993 7
LUX 1994 3
a few rows of other countries we don't care for in this case
BLX 1990 NA
BLX 1991 13
BLX 1992 NA
BLX 1993 37
BLX 1994 13
Currently I'm achieving this, using dplyr, by this function that takes the column name and simply adds the values for every available year. This is the simplest example, more complicated manipulation look even messier:
BLXCalc <- function(colname){
LUXData <- filter(DATAFRAME, iso == "LUX" & !is.na(get(colname, envir=as.environment(DATAFRAME)))) # get only those LUX and BEL rows that have the reqd data
BELData <- filter(DATAFRAME, iso == "BEL" & !is.na(get(colname, envir=as.environment(DATAFRAME))))
BLXrange <- grep("BLX", DATAFRAME$iso) # get all BLX rows
ifelse(length(LUXData$year)<length(BELData$year), BLXyears <- LUXData$year, BLXyears <- BELData$year) # use the shorter list for the for loop
for(i in 1:length(BLXyears)){
BLXcurrentyear <- filter(DATAFRAME, iso == "LUX" & year == BLXyears[i])[[colname]] + filter(DATAFRAME, iso == "BEL" & year == BLXyears[i])[[colname]]
BLXrow <- match("BLX", DATAFRAME$iso) + match(BLXyears[i], DATAFRAME$year[BLXrange[1]:BLXrange[length(BLXrange)]]) - 1 # find the corresponding year in BLX
DATAFRAME[[colname]][BLXrow] <<- BLXcurrentyear
}
}
Even for such a simple operation (addition), this is messy code and not very easy to read. A basic breakdown of what I'm doing is:
- Get all years from the required countries that have data for the required variable/column
- Find the country with the shortest number of available years (since we need every required country to have data for a given year in order to calculate region data for that same year) Now we loop over the years that data is available for this country:
- For each year that this country has data available, get the value from the given column, for the required countries, for that year.
- If all other countries have data available for that year, sum it (or other operation - like averages/weighted averages etc)
- Fill this sum into the region's row for this year, in the same column
Steps 3, 4, 5 loop over the available years until we're done.
This works just fine for the data we're working with but I know for loops aren't the best way to use R. Is there any other more "R" way to achieve this same functionality? One that would, perhaps, be faster with larger datasets and preferably overall easier to read.
dput
to make your examples reproducibles – Sotos?dput
to see what it's about. – AkselA