I am constructing complete timelines of indicators for a set of years and countries on the basis of multiple datasets with varying quality.
Using reshape2, I have "melted" those datasets into a single data frame.
Example dataset:
d <- structure(list(cntry = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L,
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("BE",
"DE", "GE"), class = "factor"), year = c(1960L, 1970L, 1980L,
1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L,
1970L, 1960L, 1970L, 1960L, 1970L, 1970L, 1980L), indicator = c(5.5,
1.2, 1.5, NA, 1.4, NA, NA, 5.5, 1.2, 2.3, 1.4, NA, 1.4, NA, NA,
2.3, 1.4, 1.4, NA), sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "male", class = "factor"),
source = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Council",
"Eurostat", "OECD"), class = "factor")), .Names = c("cntry",
"year", "indicator", "sex", "source"), class = "data.frame", row.names = c(NA,
-19L))
d
# cntry year indicator sex source
# 1 BE 1960 5.5 male Eurostat
# 2 BE 1970 1.2 male Eurostat
# 3 BE 1980 1.5 male Eurostat
# 4 DE 1960 NA male Eurostat
# 5 DE 1970 1.4 male Eurostat
# 6 GE 1960 NA male Eurostat
# 7 GE 1970 NA male Eurostat
# 8 BE 1960 5.5 male OECD
# 9 BE 1970 1.2 male OECD
# 10 DE 1960 2.3 male OECD
# 11 DE 1970 1.4 male OECD
# 12 GE 1960 NA male OECD
# 13 GE 1970 1.4 male OECD
# 14 BE 1960 NA male Council
# 15 BE 1970 NA male Council
# 16 DE 1960 2.3 male Council
# 17 DE 1970 1.4 male Council
# 18 GE 1970 1.4 male Council
# 19 GE 1980 NA male Council
I was hoping I could use cast() (dcast() in reshape2) with fun.aggregate to convert this long dataset into the wide format, while selecting the highest-quality source (Eurostat > OECD > Council) for a given country-year combination to fill in the missing values. Unfortunately, I do not really understand how to work with such a custom aggregate function.
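One way such an aggregate function might look, as a minimal sketch assuming reshape2's dcast(): fun.aggregate only receives the vector of values for one cell, not the source column, so the rows are sorted by source preference first and the function then returns the first non-missing value. The helper name first_non_na and the compact rebuild of d are illustrative only, and the sketch assumes dcast() passes the values of each cell in row order, which is worth checking on your reshape2 version.

```r
library(reshape2)

# Same rows as the example data, rebuilt compactly so the sketch is self-contained
d <- data.frame(
  cntry = c("BE", "BE", "BE", "DE", "DE", "GE", "GE",
            rep(c("BE", "DE", "GE"), each = 2),
            rep(c("BE", "DE", "GE"), each = 2)),
  year = c(1960, 1970, 1980, 1960, 1970, 1960, 1970,
           rep(c(1960, 1970), 3),
           1960, 1970, 1960, 1970, 1970, 1980),
  indicator = c(5.5, 1.2, 1.5, NA, 1.4, NA, NA,
                5.5, 1.2, 2.3, 1.4, NA, 1.4,
                NA, NA, 2.3, 1.4, 1.4, NA),
  sex = "male",
  source = rep(c("Eurostat", "OECD", "Council"), times = c(7, 6, 6)),
  stringsAsFactors = FALSE
)

pref_order <- c("Eurostat", "OECD", "Council")

# Hypothetical helper: first non-missing value, or NA if there is none.
# It also has to cope with a zero-length input, because dcast() calls the
# aggregate function on an empty vector to work out the fill value.
first_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[1] else NA_real_
}

# Sort so that the preferred source comes first within each country-year
d_sorted <- d[order(match(d$source, pref_order)), ]

# One row per country and sex, one column per year
wide <- dcast(d_sorted, cntry + sex ~ year,
              value.var = "indicator", fun.aggregate = first_non_na)
wide
```

For the example data this should give one row per country with the columns 1960, 1970 and 1980, where a cell is NA only if no source has a value for it. The information about which source supplied each value is lost in this form.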
In other words, I want to reshape the dataset from a long to a wide format while merging multiple values depending on the value of a factor ("source"). Ideally, it would work something like this:
# fill_missings() does not exist; it is the function I am looking for
full_data <- expand.grid(cntry = c('BE', 'DE', 'GE'), year = c(1960, 1970, 1980))
full_data <- fill_missings(full_data, d, pref_order = c('Eurostat', 'OECD', 'Council'))
full_data
# BE 1960 5.5 male Eurostat
# BE 1970 1.2 male Eurostat
# BE 1980 1.5 male Eurostat
# DE 1960 2.3 male OECD
# DE 1970 1.4 male Eurostat
# DE 1980 NA NA NA
# GE 1960 NA male Council
# GE 1970 1.4 male OECD
# GE 1980 NA male Council
and optionally (or directly) into the wide format:
# cntry sex 1960 1970 1980
# BE male 5.5 1.2 1.5
# DE male 2.3 1.4 NA
# GE male NA 1.4 NA
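The long-format step (the wished-for fill_missings()) could be sketched in base R as follows. This is only one possible approach under stated assumptions: the function name, its arguments and the column names of the grid are hypothetical, d is rebuilt compactly so the block runs on its own, and combinations for which no source has a value end up with NA in indicator, sex and source rather than being attributed to the last source tried.

```r
# Same rows as the example data, rebuilt compactly so the sketch is self-contained
d <- data.frame(
  cntry = c("BE", "BE", "BE", "DE", "DE", "GE", "GE",
            rep(c("BE", "DE", "GE"), each = 2),
            rep(c("BE", "DE", "GE"), each = 2)),
  year = c(1960, 1970, 1980, 1960, 1970, 1960, 1970,
           rep(c(1960, 1970), 3),
           1960, 1970, 1960, 1970, 1970, 1980),
  indicator = c(5.5, 1.2, 1.5, NA, 1.4, NA, NA,
                5.5, 1.2, 2.3, 1.4, NA, 1.4,
                NA, NA, 2.3, 1.4, 1.4, NA),
  sex = "male",
  source = rep(c("Eurostat", "OECD", "Council"), times = c(7, 6, 6)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for every country-year in `grid`, take the value from
# the most preferred source that actually has one.
fill_missings <- function(grid, data, pref_order) {
  # Keep only observed values and sort them by source preference
  obs <- data[!is.na(data$indicator), ]
  obs <- obs[order(match(obs$source, pref_order)), ]
  # The first remaining row per country-year comes from the best source
  best <- obs[!duplicated(obs[c("cntry", "year")]), ]
  # Left join onto the full grid so combinations without data stay NA
  out <- merge(grid, best, by = c("cntry", "year"), all.x = TRUE)
  out <- out[order(out$cntry, out$year), ]
  rownames(out) <- NULL
  out
}

full_data <- expand.grid(cntry = c("BE", "DE", "GE"),
                         year = c(1960, 1970, 1980),
                         stringsAsFactors = FALSE)
full_data <- fill_missings(full_data, d,
                           pref_order = c("Eurostat", "OECD", "Council"))
full_data
```

Sorting by preference before dropping duplicates is what encodes Eurostat > OECD > Council; changing pref_order changes which source wins when several have a value.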