Conditionally filling in missing values while reshaping a long to wide dataset in R

Question

I am constructing complete timelines of indicators for a set of years and countries on the basis of multiple datasets with varying quality.

Using reshape2 I have "melted" those datasets into a single dataframe.

Example dataset:

d <- structure(list(cntry = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 
1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("BE", 
"DE", "GE"), class = "factor"), year = c(1960L, 1970L, 1980L, 
1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 1970L, 1960L, 
1970L, 1960L, 1970L, 1960L, 1970L, 1970L, 1980L), indicator = c(5.5, 
1.2, 1.5, NA, 1.4, NA, NA, 5.5, 1.2, 2.3, 1.4, NA, 1.4, NA, NA, 
2.3, 1.4, 1.4, NA), sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "male", class = "factor"), 
    source = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
    3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Council", 
    "Eurostat", "OECD"), class = "factor")), .Names = c("cntry", 
"year", "indicator", "sex", "source"), class = "data.frame", row.names = c(NA, 
-19L))


d
#    cntry year indicator  sex   source
# 1     BE 1960       5.5 male Eurostat
# 2     BE 1970       1.2 male Eurostat
# 3     BE 1980       1.5 male Eurostat
# 4     DE 1960        NA male Eurostat
# 5     DE 1970       1.4 male Eurostat
# 6     GE 1960        NA male Eurostat
# 7     GE 1970        NA male Eurostat
# 8     BE 1960       5.5 male     OECD
# 9     BE 1970       1.2 male     OECD
# 10    DE 1960       2.3 male     OECD
# 11    DE 1970       1.4 male     OECD
# 12    GE 1960        NA male     OECD
# 13    GE 1970       1.4 male     OECD
# 14    BE 1960        NA male  Council
# 15    BE 1970        NA male  Council
# 16    DE 1960       2.3 male  Council
# 17    DE 1970       1.4 male  Council
# 18    GE 1970       1.4 male  Council
# 19    GE 1980        NA male  Council

I was hoping I could use cast() with fun.aggregate to convert this long dataset into wide format, selecting the highest-quality source (Eurostat > OECD > Council) for each country-year combination to fill in the missing values. Unfortunately, I do not really understand how to work with such a custom aggregate function.

In other words, I want to reshape the dataset from long to wide format while merging multiple values according to the value of a factor ("source"). Ideally it would work something like this:

full_data <- expand.grid(c('BE', 'GE', 'DE'), c('1960', '1970', '1980'))
full_data <- fill_missings(full_data, d, pref_order=c('Eurostat', 'OECD', 'Council'))
full_data
# BE 1960 5.5 male Eurostat
# BE 1970 1.2 male Eurostat
# BE 1980 1.5 male Eurostat
# DE 1960 2.3 male OECD
# DE 1970 1.4 male Eurostat
# DE 1980 NA  NA   NA
# GE 1960 NA  male Council 
# GE 1970 1.4 male OECD
# GE 1980 NA  male Council

and optionally (or directly) into the wide format:

# cntry  sex 1960 1970 1980
#    BE male  5.5  1.2  1.5
#    DE male  2.3  1.4  NA
#    GE male   NA  1.4  NA

Arun · Accepted Answer · 2014-03-17T17:20:13

Assuming the data is already in the order you require, that is, the source column is ordered first by Eurostat, then by OECD, and then by Council, I'd use data.table like this:

require(data.table) # >= v1.9.0
setDT(d) # converts data.frame to data.table by reference
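# subset keeps the first row per (cntry, year, indicator) and drops NA values;
# in this data the sources agree wherever they overlap, so one value per cell remains
dcast.data.table(d, cntry + sex ~ year, value.var="indicator", 
  subset=.(!duplicated(d, by=c("cntry", "year", "indicator")) & !is.na(indicator)))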

#    cntry  sex 1960 1970 1980
# 1:    BE male  5.5  1.2  1.5
# 2:    DE male  2.3  1.4   NA
# 3:    GE male   NA  1.4   NA
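
The call above depends on the rows already being sorted by source quality and on the sources reporting identical values wherever they overlap. If two sources disagreed for the same country-year, both rows would survive the subset. A minimal sketch of a variant that ranks the sources explicitly is given below. It is not part of the original answer: the names dt, pref_order, rank, best, grid, full and wide are illustrative, and the example data is rebuilt compactly so the block runs on its own.

```r
library(data.table)

# Same values as the example dataset `d`, rebuilt compactly (illustrative name: dt)
dt <- data.table(
  cntry = c("BE","BE","BE","DE","DE","GE","GE","BE","BE","DE","DE","GE","GE",
            "BE","BE","DE","DE","GE","GE"),
  year = c(1960,1970,1980,1960,1970,1960,1970,1960,1970,1960,1970,1960,1970,
           1960,1970,1960,1970,1970,1980),
  indicator = c(5.5,1.2,1.5,NA,1.4,NA,NA,5.5,1.2,2.3,1.4,NA,1.4,NA,NA,
                2.3,1.4,1.4,NA),
  sex = "male",
  source = rep(c("Eurostat", "OECD", "Council"), times = c(7, 6, 6))
)

# Preference order taken from the question: Eurostat > OECD > Council
pref_order <- c("Eurostat", "OECD", "Council")

# Rank every row by source quality (1 = most preferred)
dt[, rank := match(source, pref_order)]

# Drop missing values, sort by rank, keep the best remaining row per country-year
best <- dt[!is.na(indicator)][order(rank), .SD[1L], by = .(cntry, sex, year)]

# Long result on the complete country-year grid; cells without any value stay NA
grid <- CJ(cntry = unique(dt$cntry), sex = unique(dt$sex), year = unique(dt$year))
full <- best[grid, on = .(cntry, sex, year)]

# Wide result: one column per year
wide <- dcast(best, cntry + sex ~ year, value.var = "indicator")
```

In this sketch, country-year cells for which no source has a value get NA for both indicator and source in full, because rows with a missing indicator are dropped before the best source is chosen.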
