0
votes

I'm sitting in front of a dataframe that looks like this:

      country year Indicator         a         b        c
48996      US 2003      var1        NA        NA       NA
16953      FR 1988      var2        NA 10664.920       NA
22973      FR 1943      var3        NA  5774.334       NA
8760       CN 1995      var4  8804.565        NA 12750.31
47795      US 2012      var5        NA        NA       NA
30033      GB 1969      var6        NA 29631.362       NA
25796      FR 1921      var7        NA 14004.520       NA
39534      NL 1941      var8        NA        NA       NA
42255      NZ 1969      var8        NA        NA       NA
7249       CN 1995      var9 50635.862        NA 75260.56

What I want to do is basically a long to wide transformation with Indicator as key variable. I would usually use spread() from the tidyr package. However, spread() unfortunately does not accept multiple value columns (in this case a, b and c) and it does not fully do what I want to achieve:

  1. Make the entries of Indicator the new columns
  2. Keep the Country / Year combinations as rows
  3. Creat a UNIQUE row for every old value from a, b and c
  4. Create a Dummy Variable for every "old" value column name (i.e. a, b, c)

So in the end, the Chinese observations of my example should become

country year var1 [...] var4       [...]   var9       dummy.a dummy.b dummy.c 
CN      1995 NA         8804.565           50635.862        1       0       0
CN      1995 NA         12750.31           75260.56         0       0       1

As my original dataframe is 58.162x119, I would appreciate something that does not include a lot of manual work :-)

I hope I was clear in what I want to achieve. Thanks for your help!


The above mentioned dataframe can be reproduced using the following code:

structure(list(country = c("US", "FR", "FR", "CN", "US", "GB", 
"FR", "NL", "NZ", "CN"), year = c(2003L, 1988L, 1943L, 1995L, 
2012L, 1969L, 1921L, 1941L, 1969L, 1995L), Indicator = structure(c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 9L), .Label = c("var1", "var2", 
"var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10", 
"var11", "var12", "var13", "var14", "var15", "var16", "var17", 
"var18"), class = "factor"), a = c(NA, NA, NA, 8804.56480733, 
NA, NA, NA, NA, NA, 50635.8621327), b = c(NA, 10664.9199219, 
5774.33398438, NA, NA, 29631.3618614, 14004.5195312, NA, NA, 
NA), c = c(NA, NA, NA, 12750.3056855, NA, NA, NA, NA, NA, 75260.555946
)), .Names = c("country", "year", "Indicator", "a", "b", "c"), row.names = c(48996L, 
16953L, 22973L, 8760L, 47795L, 30033L, 25796L, 39534L, 42255L, 
7249L), class = "data.frame")
2
Imo, that's a very bad data format, but you can get there like library(data.table); melt(setDT(DF, keep.rownames = TRUE), id=c("rn", "country", "year", "Indicator"))[!is.na(value), dcast(.SD, country + year + variable ~ Indicator)][, dcast(.SD, ... ~ variable, value.var="variable", fun=length)] - Frank
I think your expected based on the input is not right. For example, Var4 for 'year' 1983 should be 8804.565 and 12750.306 - akrun
The dataset you provide using dput is different than your example. For example, in row 4, is the year 1983 or 1995? - www
My bad, fixed it. I did change a year by hand to make it more clear what I want to achieve and forgot to change it in the sample code. Sorry! - yrx1702
Thanks for the updated dataset. Could you explain why the first CN row is 1, 0, 0 and the second one is 0, 0, 1 in the dummy variable a to c? Because based on your original data frame, a and c both have values for these two rows. - www

2 Answers

2
votes

Here's my solution:

require(tidyr)
mydf <- structure(list(country = c("US", "FR", "FR", "CN", "US", "GB", 
    "FR", "NL", "NZ", "CN"), year = c(2003L, 1988L, 1943L, 1995L, 
    2012L, 1969L, 1921L, 1941L, 1969L, 1995L), Indicator = structure(c(1L, 
    2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 9L), .Label = c("var1", "var2", 
    "var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10", 
    "var11", "var12", "var13", "var14", "var15", "var16", "var17", 
    "var18"), class = "factor"), a = c(NA, NA, NA, 8804.56480733, 
    NA, NA, NA, NA, NA, 50635.8621327), b = c(NA, 10664.9199219, 
    5774.33398438, NA, NA, 29631.3618614, 14004.5195312, NA, NA, 
    NA), c = c(NA, NA, NA, 12750.3056855, NA, NA, NA, NA, NA, 75260.555946
    )), .Names = c("country", "year", "Indicator", "a", "b", "c"), row.names = c(48996L, 
    16953L, 22973L, 8760L, 47795L, 30033L, 25796L, 39534L, 42255L, 
    7249L), class = "data.frame")

mydf %>% gather(key=newIndicator,value=values, a,b,c) %>% filter(!is.na(values)) %>% spread(key=Indicator,values) %>% mutate(indicatorValues=1) %>% spread(newIndicator,indicatorValues,fill=0)

The output

# country year     var2     var3      var4     var6     var7     var9 a b c
# 1      CN 1995       NA       NA  8804.565       NA       NA 50635.86 1 0 0
# 2      CN 1995       NA       NA 12750.306       NA       NA 75260.56 0 0 1
# 3      FR 1921       NA       NA        NA       NA 14004.52       NA 0 1 0
# 4      FR 1943       NA 5774.334        NA       NA       NA       NA 0 1 0
# 5      FR 1988 10664.92       NA        NA       NA       NA       NA 0 1 0
# 6      GB 1969       NA       NA        NA 29631.36       NA       NA 0 1 0
0
votes

dt would be your original data. dt2 is the final output.

dt2 <- dt %>%
  gather(Parameter, Value, a:c) %>%
  spread(Indicator, Value) %>%
  mutate(Data = ifelse(rowSums(is.na(.[, paste0("var", 1:9)])) != 9, 1, 0)) %>%
  filter(Data != 0) %>%
  spread(Parameter, Data, fill = 0) %>%
  rename(dummy.a = a, dummy.b = b, dummy.c = c)