I'm sitting in front of a dataframe that looks like this:
country year Indicator a b c
48996 US 2003 var1 NA NA NA
16953 FR 1988 var2 NA 10664.920 NA
22973 FR 1943 var3 NA 5774.334 NA
8760 CN 1995 var4 8804.565 NA 12750.31
47795 US 2012 var5 NA NA NA
30033 GB 1969 var6 NA 29631.362 NA
25796 FR 1921 var7 NA 14004.520 NA
39534 NL 1941 var8 NA NA NA
42255 NZ 1969 var8 NA NA NA
7249 CN 1995 var9 50635.862 NA 75260.56
What I want to do is basically a long to wide transformation with Indicator as key variable. I would usually use spread() from the tidyr package. However, spread() unfortunately does not accept multiple value columns (in this case a, b and c) and it does not fully do what I want to achieve:
- Make the entries of
Indicatorthe new columns - Keep the Country / Year combinations as rows
- Creat a UNIQUE row for every old value from
a,bandc - Create a Dummy Variable for every "old" value column name (i.e. a, b, c)
So in the end, the Chinese observations of my example should become
country year var1 [...] var4 [...] var9 dummy.a dummy.b dummy.c
CN 1995 NA 8804.565 50635.862 1 0 0
CN 1995 NA 12750.31 75260.56 0 0 1
As my original dataframe is 58.162x119, I would appreciate something that does not include a lot of manual work :-)
I hope I was clear in what I want to achieve. Thanks for your help!
The above mentioned dataframe can be reproduced using the following code:
structure(list(country = c("US", "FR", "FR", "CN", "US", "GB",
"FR", "NL", "NZ", "CN"), year = c(2003L, 1988L, 1943L, 1995L,
2012L, 1969L, 1921L, 1941L, 1969L, 1995L), Indicator = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 9L), .Label = c("var1", "var2",
"var3", "var4", "var5", "var6", "var7", "var8", "var9", "var10",
"var11", "var12", "var13", "var14", "var15", "var16", "var17",
"var18"), class = "factor"), a = c(NA, NA, NA, 8804.56480733,
NA, NA, NA, NA, NA, 50635.8621327), b = c(NA, 10664.9199219,
5774.33398438, NA, NA, 29631.3618614, 14004.5195312, NA, NA,
NA), c = c(NA, NA, NA, 12750.3056855, NA, NA, NA, NA, NA, 75260.555946
)), .Names = c("country", "year", "Indicator", "a", "b", "c"), row.names = c(48996L,
16953L, 22973L, 8760L, 47795L, 30033L, 25796L, 39534L, 42255L,
7249L), class = "data.frame")
library(data.table); melt(setDT(DF, keep.rownames = TRUE), id=c("rn", "country", "year", "Indicator"))[!is.na(value), dcast(.SD, country + year + variable ~ Indicator)][, dcast(.SD, ... ~ variable, value.var="variable", fun=length)]- Frankdputis different than your example. For example, in row 4, is the year1983or1995? - wwwCNrow is1, 0, 0and the second one is0, 0, 1in the dummy variablea to c? Because based on your original data frame,aandcboth have values for these two rows. - www