I have some awfully dirty data that I'm really struggling to clean. An example of the issue is below:
ID NAME ADDRESS EMAIL PHN
1 Alice 123 Street [email protected] 5555555
1 Alice 123 Street <NA> 4444444
2 Bob 9 Circle [email protected] 1111111
3 Charlie 4 Ave [email protected] 3333333
3 Charlie 4 Ave [email protected] 3333333
3 Charlie 4 Ave <NA> NA
4 Doug 1 Court <NA> 6666666
The desired output is something like this:
ID NAME ADDRESS EMAIL_1 EMAIL_2 PHN_1 PHN_2
1 Alice 123 Street [email protected] <NA> 5555555 4444444
2 Bob 9 Circle [email protected] <NA> 1111111 NA
3 Charlie 4 Ave [email protected] [email protected] 3333333 NA
4 Doug 1 Court <NA> <NA> 6666666 NA
With the understanding that there can be an arbitrary expansion of the EMAIL and PHN variables (i.e., there might be n repetitions of ID that have different (or NA) values.)
My solution thus far:
df.test <- df %>%
group_by(ID) %>%
mutate(EMAILID = paste0("EMAIL_",row_number())) %>%
spread(EMAILID,EMAIL) %>%
mutate(PHONEID = paste0('PHN_',row_number())) %>%
spread(PHONEID,PHN)
But this produces an even more malformed data.frame:
ID NAME ADDRESS EMAIL_1 EMAIL_2 EMAIL_3 PHN_1 PHN_2 PHN_3
1 Alice 123 Street [email protected] <NA> <NA> 5555555 NA NA
1 Alice 123 Street <NA> <NA> <NA> NA 4444444 NA
2 Bob 9 Circle [email protected] <NA> <NA> 1111111 NA NA
3 Charlie 4 Ave [email protected] <NA> <NA> 3333333 NA NA
3 Charlie 4 Ave <NA> [email protected] <NA> NA 3333333 NA
3 Charlie 4 Ave <NA> <NA> <NA> NA NA NA
4 Doug 1 Court <NA> <NA> <NA> 6666666 NA NA
Any help? I suspect that my issue is something to do with the spread() command, but my attempts so far have proven pretty fruitless. Thanks.
df.test <- df %>% group_by(ID) %>% spread(ID,EMAIL) %>% spread(ID,PHN)I get Error: Duplicate identifiers for rows (4,5) - Kyle Shanksummarise_each(funs(first)at the piped to the end of the original? - Ryan Morton