tidyr/dplyr - spreading multiple variables for duplicate ids

Question

I have some awfully dirty data that I'm really struggling to clean. An example of the issue is below:

ID    NAME    ADDRESS               EMAIL     PHN
1   Alice 123 Street     [email protected] 5555555
1   Alice 123 Street                <NA> 4444444
2     Bob   9 Circle       [email protected] 1111111
3 Charlie      4 Ave   [email protected] 3333333
3 Charlie      4 Ave [email protected] 3333333
3 Charlie      4 Ave                <NA>      NA
4    Doug    1 Court                <NA> 6666666

The desired output is something like this:

ID    NAME    ADDRESS           EMAIL_1             EMAIL_2   PHN_1   PHN_2
1   Alice 123 Street   [email protected]                <NA> 5555555 4444444
2     Bob   9 Circle     [email protected]                <NA> 1111111      NA
3 Charlie      4 Ave [email protected] [email protected] 3333333      NA
4    Doug    1 Court              <NA>                <NA> 6666666      NA

With the understanding that there can be an arbitrary expansion of the EMAIL and PHN variables (i.e., there might be n repetitions of ID that have different (or NA) values.)

My solution thus far:

df.test <- df %>%
  group_by(ID) %>%
  mutate(EMAILID = paste0("EMAIL_",row_number())) %>%
  spread(EMAILID,EMAIL) %>%
  mutate(PHONEID = paste0('PHN_',row_number())) %>%
  spread(PHONEID,PHN)

But this produces an even more malformed data.frame:

ID    NAME    ADDRESS           EMAIL_1             EMAIL_2 EMAIL_3   PHN_1   PHN_2 PHN_3
1   Alice 123 Street   [email protected]                <NA>    <NA> 5555555      NA    NA
1   Alice 123 Street              <NA>                <NA>    <NA>      NA 4444444    NA
2     Bob   9 Circle     [email protected]                <NA>    <NA> 1111111      NA    NA
3 Charlie      4 Ave [email protected]                <NA>    <NA> 3333333      NA    NA
3 Charlie      4 Ave              <NA> [email protected]    <NA>      NA 3333333    NA
3 Charlie      4 Ave              <NA>                <NA>    <NA>      NA      NA    NA
4    Doug    1 Court              <NA>                <NA>    <NA> 6666666      NA    NA

Any help? I suspect that my issue is something to do with the spread() command, but my attempts so far have proven pretty fruitless. Thanks.

@RyanMorton: if I replace the code with: df.test <- df %>% group_by(ID) %>% spread(ID,EMAIL) %>% spread(ID,PHN) I get Error: Duplicate identifiers for rows (4,5) — Kyle Shank
Results in the same error, unfortunately. Same thing if I try with ADDRESS. — Kyle Shank
How about a summarise_each(funs(first) at the piped to the end of the original? — Ryan Morton

Jake Kaupp Jake Kaupp · Accepted Answer · 2017-02-13T17:09:07

You need to summarize not mutate, then use separate to split the results. To do this dynamically you can determine the number of distinct email and phone groups to use ahead of time, use separate_ and then set fill = right to remove the warnings. The last two mutate statements are there to clean up NA values turned into strings.

library(dplyr)
library(tidyr)

cols <- cols <- df %>% 
  group_by(ID) %>% 
  filter(!is.na(PHN), !is.na(EMAIL)) %>% 
  group_size() %>% 
  max()

df %>%
  group_by(ID, NAME, ADDRESS) %>%
  summarize_each(funs(toString(unique(.[!is.na(.)]))), EMAIL, PHN) %>% 
  separate_("EMAIL", sprintf("EMAIL%s", 1:cols), sep = ",", fill = "right") %>% 
  separate_("PHN", sprintf("PHN%s", 1:cols), sep = ",", fill = "right") %>% 
  mutate_if(is.character, trimws) %>% 
  mutate_each(funs(replace(., grep("NA", .), NA)))

  Source: local data frame [4 x 7]
Groups: ID, NAME [4]

     ID    NAME    ADDRESS            EMAIL1              EMAIL2    PHN1    PHN2
  <int>  <fctr>     <fctr>             <chr>               <chr>   <chr>   <chr>
1     1   Alice 123 Street   [email protected]                <NA> 5555555 4444444
2     2     Bob   9 Circle     [email protected]                <NA> 1111111    <NA>
3     3 Charlie      4 Ave [email protected] [email protected] 3333333    <NA>
4     4    Doug    1 Court              <NA>                <NA> 6666666    <NA>

The warnings will be thrown

tidyr/dplyr - spreading multiple variables for duplicate ids

3 Answers