0
votes

I have some awfully dirty data that I'm really struggling to clean. An example of the issue is below:

ID    NAME    ADDRESS               EMAIL     PHN
1   Alice 123 Street     [email protected] 5555555
1   Alice 123 Street                <NA> 4444444
2     Bob   9 Circle       [email protected] 1111111
3 Charlie      4 Ave   [email protected] 3333333
3 Charlie      4 Ave [email protected] 3333333
3 Charlie      4 Ave                <NA>      NA
4    Doug    1 Court                <NA> 6666666

The desired output is something like this:

ID    NAME    ADDRESS           EMAIL_1             EMAIL_2   PHN_1   PHN_2
1   Alice 123 Street   [email protected]                <NA> 5555555 4444444
2     Bob   9 Circle     [email protected]                <NA> 1111111      NA
3 Charlie      4 Ave [email protected] [email protected] 3333333      NA
4    Doug    1 Court              <NA>                <NA> 6666666      NA

With the understanding that there can be an arbitrary expansion of the EMAIL and PHN variables (i.e., there might be n repetitions of ID that have different (or NA) values.)

My solution thus far:

df.test <- df %>%
  group_by(ID) %>%
  mutate(EMAILID = paste0("EMAIL_",row_number())) %>%
  spread(EMAILID,EMAIL) %>%
  mutate(PHONEID = paste0('PHN_',row_number())) %>%
  spread(PHONEID,PHN)

But this produces an even more malformed data.frame:

ID    NAME    ADDRESS           EMAIL_1             EMAIL_2 EMAIL_3   PHN_1   PHN_2 PHN_3
1   Alice 123 Street   [email protected]                <NA>    <NA> 5555555      NA    NA
1   Alice 123 Street              <NA>                <NA>    <NA>      NA 4444444    NA
2     Bob   9 Circle     [email protected]                <NA>    <NA> 1111111      NA    NA
3 Charlie      4 Ave [email protected]                <NA>    <NA> 3333333      NA    NA
3 Charlie      4 Ave              <NA> [email protected]    <NA>      NA 3333333    NA
3 Charlie      4 Ave              <NA>                <NA>    <NA>      NA      NA    NA
4    Doug    1 Court              <NA>                <NA>    <NA> 6666666      NA    NA

Any help? I suspect that my issue is something to do with the spread() command, but my attempts so far have proven pretty fruitless. Thanks.

3
What happens if you take the mutates out? - Ryan Morton
@RyanMorton: if I replace the code with: df.test <- df %>% group_by(ID) %>% spread(ID,EMAIL) %>% spread(ID,PHN) I get Error: Duplicate identifiers for rows (4,5) - Kyle Shank
Hm, could you add NAME to the group_by? - Ryan Morton
Results in the same error, unfortunately. Same thing if I try with ADDRESS. - Kyle Shank
How about a summarise_each(funs(first) at the piped to the end of the original? - Ryan Morton

3 Answers

3
votes

You need to summarize not mutate, then use separate to split the results. To do this dynamically you can determine the number of distinct email and phone groups to use ahead of time, use separate_ and then set fill = right to remove the warnings. The last two mutate statements are there to clean up NA values turned into strings.

library(dplyr)
library(tidyr)

cols <- cols <- df %>% 
  group_by(ID) %>% 
  filter(!is.na(PHN), !is.na(EMAIL)) %>% 
  group_size() %>% 
  max()

df %>%
  group_by(ID, NAME, ADDRESS) %>%
  summarize_each(funs(toString(unique(.[!is.na(.)]))), EMAIL, PHN) %>% 
  separate_("EMAIL", sprintf("EMAIL%s", 1:cols), sep = ",", fill = "right") %>% 
  separate_("PHN", sprintf("PHN%s", 1:cols), sep = ",", fill = "right") %>% 
  mutate_if(is.character, trimws) %>% 
  mutate_each(funs(replace(., grep("NA", .), NA)))

  Source: local data frame [4 x 7]
Groups: ID, NAME [4]

     ID    NAME    ADDRESS            EMAIL1              EMAIL2    PHN1    PHN2
  <int>  <fctr>     <fctr>             <chr>               <chr>   <chr>   <chr>
1     1   Alice 123 Street   [email protected]                <NA> 5555555 4444444
2     2     Bob   9 Circle     [email protected]                <NA> 1111111    <NA>
3     3 Charlie      4 Ave [email protected] [email protected] 3333333    <NA>
4     4    Doug    1 Court              <NA>                <NA> 6666666    <NA>

The warnings will be thrown

0
votes

1) reshape Using base R this can be done in 3 lines. The first line of code adds a sequence number for each ID and the last performs the transformation from long to wide. The second line of code reshapes the data frame from long to wide and the last line of code removes an columns that contain only NAs. (If columns of NAs are unlikely or you don't mind them then the third line of code could be omitted.)

df2 <- transform(df.test, seq = ave(ID, ID, FUN = seq_along))
df2 <- reshape(df2, dir = "wide", timevar = "seq", idvar = c("ID", "NAME", "ADDRESS"))
subset(df2, select = !apply(is.na(df.test2), 2, all))

giving:

  ID    NAME    ADDRESS           EMAIL.1   PHN.1             EMAIL.2   PHN.2
1  1   Alice 123 Street   [email protected] 5555555                <NA> 4444444
3  2     Bob   9 Circle     [email protected] 1111111                <NA>      NA
4  3 Charlie      4 Ave [email protected] 3333333 [email protected] 3333333
7  4    Doug    1 Court              <NA> 6666666                <NA>      NA

2) magrittr The same code except formed into a magrittr pipeline could be written:

library(magrittr)

df.test %>%
   transform(seq = ave(ID, ID, FUN = seq_along)) %>%
   reshape(dir = "wide", timevar = "seq", idvar = c("ID", "NAME", "ADDRESS")) %>%
   subset(select = !apply(is.na(.), 2, all))

Note: The input df.test in reproducible form is:

Lines <- "
ID,NAME,ADDRESS,EMAIL,PHN
1,Alice,123 Street,[email protected],5555555
1,Alice,123 Street,NA,4444444
2,Bob,9 Circle,[email protected],1111111
3,Charlie,4 Ave,[email protected],3333333
3,Charlie,4 Ave,[email protected],3333333
3,Charlie,4 Ave,NA,
4,Doug,1 Court,NA,6666666"
df.test <- read.csv(text=Lines)
0
votes

For anyone concerned about deprecation warnings with summarize_each, the following code works with currently supported functions:

df.test %>% 
  group_by(ID, NAME, ADDRESS) %>%
  summarize_at(vars(EMAIL, PHN), funs(toString(unique(.[!is.na(.)])))) %>%
  separate(EMAIL, sprintf('EMAIL%s', 1:cols), sep = ",", fill = 'right') %>%
  separate(PHN, sprintf('PHN%s', 1:cols), sep = ",", fill = 'right') %>%
  mutate_if(is.character, trimws) %>%
  mutate_all(funs(replace(., grep("NA", .), NA)))