1
votes
library(tidyverse)  

This question is inspired by Jake Kaupp's answer to the question "tidyr/dplyr - spreading multiple variables for duplicate ids". Using the sample data provided (bottom), I want to have only one Id per row, however there are some instances of repeated id's (148 and 188). Since there are different phone numbers, I would like to spread these out dynamically so that there is a column for each number. In my real dataset I'm not sure how many repeating Id's there will be and this will continually change. I would like to do all this within Tidyverse, but I'm stuck:

First, I used the code below to find the max number of columns needed...

cols <-Df %>% 
group_by(Id) %>%
group_size() %>% 
max()

Next, I want to do something like the code below but "separate_" is deprecated. I'm not sure what to use here? I've tried various attempts with tidyr::spread, and looked at other options to solve duplicate identifiers, such as adding an index columns (mutate(I=row_numbers()), but couldn't get this to work either.

Df%>%
group_by(Id) %>%
separate_("Ph1", paste0("1:3", 1:cols), sep = ",", fill = "right") 

Is there a simple way to accomplish this? The dynamic aspect is important since the dataset will always change and ultimately I would like to make this part of a function.

Sample Data:

Id<-c(199,148,148,145,177,165,144,121,188,188,188,111)
Ph1<-c(6532881717,6572231223,6541132112,6457886543,6548887777,7372222222,6451123425,6783450101,7890986543,6785554444,8764443344,6453348736)
Ph2<-c(NA,NA,NA,NA,NA,7372222222,NA,NA,NA,6785554444,NA,NA)

Df<-data.frame(Id,Ph1,Ph2)
1

1 Answers

2
votes

You can create a new column to assign a unique id for each phone number for a given Id and then spread using that new column. This avoids the "duplicate id" problem. When spreading, this approach inherently adds as many columns as are needed to accommodate all of the unique phone numbers for each Id. For example, in the code below I've called that new column seq (for "sequence"):

library(tidyverse)

Df %>% 
  gather(key, value, -Id) %>%
  filter(!is.na(value)) %>% 
  select(-key) %>% 
  group_by(Id) %>% 
  filter(!duplicated(value)) %>% 
  mutate(seq=paste0("Phone_",1:n())) %>% 
  spread(seq, value)
     Id    Phone_1    Phone_2    Phone_3
1   111 6453348736         NA         NA
2   121 6783450101         NA         NA
3   144 6451123425         NA         NA
4   145 6457886543         NA         NA
5   148 6572231223 6541132112         NA
6   165 7372222222         NA         NA
7   177 6548887777         NA         NA
8   188 7890986543 6785554444 8764443344
9   199 6532881717         NA         NA