3
votes

Suppose I have two data frames:

A = data frame consisting of unique phone numbers with an additional factor column. Suppose nrow(A) = 20

B = data frame consisting of rows representing unique households and four columns for listed phone numbers, and fifth column for unique household ID. It is possible that the same number is repeated in multiple of the B columns. Suppose nrow(B) = 100

I want to return a table that has the the "A" unique phone numbers with the household ID from "B" after checking whether the A phone numbers are in either of the four columns.

So for example:

a <- data.frame(phone=c("12345","12346","12456"),
                factor=c("OK","BAD","BAD"))
b <- data.frame(ph1 = c("12345","","12346","12347",""), 
                ph2 = c("","","12346","","12348"), 
                ph3 = c("","","","12456","67890"), 
                hhid = seq(1121,1125))

How can I return C that would look like the following:

c <- data.frame(phone = c("12345","12346","12456"),
                factor = c("OK","BAD","BAD"), 
                hhid = c("1121","1123","1124"))

I'm sure it's possible to do this in a really elegant way or with minimal amount of code. I thought about using for loops or merge but think that is on the wrong track. Open to using whichever packages.

3
Update - I received a bunch of different suggestions using different packages. This helps me learn about the different packages but also what base can do. My need is filled - but please, do feel free to share anything else that might good to know regarding this question.Sehj Kashyap

3 Answers

3
votes
library(dplyr)
library(tidyr)

a <- data.frame(phone=c("12345","12346","12456"),
                factor=c("OK","BAD","BAD"))
b <- data.frame(ph1 = c("12345","","12346","12347",""), 
                ph2 = c("","","12346","","12348"), 
                ph3 = c("","","","12456","67890"), 
                hhid = seq(1121,1125))

# reshape data and keep unique combinations
b2 = b %>% 
  gather(ph, phone, -hhid) %>% 
  select(-ph) %>% 
  distinct()

# join data frames
left_join(a, b2, by = "phone")

#   phone factor hhid
# 1 12345     OK 1121
# 2 12346    BAD 1123
# 3 12456    BAD 1124
2
votes

Here is one option with data.table

library(data.table)
setDT(a)[unique(setDT(b)[, .(phone = unlist(.SD)), hhid][phone != ""]),
          hhid := hhid, on = .(phone)]
a
#   phone factor hhid
#1: 12345     OK 1121
#2: 12346    BAD 1123
#3: 12456    BAD 1124
0
votes

Here is base R solution given that you read the data in as characters or with options: options(stringsAsFactors = F)

tmp <- unique(reshape(b, 
    direction="long",
    varying = 1:3,
    v.names="phone",
    timevar = "variable")[,c(1, 3)])
tmp[tmp$phone!="",]
merge(tmp, a, by="phone")
#  phone hhid factor
#1 12345 1121     OK
#2 12346 1123    BAD
#3 12456 1124    BAD