1
votes

I have the two data frames which I have to merge. There is a column in both the data frames on which I want to merge the two data frames. But the data in these two columns are not similar. The key column in these two data frames having the length of 12 digits and other one have 5 -6 digits. I want merge on the basis of similar 5-6 digits from the second data frame.

My data Frame:

df1 = data.frame(CustomerId = c(987689000000,786581000000,765909000000,565400000000,746541000000,516890000000), Product = c(rep("Toaster", 3), rep("Radio", 3)))   

df2 = data.frame(customerId = c(987689,986581,7659090,56540,74651,5168900), State = c(rep("Alabama", 2), rep("Ohio", 1)))

I tried c = merge(df1,df2 , key =("CustomerId "),all = TRUE)

my Expected Output like :-

  CustomerId  Product    State
1  987689     Toaster     Alabama
2  786581     Toaster      Alabama
3  7659090    Toaster      Alabama
4  56540       Radio      Alabama
5  74651       Radio      Alabama
6  516890     Radio        Alabama
1

1 Answers

1
votes

Here is a solution. The key is to use formatC to adjust the format of numbers and use str_extract to extract the part that matched. After this step, you can determine if you want to use left_join, right_join, or inner_join to keep which part of the data frames. df3 is the final output.

Notice that the examples you provided contain IDs that do not match, so based on the data frames you provided it is impossible to reproduce your desired output.

# Load packages
library(dplyr)
library(stringr)
library(rebus)

# Process the data
df3 <- df1 %>%
  # Use str_extract to get CustomerId matched in df2
  mutate(CustomerId = str_extract(string = formatC(CustomerId, 
                                                   digits = 0, 
                                                   format = "f"), 
                                  pattern = or1(df2$customerId))) %>%
  # Join with df2 by the updated CustomerId
  right_join(df2 %>% 
               mutate(CustomerId = as.character(customerId)) %>%
               select(-customerId), 
            by = "CustomerId")

# View the result
df3
#  CustomerId Product   State
#1     987689 Toaster Alabama
#2     986581    <NA> Alabama
#3    7659090 Toaster    Ohio
#4      56540   Radio Alabama
#5      74651    <NA> Alabama
#6    5168900   Radio    Ohio