I have two data frames

    Word1<-c("bat", "ban", "bait")
    df1<-data.frame(Word1,  stringsAsFactors=FALSE)

and

    Word2<-c("cat", "cab", "ban", "at", "done", "dot", "ran", "cant")
    df2<-data.frame(Word2,  stringsAsFactors=FALSE)

I want to calculate the Levenshtein distance for the words in df1 against the words in df2.

I want something like this:

    Word1<-c("bat", "ban", "bait")
    links<-c("cat, ban, at", "ran","")
    counts<-c("3","1","0")
    df3<-data.frame(Word1, links, counts,  stringsAsFactors=FALSE)

It's similar in calculation to my previous question, but requires two separate data frames. Here's the link to that question:

calculate number and names of similar sounding words from a data frame

If you are indeed interested in similar sounding words, then you may be after phonetic() from the stringdist library. – tmfmnk
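
For reference, a minimal sketch of that idea (assuming the stringdist package is installed; words that share a soundex code are treated as sounding alike):

    library(stringdist)

    # soundex codes for the words in df1; words sharing a code "sound alike"
    phonetic(df1$Word1)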

1 Answer

An option is agrep. From the description of agrep:

Searches for approximate matches to pattern (the first argument) within each element of the string x (the second argument) using the generalized Levenshtein edit distance (the minimal possibly weighted number of insertions, deletions and substitutions needed to transform one string into another).
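
For instance, a quick check against df2 (the default max.distance = 0.1 is a fraction of the pattern length, which for a three-letter pattern such as "bat" should allow an edit distance of 1):

    # indices of the approximate matches of "bat" in df2$Word2
    agrep("bat", df2$Word2)
    # value = TRUE returns the matching words themselves instead of indices
    agrep("bat", df2$Word2, value = TRUE)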

We loop through the elements of 'Word1' in 'df1', apply agrep against the 'Word2' column of 'df2', extract the matching elements, and create a 'counts' column from the number of matches. Finally, we rbind the list of data.frames into a single one and cbind it with 'Word1':

    # for each word in 'Word1', find the approximate matches in 'Word2',
    # collapse them into a single string and count them
    cbind(df1['Word1'], do.call(rbind, lapply(df1$Word1, function(x) {
        i1 <- agrep(x, df2$Word2)
        data.frame(links = toString(df2$Word2[i1]), counts = length(i1))
    })))
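
If an exact Levenshtein cutoff is preferred over agrep's fractional max.distance, a base R sketch with adist() could look like this (assuming an edit distance of at most 1 counts as a link; note that, unlike the desired df3 above, it also picks up exact matches such as "ban" matching itself):

    # distance matrix: rows correspond to Word1, columns to Word2
    d <- adist(df1$Word1, df2$Word2)

    # keep the words within the chosen cutoff and count them per row
    data.frame(Word1 = df1$Word1,
               links = apply(d, 1, function(r) toString(df2$Word2[r <= 1])),
               counts = rowSums(d <= 1),
               stringsAsFactors = FALSE)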