I have two data frames

    Word1<-c("bat", "ban", "bait")
    df1<-data.frame(Word1,  stringsAsFactors=FALSE)

and

    Word2<-c("cat", "cab", "ban", "at", "done", "dot", "ran", "cant")
    df2<-data.frame(Word2,  stringsAsFactors=FALSE)

I want to calculate the Levenshtein distance for the words in df1 against the words in df2.

I want something like this:

    Word1<-c("bat", "ban", "bait")
    links<-c("cat, ban, at", "ran","")
    counts<-c("3","1","0")
    df3<-data.frame(Word1, links, counts,  stringsAsFactors=FALSE)

It's similar in calculation to my previous question, but requires two separate data frames. Here's the link to that question:

calculate number and names of similar sounding words from a data frame

If you are indeed interested in similar sounding words, then you may be after phonetic() from the stringdist library. – tmfmnk
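
For reference, a minimal sketch of that idea (assuming the stringdist package is installed; words that share a soundex code are treated as sounding alike):

    library(stringdist)

    # soundex codes for the words in df1; words sharing a code "sound alike"
    phonetic(df1$Word1)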

1 Answer

An option is agrep. From the description of agrep:

Searches for approximate matches to pattern (the first argument) within each element of the string x (the second argument) using the generalized Levenshtein edit distance (the minimal possibly weighted number of insertions, deletions and substitutions needed to transform one string into another).
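
For instance, a quick check against df2 (the default max.distance = 0.1 is a fraction of the pattern length, which for a three-letter pattern such as "bat" should allow an edit distance of 1):

    # indices of the approximate matches of "bat" in df2$Word2
    agrep("bat", df2$Word2)
    # value = TRUE returns the matching words themselves instead of indices
    agrep("bat", df2$Word2, value = TRUE)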

We loop through the elements of 'Word1' in 'df1', apply agrep against the 'Word2' column of 'df2', extract the matching elements, and create a 'counts' column from the number of matches. Finally, we rbind the list of data.frames into a single one and cbind it with 'Word1':

    # for each word in 'Word1', find the approximate matches in 'Word2',
    # collapse them into a single string and count them
    cbind(df1['Word1'], do.call(rbind, lapply(df1$Word1, function(x) {
        i1 <- agrep(x, df2$Word2)
        data.frame(links = toString(df2$Word2[i1]), counts = length(i1))
    })))
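
If an exact Levenshtein cutoff is preferred over agrep's fractional max.distance, a base R sketch with adist() could look like this (assuming an edit distance of at most 1 counts as a link; note that, unlike the desired df3 above, it also picks up exact matches such as "ban" matching itself):

    # distance matrix: rows correspond to Word1, columns to Word2
    d <- adist(df1$Word1, df2$Word2)

    # keep the words within the chosen cutoff and count them per row
    data.frame(Word1 = df1$Word1,
               links = apply(d, 1, function(r) toString(df2$Word2[r <= 1])),
               counts = rowSums(d <= 1),
               stringsAsFactors = FALSE)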