Fuzzy string matching of a list of character vectors to a character vector

Question

I have a list of character vectors and a single character vector. In R, I would like to fuzzy match every string in the single vector against each character vector in the list and, for each combination, return the maximum similarity score. Below is a toy example:

a <- c("brown fox", "lazy dog", "white cat", "I don't know", "sunset", "never mind", "excuse me")
b <- c("very late", "do not cross", "sunrise", "long vacation")
c <- c("toy example", "green apple", "tall building", "good rating", "accommodating")
mylist <- list(a,b,c)

charvec <- c("brown dog", "lazy cat", "white dress", "I know that", "excuse me please", "tall person", "new building", "good example", "green with envy", "zebra crossing")

Now I would like to fuzzy match the first string in charvec against every string in the first element of mylist and return the maximum of the resulting 7 similarity scores. Likewise, I want to obtain the maximum score for every other combination of an element of mylist and a string in charvec.

My attempt so far:

Convert the strings in charvec to the column names of an empty data frame

df <- setNames(data.frame(matrix(ncol = 10, nrow = 3)), c(charvec))

Calculate the maximum similarity score for each combination using the jarowinkler similarity from the RecordLinkage package (or a better measure for matching phrases, if there is one):

for (j in seq_along(mylist)) {
  for (i in length(ncol(df))) {
    df[[i,j]] <- max(jarowinkler(names(df)[i], mylist[[j]]))
  }
}

Unfortunately, I get only 3 scores in the first row; the rest of the values are NA.

Any help on this would be highly appreciated.

zack zack · Accepted Answer · 2018-07-12T15:31:55

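Using the purrr package (jarowinkler() comes from RecordLinkage, as in the question):

library(purrr)
library(RecordLinkage)  # provides jarowinkler()

# Name the list elements so the result columns are labelled a, b, c
mylist <- setNames(mylist, c('a', 'b', 'c'))

# One row per string in charvec, one column per element of mylist
map_dfr(charvec,
    function(wrd, vec_list){
      # Best score of wrd against the strings in each list element
      setNames(map_df(vec_list, ~max(jarowinkler(wrd, .x))),
               names(vec_list)
      )
    },
    mylist)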

# A tibble: 10 x 3
       a     b     c
   <dbl> <dbl> <dbl>
 1 0.911 0.580 0.603
 2 0.85  0.713 0.603
 3 0.842 0.557 0.515
 4 0.657 0.490 0.409
 5 0.912 0.489 0.659
 6 0.538 0.546 0.801
 7 0.716 0.547 0.740
 8 0.591 0.524 0.856
 9 0.675 0.509 0.821
10 0.619 0.587 0.630

If you'd like it wide:

map_dfc(charvec,
         function(wrd, vec_list) {
          set_names(list(map_dbl(vec_list, ~max(jarowinkler(wrd, .x)))),
                    wrd)
         },
        mylist
)

# A tibble: 3 x 10
  `brown dog` `lazy cat` `white dress` `I know that` `excuse me plea~ `tall person` `new building` `good example`
        <dbl>      <dbl>         <dbl>         <dbl>            <dbl>         <dbl>          <dbl>          <dbl>
1       0.911      0.85          0.842         0.657            0.912         0.538          0.716          0.591
2       0.580      0.713         0.557         0.490            0.489         0.546          0.547          0.524
3       0.603      0.603         0.515         0.409            0.659         0.801          0.740          0.856
# ... with 2 more variables: `green with envy` <dbl>, `zebra crossing` <dbl>
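
As a side note on why the loop in the question fills only three cells: `length(ncol(df))` is always 1, so `i` never goes past 1, and `df[[i, j]]` addresses row `i`, column `j`, although the data frame has 3 rows and 10 columns. Below is a minimal sketch of a corrected loop in base R. The helper names `lev_sim` and `max_sim_matrix` are made up for illustration. To keep the sketch free of package dependencies, the default similarity is a stand-in (1 minus the normalised Levenshtein distance from `adist`), not Jaro-Winkler.

```r
# Toy data from the question
a <- c("brown fox", "lazy dog", "white cat", "I don't know", "sunset", "never mind", "excuse me")
b <- c("very late", "do not cross", "sunrise", "long vacation")
c <- c("toy example", "green apple", "tall building", "good rating", "accommodating")
mylist <- list(a = a, b = b, c = c)
charvec <- c("brown dog", "lazy cat", "white dress", "I know that", "excuse me please",
             "tall person", "new building", "good example", "green with envy", "zebra crossing")

# Stand-in similarity (an assumption, not Jaro-Winkler):
# 1 - Levenshtein distance / length of the longer string, using base R only.
lev_sim <- function(x, y) {
  d <- drop(adist(x, y))              # edit distance from x to each element of y
  1 - d / pmax(nchar(x), nchar(y))
}

# Hypothetical helper: rows = list elements, columns = strings.
max_sim_matrix <- function(vec_list, strings, sim = lev_sim) {
  scores <- matrix(NA_real_, nrow = length(vec_list), ncol = length(strings),
                   dimnames = list(names(vec_list), strings))
  for (j in seq_along(vec_list)) {
    for (i in seq_along(strings)) {   # iterate over all strings, not length(ncol(df))
      scores[j, i] <- max(sim(strings[i], vec_list[[j]]))
    }
  }
  scores
}

scores <- max_sim_matrix(mylist, charvec)
```

Passing `sim = jarowinkler` (with RecordLinkage loaded) instead of the default should give Jaro-Winkler scores laid out like the wide result above, assuming the function accepts a single string and a vector as it does in the calls shown earlier.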
