I have a list of character vectors and a single character vector. In R, I would like to fuzzy match every string in each element of the list against each string in the character vector, and return the maximum similarity score for each (list element, string) combination. Below is a toy example:
a <- c("brown fox", "lazy dog", "white cat", "I don't know", "sunset", "never mind", "excuse me")
b <- c("very late", "do not cross", "sunrise", "long vacation")
c <- c("toy example", "green apple", "tall building", "good rating", "accommodating")
mylist <- list(a,b,c)
charvec <- c("brown dog", "lazy cat", "white dress", "I know that", "excuse me please", "tall person", "new building", "good example", "green with envy", "zebra crossing")
Now, I would like to fuzzy match each string in the first element of mylist with the first string in charvec and return the maximum of the 7 similarity scores. Likewise, I want to obtain this maximum score for every combination of an element of mylist and a string in charvec.
My attempt so far:
First, I use the strings in charvec as the column names of an empty data frame, with one row per element of mylist:
df <- setNames(data.frame(matrix(ncol = 10, nrow = 3)), c(charvec))
Then I calculate the maximum similarity score for each combination using jarowinkler() from the RecordLinkage package (I am open to suggestions if there is a better similarity measure for matching phrases):
for (j in seq_along(mylist)) {
  for (i in length(ncol(df))) {
    df[[i,j]] <- max(jarowinkler(names(df)[i], mylist[[j]]))
  }
}
Unfortunately, this fills only the first three cells of the first row; all the other values remain NA.
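The symptom is consistent with two details of the loop: `length(ncol(df))` evaluates to 1, so the inner loop runs only for `i = 1`, and `df[[i, j]]` addresses row `i`, column `j`, whereas the data frame has one row per list element and one column per string. A minimal sketch of the intended loop under that reading follows. It assumes the RecordLinkage package is installed and that `jarowinkler()` recycles a single string against a vector, as it appears to do in the attempt above.

```r
# Assumption: RecordLinkage is installed; jarowinkler() is the same
# function used in the attempt above.
library(RecordLinkage)

a <- c("brown fox", "lazy dog", "white cat", "I don't know", "sunset",
       "never mind", "excuse me")
b <- c("very late", "do not cross", "sunrise", "long vacation")
c <- c("toy example", "green apple", "tall building", "good rating",
       "accommodating")
mylist <- list(a, b, c)
charvec <- c("brown dog", "lazy cat", "white dress", "I know that",
             "excuse me please", "tall person", "new building",
             "good example", "green with envy", "zebra crossing")

# One row per element of mylist, one column per string in charvec.
df <- setNames(
  data.frame(matrix(NA_real_, nrow = length(mylist), ncol = length(charvec))),
  charvec
)

for (j in seq_along(mylist)) {      # rows: elements of mylist
  for (i in seq_along(charvec)) {   # columns: strings in charvec
    # Best score of charvec[i] against every string in mylist[[j]].
    df[j, i] <- max(jarowinkler(charvec[i], mylist[[j]]))
  }
}

df
```

Sizing the data frame from `length(mylist)` and `length(charvec)` rather than hard-coding 3 and 10 keeps the loop bounds and the table dimensions in step if the inputs change.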
Any help on this would be highly appreciated.