I am using agrepl() to filter a data.table by fuzzy matching a word. This works fine for me, using something like this:
library(data.table)
data <- as.data.table(iris)
pattern <- "setosh"
dt <- data[, lapply(.SD, function(x) agrepl(paste0("\\b(", pattern, ")\\b"), x, fixed = FALSE, ignore.case = TRUE))]
data <- data[rowSums(dt) > 0]
head(data)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 5.1 3.5 1.4 0.2 setosa
2: 4.9 3.0 1.4 0.2 setosa
3: 4.7 3.2 1.3 0.2 setosa
4: 4.6 3.1 1.5 0.2 setosa
5: 5.0 3.6 1.4 0.2 setosa
6: 5.4 3.9 1.7 0.4 setosa
You can see from the output that "setosh" has been fuzzy matched to "setosa" in this instance. What I want is a vector of the words that were matched to "setosh". It doesn't arise in this example, but if the data had included another category such as "seposh", that would have matched too, so the result would be the vector c("setosa", "seposh").
EDIT:
Thanks for the answer below - I can see how it's possible to isolate the values where the fuzzy matching occurs when just looking at a vector, but my issues are:
- I only want the string that has matched, not the entire value.
- I'm having trouble replicating this over my data.table.
For example, if I change a value to make the point clearer:
data <- as.data.table(iris)
data[Species == "versicolor", Species := "setosh species"] # changing a value so it would match
pattern <- "setosh"
dt <- data[, lapply(.SD, function(x) agrep(paste0("\\b(", pattern, ")\\b"), x, value = TRUE, fixed = FALSE, ignore.case = TRUE))]
Warning messages:
1: In as.data.table.list(jval) :
Item 1 is of size 0 but maximum size is 100, therefore recycled with 'NA'
2: In as.data.table.list(jval) :
Item 2 is of size 0 but maximum size is 100, therefore recycled with 'NA'
3: In as.data.table.list(jval) :
Item 3 is of size 0 but maximum size is 100, therefore recycled with 'NA'
4: In as.data.table.list(jval) :
Item 4 is of size 0 but maximum size is 100, therefore recycled with 'NA'
unique(dt)
Species
1: setosa
2: setosh species
You can see that I haven't got the result in a vector, and that the result includes the full value "setosh species" rather than just "setosh" (as the part that matched).
Hope that's more helpful!
If you define s <- sample(c("setosa", "seposh", "virginica", "versicolor"), 20, T) and then call agrep("setosh", s, value = T) you will end up with a vector of seposhes and setosas, i.e. both were fuzzy matched. Isn't that what you want? - gersht
Use aregexec("setosh", data$Species) and then process the matches to get the substrings. - January
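The aregexec() suggestion could be sketched roughly as below, using base R only. This is a minimal sketch under some assumptions: fuzzy_hits is a hypothetical helper name, the pattern is passed without the \\b( ... )\\b wrapper from the question, and max.distance is left at the default of 0.1. Exactly which substring counts as the match for a near miss depends on the underlying matcher, so no specific output is claimed here.

```r
# Hypothetical helper: return the unique substrings of `x` that
# approximately match `pattern` (not the whole values).
fuzzy_hits <- function(x, pattern, max.distance = 0.1) {
  x <- as.character(x)   # factors and numbers are compared as strings
  x <- x[!is.na(x)]      # drop missing values before matching
  # aregexec() returns match positions in the same shape as regexec()
  m <- aregexec(pattern, x, max.distance = max.distance, ignore.case = TRUE)
  # regmatches() extracts the matched substrings; elements with no match
  # contribute character(0), so unlist() silently drops them
  as.character(unique(unlist(regmatches(x, m))))
}

# On a plain vector
species <- c("setosa", "setosh species", "virginica")
hits <- fuzzy_hits(species, "setosh")

# Over every column: keep the result as a list, because each column
# can yield a different number of matches
per_column <- lapply(iris, fuzzy_hits, pattern = "setosh")
all_hits <- unique(unlist(per_column, use.names = FALSE))
```

Keeping the per-column results in a list, rather than returning them from the j expression of a data.table, avoids the recycling warnings shown in the question, which appear because the columns produce results of different lengths. A data.table is itself a list of columns, so lapply(data, fuzzy_hits, pattern = pattern) should work the same way on it.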