
I am using agrepl() to filter a data.table by fuzzy matching a word. This is working fine for me, using something like this:

 library(data.table)
 data <- as.data.table(iris)
 pattern <- "setosh"
 dt <- data[, lapply(.SD, function(x) agrepl(paste0("\\b(", pattern, ")\\b"), x, fixed = FALSE, ignore.case = TRUE))] 
 data <- data[rowSums(dt) > 0]
 head(data)

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          5.1         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
4:          4.6         3.1          1.5         0.2  setosa
5:          5.0         3.6          1.4         0.2  setosa
6:          5.4         3.9          1.7         0.4  setosa

Obviously you can see by looking at this that "setosh" will have been fuzzy matched to "setosa" in this instance. What I want is to get a vector of words that have been matched to "setosh". So although not relevant in this example, if it had included another category like "seposh", that would have matched too, so you'd have a vector that is c("setosa", "seposh").

EDIT:

Thanks for the answer below - I can see how it's possible to isolate the values where the fuzzy matching occurs when just looking at a vector, but my issues are:

  • I only want the string that has matched, not the entire value.
  • I'm having trouble replicating this over my data.table.

For example, if I change a value to make this point clearer:

data <- as.data.table(iris)
data[Species == "versicolor", Species := "setosh species"] # changing a value so it would match
pattern <- "setosh"

dt <- data[, lapply(.SD, function(x) agrep(paste0("\\b(", pattern, ")\\b"), x, value = TRUE, fixed = FALSE, ignore.case = TRUE))] 
Warning messages:
1: In as.data.table.list(jval) :
  Item 1 is of size 0 but maximum size is 100, therefore recycled with 'NA'
2: In as.data.table.list(jval) :
  Item 2 is of size 0 but maximum size is 100, therefore recycled with 'NA'
3: In as.data.table.list(jval) :
  Item 3 is of size 0 but maximum size is 100, therefore recycled with 'NA'
4: In as.data.table.list(jval) :
  Item 4 is of size 0 but maximum size is 100, therefore recycled with 'NA'

unique(dt)
          Species
1:         setosa
2: setosh species

You can see that I haven't got the result in a vector, and that the result includes the full value "setosh species" rather than just "setosh" (as the part that matched).

Hope that's more helpful!

I'm not quite sure what you're asking. If you create a character vector s <- sample(c("setosa", "seposh", "virginica", "versicolor"), 20, T) and then call agrep("setosh", s, value = T) you will end up with a vector of seposhes and setosas, i.e. both were fuzzy matched. Isn't that what you want? - gersht
Thanks gersht. This doesn't quite work because I only want the word that has been fuzzy matched not the full value. This is more apparent when there is a longer text field being searched. I've given an example in response to January's answer below. - Jaccar
Frankly, I still don't understand what you are trying to achieve with the data table. You say you want to filter rows, but you are operating on columns of the data table. So you loop over columns, and for each column you try to match the pattern to the column. But the first four columns contain numbers, so none of these match: this is why you are getting the four error messages (one per column). The fifth one finally works and you get what you are bound to get with data.table, a data.table. Stop using data.table or learn how to use it ;-) - January
You want to have a character vector, fine. Why don't you apply agrep or aregexec to the column of interest? For example using my method below? Is that not what you want? As in: aregexec("setosh", data$Species) and then process the matches to get the substrings. - January
I'm not trying to filter rows - I never say that. I don't know how to be clearer. I want a vector of words that occur anywhere in the data.table that have been fuzzy matched. I fully understand why the errors are there - I am not asking that question, I just left them in so people didn't think I was giving incomplete information. The reason I am using the approach I am is to search the entire dataset, not just a column, because unlike this example, in my data the match could occur anywhere. - Jaccar

2 Answers

0
votes

Just use the output of agrep as an index for a character vector you are grepping.

vec <- c("setosh", "setosz", "sethosz", "etosh", "ethos", "seosh")
idx <- agrep("setosh", vec) # agrepl works as well (it returns a logical index)
vec[idx]

result:

[1] "setosh" "setosz" "etosh"  "seosh" 

EDIT: OK, but what if we want as result only the matched string? Not the whole thing, but just the part that was matched? Then we are in for a bit of fun, because grep/grepl and agrep/agrepl don't work that way. Luckily, there is the aregexec function.

vec <- c("setosh is my name", "setosz", "sethosz who", 
         "what etosh", "ethos", "seosh", "funk setos brother")
matches <- aregexec("setosh", vec)

matches is now a list with one element for each element of vec. Each element holds a single number, the start of the match, with an attribute match.length; where nothing matched, the start is -1:

> matches[[1]]
[1] 1
attr(,"match.length")
[1] 6

We can use these numbers to extract the matched strings.

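library(purrr)
# start position of each match (-1 where nothing matched)
starts <- unlist(matches)
# end position = start + match length - 1
ends <- starts - 1 + map_int(matches, ~ attr(., "match.length"))
res <- substr(vec, starts, ends)
# elements without a match become NA
res[ starts < 0 ] <- NA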

FINAL EDIT:

I am not sure what this business with grepping all columns of iris is about, but to get a vector of matched substrings in the Species column I would do the following:

vec <- data$Species
matches <- aregexec("setosh", vec)
starts <- unlist(matches)
ends <- starts - 1 + map_int(matches, ~ attr(., "match.length"))
res <- substr(vec, starts, ends)
res[ starts < 0 ] <- NA

With res, we can do Stuff. We can remove the NA's and take a look at unique values:

res <- res[ !is.na(res) ]
unique(res)

Result:

[1] "setosa" "setosh"

FINAL FINAL EDIT: It appears that the example chosen by the OP was not exactly what they had in mind. Thus, we are going to make another example.

vec <- c("setosh is my name", "setosz", "sethosz who", 
         "what etosh", "ethos", "seosh", "funk setos brother")
data <- data.table(matrix(sample(vec, 100, replace = TRUE), ncol = 5))

data is now a data.table and in each column there are numerous things to match. If we only want to know what kind of matches are there, and we don't need to know in which columns and rows these matches were found, and we want to search through all columns, then we don't need it to be a two-dimensional object. Better make it a vector:

vec <- unlist(data)

OK, but if all that you want is to get the unique matches, we can simplify it even further:

vec <- unique(vec)

Now we have a character vector. If you use aregexec to find your matches and extract them as described above, you will end up with a character vector in which

  • each value is the substring that was actually matched, not the whole string
  • only matched substrings remain once the NAs are dropped
  • the values are unique, provided you apply unique() once more to the result (different strings can yield the same substring)

The output will be:

[1] "setosh" "setosz" "setos " "seosh"  " etosh"
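
The steps of this last edit can be put together roughly as follows. This is only a sketch: extract_fuzzy is a hypothetical helper name (it is not part of any package), the seed is arbitrary, and a plain data.frame stands in for the data.table, on the assumption that unlist() treats both the same way. It uses vapply instead of purrr so that it runs in base R.

```r
# Hypothetical helper (not from any package): returns the fuzzy-matched
# substring for each element of x, or NA where nothing matched.
extract_fuzzy <- function(pattern, x, ...) {
  x <- as.character(x)
  m <- aregexec(pattern, x, ...)
  # First entry of each element is the match start (-1 means no match)
  starts <- vapply(m, function(p) p[1], numeric(1))
  lens <- vapply(m, function(p) attr(p, "match.length")[1], numeric(1))
  res <- substr(x, starts, starts + lens - 1)
  res[starts < 0] <- NA_character_
  res
}

vec <- c("setosh is my name", "setosz", "sethosz who",
         "what etosh", "ethos", "seosh", "funk setos brother")
set.seed(1)  # arbitrary seed, only to make the sample reproducible
# A plain data.frame stands in for the data.table here (assumption)
data <- as.data.frame(matrix(sample(vec, 100, replace = TRUE), ncol = 5),
                      stringsAsFactors = FALSE)

# Flatten all columns, deduplicate, extract, drop NAs, deduplicate again
all_values <- unique(unlist(data, use.names = FALSE))
matched <- extract_fuzzy("setosh", all_values)
matched <- unique(matched[!is.na(matched)])
matched
```

Extra arguments such as max.distance are passed through to aregexec via the dots, so the fuzziness can be tuned without changing the helper.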
0
votes

If I understand you correctly, you really just want to extract a fuzzy match from strings. It sounds like there is also some issue with doing this on a data.table and returning a vector, but I think that becomes much simpler once you've successfully extracted the matching substrings.

I'll use the following toy data:

library(data.table)
set.seed(123)
data <-
    as.data.table(matrix(sample(c("setosa", "blah seposa", "blah setosh blah",
                                  "bleh versicolor", "bluh sep", "bloh"),
                                15, TRUE),
                         ncol = 3))

Which returns this data.table:

                 V1               V2               V3
1: blah setosh blah             bloh             bloh
2:             bloh blah setosh blah           setosa
3: blah setosh blah         bluh sep      blah seposa
4:      blah seposa  bleh versicolor blah setosh blah
5:      blah seposa             bloh         bluh sep

January has already pointed out that you can use aregexec to get the position of a fuzzy match in a character string. You can extract the match by passing aregexec's output into regmatches. We can do this for each column of our data using lapply:

data[, lapply(.SD, function(colu) {
    regmatches(colu, aregexec("setosh", colu, max.distance = 2))
})]

This will return a data.table of list columns, with each cell containing either the extracted fuzzy-matched substring or, where there was no match, an empty character vector (printed as blank). Depending on the results you get with your real data, you may need to adjust max.distance to tweak the fuzziness of the match:

       V1     V2     V3
1: setosh              
2:        setosh setosa
3: setosh        seposa
4: seposa        setosh
5: seposa
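
To get from that per-column result to the single vector asked for in the question, the matches can be flattened and deduplicated. The following is a minimal sketch, assuming a small hand-written table rather than the sampled one above; it uses a plain data.frame so that it runs without extra packages, on the assumption that lapply() over the columns behaves the same for a data.table.

```r
# Small hand-written stand-in for the toy data (assumption, not the sampled table)
data <- data.frame(
  V1 = c("blah setosh blah", "bloh", "blah seposa"),
  V2 = c("bloh", "blah setosh blah", "bleh versicolor"),
  V3 = c("setosa", "blah seposa", "bloh"),
  stringsAsFactors = FALSE
)

# One list of matches per column; non-matches are character(0)
per_column <- lapply(data, function(colu) {
  regmatches(colu, aregexec("setosh", colu, max.distance = 2))
})

# unlist() drops the empty elements, unique() removes repeats
matched <- unique(unlist(per_column, use.names = FALSE))
matched
```

Because unlist() silently drops the zero-length elements, no separate filtering step for non-matches is needed here.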