4
votes

Ok, so I posted a question a while back about writing an R function to accelerate string matching in large text files. I had my eyes opened to 'data.table', and my question was answered perfectly.

This is the link to that thread which includes all of the data and details:

Accelerate performance and speed of string match in R

But now I am running into another problem. Once in a while, the VIN#s submitted in the 'carFile' file differ by one or two characters from those in the 'vinDB' file, due to human error when people fill out their car info at the DMV. Is there a way to edit the

dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]

line of that code (provided by @BrodieG in the above link) to allow for a recognition of VIN#s that differ by one or two characters?

I apologize if this is an easy correction. I am just overwhelmed by the power of the 'data.table' package in R and would love to learn as much as I can about its utility, and the knowledgeable members of this forum have been absolutely pivotal to me.

**EDIT:**

So I have been playing around with the 'lapply' and 'agrep' functions as suggested, and I must be doing something wrong:

I tried replacing this line:

dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]

with this:

dt <- dt[lapply(vin.vins, function(x) agrep(x,car.vins, max.distance=2)), list(NumTimesFound=.N), vin.names, allow.cartesian=TRUE]

But got the following error:

Error in `[.data.table`(dt, lapply(vin.vins, function(x) agrep(x,car.vins,  : 
x.'vin.vins' is a character column being joined to i.'V1' which is type 'integer'. 
Character columns must join to factor or character columns.

But they are both type 'chr'. Does anyone know why I am getting this error? And am I thinking about this the right way, ie: am I using lapply correctly here?
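(For reference, the type mismatch likely comes from agrep's default return value: without `value = TRUE`, agrep returns integer positions rather than the matched strings, so the lapply produces a list of integers that data.table then tries to join against a character key. A minimal illustration with made-up strings:)

```r
vins <- c("ABCDEF", "ABCDEG", "ZZZZZZ")

# Default: integer positions of the approximate matches
agrep("ABCDEE", vins, max.distance = 1)               # 1 2

# With value = TRUE: the matching strings themselves
agrep("ABCDEE", vins, max.distance = 1, value = TRUE) # "ABCDEF" "ABCDEG"
```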

Thanks!

2
This isn't going to be easy. I suggest you match what will match, then, for whatever is left, compute the Levenshtein distance of the unmatched entries (linked package not available for R 3.0.2, just there for reference) against the VINs of the cars that were not matched. If you know for sure that the errors are in a particular area of the VIN string then that becomes easier, but from your description it doesn't sound that way. – BrodieG
Thanks @BrodieG. Is there perhaps a way to incorporate the 'agrep' function into the code? – Tom A
Yes, but the expensive operations won't be in data.table. You need to lapply agrep to every VIN you're trying to match. It can be done, but it's not trivial, and it will probably be slow depending on how many bad VINs you have and how many candidate VINs there are. Certainly not a minor edit, at least not that I can think of. – BrodieG
I see. Thank you very much, your insight into this is very helpful. Is there a link you recommend that can be used as a tutorial for the 'data.table' operations (aside from its package description page)? – Tom A
You can look at the [data.table FAQ](datatable.r-forge.r-project.org/datatable-faq.pdf) as well as look through highly rated questions/answers on the data.table tag on SO. – BrodieG
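A rough sketch of the two-phase idea from the comments (exact matches first, then base R's `adist` Levenshtein distance for the leftovers; the data and the 2-edit threshold below are made up for illustration):

```r
vin.db    <- c("ABCDEF", "GHIJKL", "MNOPQR")
submitted <- c("ABCDEF", "GHIJKK", "XXXXXX")  # second entry has one typo

exact    <- submitted %in% vin.db             # phase 1: exact matching
leftover <- submitted[!exact]

d    <- adist(leftover, vin.db)               # phase 2: Levenshtein distance matrix
best <- apply(d, 1, which.min)                # closest DB entry per leftover
ok   <- d[cbind(seq_along(best), best)] <= 2  # accept if within 2 edits
recovered <- vin.db[best[ok]]                 # "GHIJKL"; "XXXXXX" stays unmatched
```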

2 Answers

5
votes

I finally got it.

The agrep function has a 'value' option that needs to be changed from its default of FALSE to TRUE:

dt <- dt[lapply(car.vins, agrep, x = vin.vins, max.distance = c(cost=2, all=2), value = TRUE)
         , .(NumTimesFound = .N)
         , by = vin.names]

Note: the max.distance parameters can be altered based on Levenshtein distance, substitutions, deletions, etc. 'agrep' is a fascinating function!
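For instance, bounding both the total edit cost and the number of transformations at 2 accepts strings up to two edits away (made-up 16-character strings below; `list(cost = 2, all = 2)` is the documented list form of the same bounds used above):

```r
vins <- c("ABCDEFGHIJKLMNOP",   # exact
          "ABCDEFGHIJKLMNOQ",   # one substitution away
          "ZZZZZZZZZZZZZZZZ")   # nowhere close

agrep("ABCDEFGHIJKLMNOP", vins,
      max.distance = list(cost = 2, all = 2), value = TRUE)
# "ABCDEFGHIJKLMNOP" "ABCDEFGHIJKLMNOQ"
```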

Thanks again for all the help!

1
votes

Thanks for the answer to your own question on partial matching. Here's the complete code that I got to work on my own machine (including the reproducible example provided by BrodieG in your linked post). I had to change lapply to sapply.

library(data.table)
set.seed(1)
makes <- c("Toyota", "Ford", "GM", "Chrysler")
years <- 1995:2014
cars <- paste(sample(makes, 500, rep=T), sample(years, 500, rep=T))
vins <- unlist(replicate(500, paste0(sample(LETTERS, 16), collapse="")))
vinDB <- data.frame(c(cars, vins)[order(rep(1:500, 2))])               
carFile <- data.frame(c(rep("junk", 1000), sample(vins, 1000, rep=T), rep("junk", 2000))[order(rep(1:1000, 4))])

vin.names <- vinDB[seq(1, nrow(vinDB), 2), ]
vin.vins <- vinDB[seq(2, nrow(vinDB), 2), ]
car.vins <- carFile[seq(2, nrow(carFile), 4), ]

# Add some errors to car.vins strings
s <- sample(length(car.vins),100)
car.vins.err <- as.character(car.vins)
car.vins.err[s] <- gsub("A","B",car.vins.err[s])
s <- sample(length(car.vins.err),100)
car.vins.err[s] <- gsub("E","F",car.vins.err[s])
s <- sample(length(car.vins.err),100)
car.vins.err[s] <- gsub("I","J",car.vins.err[s])
car.vins.err <- as.factor(car.vins.err)

dt <- data.table(vin.names, vin.vins, key="vin.vins")
dt1 <- dt[J(car.vins), list(NumTimesFound=.N), keyby=vin.names]
dt1.err <- dt[J(car.vins.err), list(NumTimesFound=.N), keyby=vin.names]
dt2 <- dt[sapply(car.vins, agrep, x=vin.vins, max.distance=c(cost=2, all=2), value=TRUE), list(NumTimesFound=.N), keyby=vin.names]
dt2.err <- dt[sapply(car.vins.err, agrep, x=vin.vins, max.distance=c(cost=2, all=2), value=TRUE), list(NumTimesFound=.N), keyby="vin.names"]

dt1[dt1.err][dt2.err]