Ok, so I posted a question a while back concerning writing an R function to accelerate string matching of large text files. I had my eyes opened to 'data.table' and my question was answered perfectly.
This is the link to that thread which includes all of the data and details:
Accelerate performance and speed of string match in R
But now I am running into another problem. Once in a while, the submitted VIN#s (in the 'vinDB' file) differ by one or two characters from those in the 'carFile' file due to human error when people fill out their car info at the DMV. Is there a way to edit the
dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]
line of that code (provided by @BrodieG in the above link) so that it also recognizes VIN#s that differ by one or two characters?
I apologize if this is an easy correction. I am just overwhelmed by the power of the 'data.table' package in R and would love to learn as much as I can about its utility, and the knowledgeable members of this forum have been absolutely pivotal to me.
**EDIT:**
So I have been playing around with the 'lapply' and 'agrep' functions as suggested, and I must be doing something wrong:
I tried replacing this line:
dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]
with this:
dt <- dt[lapply(vin.vins, function(x) agrep(x,car.vins, max.distance=2)), list(NumTimesFound=.N), vin.names, allow.cartesian=TRUE]
But got the following error:
Error in `[.data.table`(dt, lapply(vin.vins, function(x) agrep(x,car.vins, :
x.'vin.vins' is a character column being joined to i.'V1' which is type 'integer'.
Character columns must join to factor or character columns.
But they are both type 'chr'. Does anyone know why I am getting this error? And am I thinking about this the right way, i.e., am I using lapply correctly here?
Thanks!
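For reference, a minimal sketch of what probably triggers that error, using made-up VINs (the vector `submitted` is a hypothetical stand-in for the 'vin.vins' column). By default 'agrep' returns integer positions rather than the matching strings, so the 'lapply' call produces a list of integer vectors. As far as I can tell, 'data.table' treats a list passed as `i` as a table to join on, which would explain the integer `V1` column in the message:

```r
# Toy data, made up for illustration; not the real vinDB / carFile contents.
car.vins  <- c("1HGCM82633A004352", "1HGCM82633A004353", "JH4KA7561PC008269")
submitted <- c("1HGCM82633A004352", "JH4KA7561PC008260")

# Default behaviour: agrep() returns integer positions within car.vins.
idx <- lapply(submitted, function(x) agrep(x, car.vins, max.distance = 2))
str(idx)

# With value = TRUE it returns the matching strings instead.
vals <- lapply(submitted, function(x) agrep(x, car.vins, max.distance = 2, value = TRUE))
str(vals)
```

So both columns being 'chr' does not help here: the object handed to `dt[...]` is the integer-valued list, not a character vector.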
... data.table. You need to lapply agrep to every VIN you're trying to match. It can be done, but it's not trivial, and it will probably be slow depending on how many bad VINs you have and how many candidate VINs there are. Certainly not a minor edit, at least not that I can think of. – BrodieG
... data.table tag on SO. – BrodieG
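A rough sketch of the "apply agrep to every VIN" idea from the comment above, not a tested solution for the real files. The table contents, the names "Alice" and "Bob", and the helper `count_fuzzy` are all hypothetical; only the column names `vin.names` / `vin.vins` and the vector `car.vins` come from the question. Note that 'agrep' looks for approximate matches of the pattern within each candidate string, and that this runs one 'agrep' call per submitted VIN, so it will likely be slow on large inputs:

```r
library(data.table)

# Hypothetical toy data standing in for vinDB (dt) and carFile (car.vins).
dt <- data.table(
  vin.names = c("Alice", "Bob"),
  vin.vins  = c("1HGCM82633A004352", "JH4KA7561PC008269")
)
car.vins <- c("1HGCM82633A004352",  # identical to Alice's VIN
              "1HGCM82633A004353",  # last character differs from Alice's VIN
              "JH4KA7561PC008260",  # last character differs from Bob's VIN
              "WDBRF40J53F123456")  # unrelated VIN

# Hypothetical helper: number of candidates approximately matching one VIN.
count_fuzzy <- function(vin, candidates, max.dist = 2) {
  length(agrep(vin, candidates, max.distance = max.dist))
}

# One agrep() call per submitted VIN, then summed per name.
res <- dt[, list(NumTimesFound = sum(vapply(vin.vins, count_fuzzy, integer(1),
                                            candidates = car.vins))),
          by = vin.names]
print(res)
```

This gives up the keyed join that made the original line fast, which is presumably why the comment calls it "not a minor edit".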