I want to perform partial string matching between two character vectors from different data frames using match(). The position of the matched value has to be preserved because it is later used to reference the neighbouring columns, and I found that match() works best for that.

I can do exact string matching:

## exact string matching
name <-  c("AAB", "AAC", "AAD","AAE")
meaning1 <- c('circular','parallel','perpendicular','none') 
meaning2 <- c('surface','longitudinal','transverse','not detected') 
meaning3 <- c('category 1','category 1','category 1','category 2') 
referenceData <- data.frame(name, meaning1, meaning2, meaning3, stringsAsFactors = FALSE)
name2 <- c("AAB", "AAC", "AAD","AAE")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> referenceData
  name      meaning1     meaning2   meaning3
1  AAB      circular      surface category 1
2  AAC      parallel longitudinal category 1
3  AAD perpendicular   transverse category 1
4  AAE          none not detected category 2
> myData 
  name2
1   AAB
2   AAC
3   AAD
4   AAE

matched <- match(myData[ , 'name2'],  referenceData[ ,'name'])
> matched
[1] 1 2 3 4

myData$newCol <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
> myData
  name2        newCol      newCol2
1   AAB      circular      surface
2   AAC      parallel longitudinal
3   AAD perpendicular   transverse
4   AAE          none not detected

However, the real data has a small complication: the names can only be partially matched, so the method above won't work:

name2 <- c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday", "AAE Friday")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> myData 
                    name2
1 AAB Monday and Thursday
2            AAC Saturday
3           AAD Wednesday
4              AAE Friday

matched <- match(myData[ , 'name2'],  referenceData[ ,'name'])
> matched
[1] NA NA NA NA

myData$newCol <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
> myData
                    name2 newCol newCol2
1 AAB Monday and Thursday   <NA>    <NA>
2            AAC Saturday   <NA>    <NA>
3           AAD Wednesday   <NA>    <NA>
4              AAE Friday   <NA>    <NA>

Can match() be combined with regex somehow to do the partial matching?

EDIT: The reproducible example was oversimplified. A more representative example would be:

name2 <- c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday", "AAE Friday","AAB Monday and Thursday","AAB Monday and Thursday")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> myData
                    name2
1 AAB Monday and Thursday
2            AAC Saturday
3           AAD Wednesday
4              AAE Friday
5 AAB Monday and Thursday
6 AAB Monday and Thursday
1 Answer

You could use sapply and grep like this:

sapply(referenceData[, 'name'], grep, myData[, 'name2'])

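Note that I inverted the order of the arguments: "AAB" as a regexp matches "AAB Monday and Thursday", but not vice versa. The result gives, for each reference name, the positions in myData$name2 that contain it, so it is indexed the other way round from what match() returns.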
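If you need a vector shaped like the output of match() (one reference row per row of myData), the grep results can be turned around with a small loop. This is only a sketch under a few assumptions: the data are the ones from the question's edit, the reference names are treated as literal substrings (fixed = TRUE), and when several reference names hit the same row the first one wins.

```r
# Data from the question (edited example)
referenceData <- data.frame(
  name     = c("AAB", "AAC", "AAD", "AAE"),
  meaning1 = c("circular", "parallel", "perpendicular", "none"),
  stringsAsFactors = FALSE
)
myData <- data.frame(
  name2 = c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday",
            "AAE Friday", "AAB Monday and Thursday", "AAB Monday and Thursday"),
  stringsAsFactors = FALSE
)

# Start with NA everywhere, as match() does for unmatched values
matched <- rep(NA_integer_, nrow(myData))

for (i in seq_along(referenceData$name)) {
  # rows of myData whose name2 contains the i-th reference name
  hits <- grep(referenceData$name[i], myData$name2, fixed = TRUE)
  # only fill rows that have no match yet, so the first reference name wins
  hits <- hits[is.na(matched[hits])]
  matched[hits] <- i
}

matched
# [1] 1 2 3 4 1 1

myData$newCol <- referenceData$meaning1[matched]
```

Drop fixed = TRUE if the reference names are meant to be regular expressions, or use paste0("^", referenceData$name[i]) as the pattern (without fixed = TRUE) if the name must appear at the start of the string.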

Edit: given your edit, if you know you are always matching just the first three characters, you might try this simple approach (no partial matching necessary):

first3 <- substr(myData[ , 'name2'],  1, 3)
match(first3,  referenceData[ ,'name'])
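If the code is not always exactly three characters long, a variation on the same idea is to cut each string at the first space and match on that instead. This is a sketch that assumes the code is always the first space-separated token of name2, which holds for the example data but may not hold for your real data; the variable name key is just for illustration.

```r
# Data from the question (edited example)
referenceData <- data.frame(
  name     = c("AAB", "AAC", "AAD", "AAE"),
  meaning1 = c("circular", "parallel", "perpendicular", "none"),
  meaning2 = c("surface", "longitudinal", "transverse", "not detected"),
  stringsAsFactors = FALSE
)
myData <- data.frame(
  name2 = c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday",
            "AAE Friday", "AAB Monday and Thursday", "AAB Monday and Thursday"),
  stringsAsFactors = FALSE
)

# Keep everything before the first space (assumed to be the code)
key <- sub(" .*$", "", myData$name2)

# Exact matching on the extracted key; positions refer to referenceData rows
matched <- match(key, referenceData$name)
matched
# [1] 1 2 3 4 1 1

# Same lookup step as in the question
myData$newCol  <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
```

Because this goes back to plain match(), unmatched rows come out as NA and duplicates in myData are handled without any extra work.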