I want to use match() to perform partial string matching between two character vectors from different data frames. The position of the matched value has to be preserved, because it is later used to reference the neighbouring columns, and I found that match() works best for that.
I can do exact string matching:
## exact string matching
name <- c("AAB", "AAC", "AAD","AAE")
meaning1 <- c('circular','parallel','perpendicular','none')
meaning2 <- c('surface','longitudinal','transverse','not detected')
meaning3 <- c('category 1','category 1','category 1','category 2')
referenceData <- data.frame(name, meaning1, meaning2, meaning3, stringsAsFactors = FALSE)
name2 <- c("AAB", "AAC", "AAD","AAE")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> referenceData
  name      meaning1     meaning2   meaning3
1  AAB      circular      surface category 1
2  AAC      parallel longitudinal category 1
3  AAD perpendicular   transverse category 1
4  AAE          none not detected category 2
> myData
  name2
1   AAB
2   AAC
3   AAD
4   AAE
matched <- match(myData[ , 'name2'], referenceData[ ,'name'])
> matched
[1] 1 2 3 4
myData$newCol <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
> myData
  name2        newCol      newCol2
1   AAB      circular      surface
2   AAC      parallel longitudinal
3   AAD perpendicular   transverse
4   AAE          none not detected
However, the real data has a small complication: the values can only be partially matched, so the method above does not work:
name2 <- c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday", "AAE Friday")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> myData
                    name2
1 AAB Monday and Thursday
2            AAC Saturday
3           AAD Wednesday
4              AAE Friday
matched <- match(myData[ , 'name2'], referenceData[ ,'name'])
> matched
[1] NA NA NA NA
myData$newCol <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
> myData
                    name2 newCol newCol2
1 AAB Monday and Thursday   <NA>    <NA>
2            AAC Saturday   <NA>    <NA>
3           AAD Wednesday   <NA>    <NA>
4              AAE Friday   <NA>    <NA>
Can match() be combined with regex somehow to do the partial matching?
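For illustration, here is a minimal sketch of one way this might be done, assuming the code to look up is always the part of each string before the first space (that assumption is mine and may not hold for the real data). The regex strips everything from the first space onwards with sub(), and the cleaned key is then passed to match() as before; the variable name `key` is only illustrative.

```r
referenceData <- data.frame(
  name     = c("AAB", "AAC", "AAD", "AAE"),
  meaning1 = c("circular", "parallel", "perpendicular", "none"),
  meaning2 = c("surface", "longitudinal", "transverse", "not detected"),
  stringsAsFactors = FALSE
)
myData <- data.frame(
  name2 = c("AAB Monday and Thursday", "AAC Saturday",
            "AAD Wednesday", "AAE Friday"),
  stringsAsFactors = FALSE
)

# Assumption: the lookup code is everything before the first space.
# sub() removes the first space and all text after it.
key <- sub(" .*$", "", myData$name2)

# match() now compares whole strings again, so positions are preserved.
matched <- match(key, referenceData$name)
matched
# [1] 1 2 3 4

myData$newCol  <- referenceData$meaning1[matched]
myData$newCol2 <- referenceData$meaning2[matched]
```

Because the regex is applied before match() rather than inside it, the indexing of the neighbouring columns stays exactly as in the exact-matching version.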
EDIT: The reproducible example was oversimplified. A more representative data set would be:
name2 <- c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday", "AAE Friday","AAB Monday and Thursday","AAB Monday and Thursday")
myData <- data.frame(name2, stringsAsFactors = FALSE)
> myData
                    name2
1 AAB Monday and Thursday
2            AAC Saturday
3           AAD Wednesday
4              AAE Friday
5 AAB Monday and Thursday
6 AAB Monday and Thursday
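With repeated values like these, a possible approach is to search for each reference name inside each string and return the position of the first hit, which mimics what match() returns. The sketch below is only a rough idea: `partial_match()` is a hypothetical helper (not a base R function), and it assumes the reference names are plain substrings, hence `fixed = TRUE` in grepl().

```r
referenceData <- data.frame(
  name     = c("AAB", "AAC", "AAD", "AAE"),
  meaning1 = c("circular", "parallel", "perpendicular", "none"),
  stringsAsFactors = FALSE
)
myData <- data.frame(
  name2 = c("AAB Monday and Thursday", "AAC Saturday", "AAD Wednesday",
            "AAE Friday", "AAB Monday and Thursday", "AAB Monday and Thursday"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each element of x, return the index of the first
# entry of `table` that occurs somewhere inside it, or NA if none does.
partial_match <- function(x, table) {
  vapply(x, function(s) {
    hits <- which(vapply(table, grepl, logical(1), x = s, fixed = TRUE,
                         USE.NAMES = FALSE))
    if (length(hits) > 0) hits[1] else NA_integer_
  }, integer(1), USE.NAMES = FALSE)
}

matched <- partial_match(myData$name2, referenceData$name)
matched
# [1] 1 2 3 4 1 1

myData$newCol <- referenceData$meaning1[matched]
```

Duplicated rows simply receive the same index, and strings containing none of the reference names get NA, as with match(). The nested loops make this slower than match() on large data.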