Extract category Information from columns, based on similarity to patterns

Question

Suppose I have the following data frame:

table<-data.frame(col1=c('4.3 automatic version 1', '3.2 manual version 2', 
                         '2.3 version 1', '9.0 version 6'),
                  col2=c('ite auto version 2', 'ite version 3', '2.5 manual version 2',
                         'vserion auto 5'))

                     col1                 col2
1 4.3 automatic version 1   ite auto version 2
2    3.2 manual version 2        ite version 3
3           2.3 version 1 2.5 manual version 2
4           9.0 version 6       vserion auto 5

I want to add a column with values of only 'automatic' or 'manual', based on the contents of columns 1 and 2. If col1 or col2 includes some word like either 'auto' or 'automatic' then col3 would be 'automatic'. If col1 or col2 is like 'manual' then col3 would be 'manual', like this:

                     col1                 col2      col3
1 4.3 automatic version 1   ite auto version 2 automatic
2    3.2 manual version 2        ite version 3    manual
3           2.3 version 1 2.5 manual version 2    manual
4           9.0 version 6       vserion auto 5 automatic

is column 3 going to be binary (only auto or manual), or more open ended (auto, manual, neither, both, other...) — Mark Miller

Mark Miller Mark Miller · Accepted Answer · 2017-06-06T23:13:14

I like to keep things flexible. I also like to keep intermediate data structures. So there's almost certainly something shorter and more memory efficient than this.

Note that I'm using regular expressions to search with flexibility (based on your use of the words similarity and like). To demonstrate the effects, I have made some changes to your input data. I have also added some edge cases.

An alternative approach might use the tm text mining package. That would give you even more flexibility that this grep solution, at the cost of some additional complexity.

my.table <-
  data.frame(
    col1 = c(
      '4.3 automatic version 1',
      '3.2 manual version 2',
      '2.3 version 1',
      '9.0 version 6',
      'maybe standard',
      'or neither'
    ),
    col2 = c(
      'ite automated version 2',
      'ite version 3',
      '2.5 manual version 2',
      'vserion auto 5',
      'maybe automatic',
      'for reals'
    )
  )

search.terms <- c("auto|automated|automatic", "manual|standard")
names(search.terms) <- c("automatic", "manual")

term.test <- function(term)  {
  term.pres <- apply(
    my.table,
    MARGIN = 1,
    FUN = function(one.cell) {
      any(grep(pattern = term, x = one.cell))
    }
  )
  return(term.pres)
}

term.presence <- lapply(X = search.terms, term.test)

term.presence <- do.call(cbind.data.frame, term.presence)

names(term.presence) <- names(search.terms)

as.labels <- lapply(names(search.terms), function(one.term) {
  tempcol <- tempflag <- term.presence[, one.term]
  tempcol <- rep('', length(tempflag))
  tempcol[tempflag] <- one.term
  return(tempcol)
})

as.labels <- do.call(cbind.data.frame, as.labels)
names(as.labels) <- search.terms

labels.concat <-
  apply(
    as.labels,
    MARGIN = 1,
    FUN = function(one.row) {
      temp <- unique(sort(one.row))
      temp <- temp[nchar(temp) > 0]
      temp <- paste(temp, sep = ", ", collapse = "; ")
      return(temp)
    }
  )

my.table$col3 <- labels.concat

print(my.table)

which gives

                     col1                    col2              col3
1 4.3 automatic version 1 ite automated version 2         automatic
2    3.2 manual version 2           ite version 3            manual
3           2.3 version 1    2.5 manual version 2            manual
4           9.0 version 6          vserion auto 5         automatic
5          maybe standard         maybe automatic automatic; manual
6              or neither               for reals                  
>

Extract category Information from columns, based on similarity to patterns

1 Answers