matching highest ranking word with text in dataframe column R

Question

I have two data frames, df1:

df1 <- c("A large bunch of purple grapes", "large green potato sack", "small red tomatoes", "yellow and black bananas")
df1 <- data.frame(df1)

df2:

Word <- c("green", "purple", "grapes", "small", "sack", "yellow", "bananas", "large")

Rank <- c(20,18,22,16,15,17,6,12)

df2 <- data.frame(Word,Rank)

df1:

ID      Sentence  
 1      A large bunch of purple grapes  
 2      large green potato sack 
 3      small red tomatoes  
 4      yellow and black bananas

df2:

ID      Word      Rank
 1      green      20
 2      purple     18
 3      grapes     22
 4      small      16
 5      sack       15
 6      yellow     17
 7      bananas    6
 8      large      12

I want to match the words in df2 against the words in the "Sentence" column of df1, then add a new column to df1 containing the highest-ranking matched word from df2. Something like this:

df1:

ID     Sentence                         Word
 1     A large bunch of purple grapes   grapes
 2     large green potato sack          green
 3     small red tomatoes               small
 4     yellow and black bananas         yellow

I initially used the following code to match words, but of course this creates a column containing all of the matched words rather than only the highest-ranking one:

x <- sapply(df2$Word, function(x) grepl(tolower(x), tolower(df1$Sentence)))

df1$top_match <- apply(x, 1, function(i) paste0(names(i)[i], collapse = " "))

What if a sentence does not contain any word that matches df2? Do you want to just return NA? In this case all sentences have a match, but I want to make sure you are not looking for something more general. — acylam
Also, can you provide your data either as dput(df1) and dput(df2), or as the code you used to generate them? — acylam

acylam · Accepted Answer · 2017-10-11T14:44:38

Here's a tidyverse + stringr solution:

library(tidyverse)
library(stringr)

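df1$Sentence %>%
  # split each sentence into one column per word
  str_split_fixed(" ", Inf) %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  # tag each row with the row name of its source sentence
  cbind(ID = rownames(df1), .) %>%
  # reshape to long format: one row per (ID, word) pair
  gather(word_count, Word, -ID) %>%
  # keep only the words that appear in df2, bringing in their Rank
  inner_join(df2, by = "Word") %>%
  # within each sentence, keep the highest-ranked word
  group_by(ID) %>%
  filter(Rank == max(Rank)) %>%
  select(ID, Word) %>%
  # join back to df1 so sentences without a match get NA
  right_join(rownames_to_column(df1, "ID"), by = "ID") %>%
  select(ID, Sentence, Word)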

Result:

# A tibble: 4 x 3
# Groups:   ID [4]
     ID                       Sentence   Word
  <chr>                          <chr>  <chr>
1     1 A large bunch of purple grapes grapes
2     2        large green potato sack  green
3     3             small red tomatoes  small
4     4       yellow and black bananas yellow

Note:

You can ignore the warning about coercing ID from factor to character. I also modified your datasets to give df1 a proper column name (Sentence) and to stop character columns from being automatically converted to factors.
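
If you would rather keep the grepl() matrix from the question, one possible adjustment is to pick the highest-ranked match per row instead of pasting every match together. This is only a sketch: it assumes df1 has a Sentence column as in the Data section, and it keeps the substring matching of the original code, so "large" would also match inside a longer word such as "enlarged".

```r
# Same data as in the Data section of this answer
df1 <- data.frame(
  Sentence = c("A large bunch of purple grapes", "large green potato sack",
               "small red tomatoes", "yellow and black bananas"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Word = c("green", "purple", "grapes", "small", "sack", "yellow", "bananas", "large"),
  Rank = c(20, 18, 22, 16, 15, 17, 6, 12),
  stringsAsFactors = FALSE
)

# Logical matrix as in the question: one row per sentence, one column per word
x <- sapply(df2$Word, function(w) grepl(tolower(w), tolower(df1$Sentence)))

# For each row, keep only the matched word with the highest Rank;
# rows with no match at all get NA
df1$top_match <- apply(x, 1, function(i) {
  if (!any(i)) return(NA_character_)
  df2$Word[i][which.max(df2$Rank[i])]
})
```

If two matched words share the top rank, which.max() keeps the first one in df2 order.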

Data:

df1 <- c("A large bunch of purple grapes", "large green potato sack", "small red tomatoes", "yellow and black bananas")
df1 <- data.frame(Sentence = df1, stringsAsFactors = FALSE)

Word <- c("green", "purple", "grapes", "small", "sack", "yellow", "bananas", "large")
Rank <- c(20,18,22,16,15,17,6,12)
df2 <- data.frame(Word,Rank, stringsAsFactors = FALSE)
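
For comparison, the same idea can be sketched in base R without the reshape and joins. top_word() is a hypothetical helper name, not part of any package. It lowercases each sentence, splits on anything that is not a letter a-z (so it assumes plain English text), and matches whole words only.

```r
# Same data as above, repeated so the snippet runs on its own
df1 <- data.frame(
  Sentence = c("A large bunch of purple grapes", "large green potato sack",
               "small red tomatoes", "yellow and black bananas"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Word = c("green", "purple", "grapes", "small", "sack", "yellow", "bananas", "large"),
  Rank = c(20, 18, 22, 16, 15, 17, 6, 12),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the highest-ranked lookup word found in one sentence
top_word <- function(sentence, lookup) {
  # lowercase, then split on runs of non-letters to get whole-word tokens
  tokens <- strsplit(tolower(sentence), "[^a-z]+")[[1]]
  hits <- lookup[tolower(lookup$Word) %in% tokens, ]
  if (nrow(hits) == 0) return(NA_character_)  # no word from lookup in this sentence
  hits$Word[which.max(hits$Rank)]
}

df1$Word <- vapply(df1$Sentence, top_word, character(1),
                   lookup = df2, USE.NAMES = FALSE)
```

A sentence with no word from df2 gets NA, which matches the behaviour of the right_join() in the pipeline above.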
