EDIT: I have a data frame wherein column 1 has id on some texts and column 2 has the text itself as strings. I have a set of multiple words and the task is to have stringr count the number of times each word is found within the texts. The words are to be supplied as fixed, not as regex.
Two problems stand out:
(1) How to supply a vector containing multiple words as a fixed (not regex) pattern?
(2) How to append the findings to the data frame?
(3) How to do it for very large data?
A previous answer from user @akrun answered points (1) and (2), but (3) is still an issue. Here is a reproducible example.
## create a very large data.frame with the text column to be analyzed
doc_number <- c()
doc_text <- c()
for(i in 1:60000){
# generate many random strings mentioning 'proposals'
doc_number[i] <- paste0("doc_",i)
set.seed(i+3)
doc_text[i] <- paste0("This is about proposal ", "(", sample(1000:9999, 1), "/", sample(letters, 1),")",
" and about proposal ", "(", sample(1000:9999, 1), "/", sample(letters, 1),")")
}
docs_example_df <- data.frame(doc_number, doc_text)
head(docs_example_df) # resulting df has 'doc_text' column which mentions proposals
> head(docs_example_df)
doc_number doc_text
1 doc_1 This is about proposal (6623/k) and about proposal (3866/c)
2 doc_2 This is about proposal (3254/k) and about proposal (2832/u)
3 doc_3 This is about proposal (7964/j) and about proposal (1940/n)
4 doc_4 This is about proposal (8582/g) and about proposal (3753/o)
5 doc_5 This is about proposal (4254/b) and about proposal (5686/l)
6 doc_6 This is about proposal (2588/f) and about proposal (9786/c)
# create a very large vector of 'proposals' I want to extract from doc_text
my_proposals <- c()
for(i in 1:20000){
set.seed(i+8)
my_proposals[i] <- paste0("proposal ", "(", sample(1000:9999, 1), "/", sample(letters, 1),")")
}
head(my_proposals) # long list of 'proposals' I wish to locate
> head(my_proposals)
[1] "proposal (2588/f)" "proposal (1490/i)" "proposal (2785/b)" "proposal (5545/z)" "proposal (6988/j)" "proposal (1264/i)"
The previous answer by @akrun (see below) recommended several solutions which worked for a small data.frame. But in such >20k objects the functions either jam or give error such as:
Problem with mutate() input matches. x Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)
So, in short, how to apply a very long list of vectors to an also very long data.frame and store extracted matches in something like a column-list in the data.frame?
Thanks everyone