I am working with the quanteda package and would like to segment my corpus wherever at least two whitespace characters (regex \s) occur in a row. However, I am unsure how the corpus_segment() function really works. I have constructed the following small example to illustrate my questions:
test <- "start  middle  end"
test <- corpus(test)
test
Corpus consisting of 1 document and 0 docvars.
texts(test)
text1
"start  middle  end"
Now I would like to segment the document wherever two or more whitespace characters occur in a row:
texts(corpus_segment(test, pattern="\\s{2,}", valuetype = "regex"))
text1.1 text1.2
"middle" "end"
The word in front of the first pattern match ("start") has been removed. Looking at the documentation, I saw that extract_pattern is TRUE by default, but I do not see why that would also remove the word in front of the first pattern match. My initial guess was that it has something to do with the argument pattern_position, and indeed, if I set it to "after", the following happens:
texts(corpus_segment(test, pattern="\\s{2,}", valuetype = "regex", pattern_position="after"))
text1.1 text1.2
"start" "middle"
So now the word after the last pattern match ("end") is cut out instead. I figured out that setting extract_pattern to FALSE retains all three words and does what I intended:
texts(corpus_segment(test, pattern="\\s{2,}", valuetype = "regex", extract_pattern=FALSE))
text1.1 text1.2 text1.3
"start" "middle" "end"
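For comparison, this is the result I expect, sketched in base R without quanteda. It is only a sanity check of the regex, and it assumes the test string contains two spaces between the words:

```r
# Base R only; the string is assumed to contain two spaces between words.
txt <- "start  middle  end"

# Split wherever two or more whitespace characters occur in a row.
pieces <- strsplit(txt, "\\s{2,}", perl = TRUE)[[1]]
print(pieces)
```

Here all three words survive the split, which matches what corpus_segment() returns with extract_pattern = FALSE.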
The documentation of the function states the following:
"extract_pattern: extracts matched patterns from the texts and save in docvars if TRUE"
"pattern_position: either "before" or "after", depending on whether the pattern precedes the text (as with a user-supplied tag, such as ##INTRO in the examples below) or follows the text (as with punctuation delimiters)"
I do not see how this documentation explains why "start" or "end" is cut out depending on the pattern_position parameter.