Quanteda Corpus Segment Documentation

Question

I am working with the quanteda package and would like to segment my corpus wherever at least two whitespace characters (regex \s) occur in a row. However, I am unsure how the corpus_segment() function really works. I have constructed the following small example to illustrate my question:

test <- "start  middle  end" 
test <- corpus(test)
test
Corpus consisting of 1 document and 0 docvars.
texts(test)
               text1 
"start  middle  end"

Now I would like to segment the document according to the pattern of at least two regex space characters:

texts(corpus_segment(test, pattern="\\s{2,}", valuetype = "regex"))
 text1.1  text1.2 
"middle"    "end"

The word in front of the first pattern match has now been removed. Looking at the documentation, I saw that extract_pattern is TRUE by default. However, I do not see why it also removes the word in front of the first pattern match. My initial guess was that it has something to do with the argument pattern_position, and indeed if I set it to "after" the following happens:

texts(corpus_segment(test, pattern="\\s{2,}", valuetype = "regex", pattern_position="after"))
 text1.1  text1.2 
 "start" "middle"

So now the word after the last pattern match is cut out. I figured out that setting extract_pattern to FALSE retains all three words and does what I intended:

texts(corpus_segment(test, pattern="\\s{2,}", valuetype = "regex",  extract_pattern=FALSE))
 text1.1  text1.2  text1.3 
 "start" "middle"    "end"

The documentation of the function states the following:

"extract_pattern: extracts matched patterns from the texts and save in docvars if TRUE"

"pattern_position: either "before" or "after", depending on whether the pattern precedes the text (as with a user-supplied tag, such as ##INTRO in the examples below) or follows the text (as with punctuation delimiters)"

I do not see how this documentation explains why "start" or "end" is cut out depending on the pattern_position parameter.

Ken Benoit · Accepted Answer · 2020-02-24T21:52:18

That's a good question, and I am filing an issue to see if this is the behaviour we intended.

Note that char_segment() works in the same way.

library("quanteda")
## Package version: 2.0.0

txt <- "start middle end"
corp <- corpus(txt)

corpus_segment(corp, " ", extract_pattern = FALSE)
## Corpus consisting of 3 documents.
## text1.1 :
## "start"
## 
## text1.2 :
## "middle"
## 
## text1.3 :
## "end"
corpus_segment(corp, " ", extract_pattern = TRUE)
## Corpus consisting of 2 documents and 1 docvar.
## text1.1 :
## "middle"
## 
## text1.2 :
## "end"

char_segment(txt, " ", remove_pattern = FALSE)
## [1] "start"  "middle" "end"
char_segment(txt, " ", remove_pattern = TRUE)
## [1] "middle" "end"
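
One way to read the output above is that, with pattern_position = "before", each match is treated as a tag that opens a segment, so text ahead of the first match belongs to no segment; with "after", each match closes a segment, so text behind the last match is left over. The base R sketch below only reproduces the outputs shown in this thread under that reading. It is not quanteda's implementation, and segment_sketch() is a hypothetical helper written for illustration.

```r
# Hypothetical helper (not part of quanteda): a mental model of how the
# pattern position decides which piece of text is left out.
segment_sketch <- function(x, pattern, position = c("before", "after")) {
  position <- match.arg(position)
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  if (m[1] == -1) return(x)  # assumption: no match leaves the text whole
  starts <- as.integer(m)
  ends <- starts + attr(m, "match.length") - 1L
  if (position == "before") {
    # Each segment starts right after a match and runs to the next match
    # (or to the end of the text); text before the first match is skipped.
    from <- ends + 1L
    to <- c(starts[-1] - 1L, nchar(x))
  } else {
    # Each segment ends right before a match; text after the last match
    # is skipped.
    from <- c(1L, ends[-length(ends)] + 1L)
    to <- starts - 1L
  }
  out <- substring(x, from, to)
  out[nzchar(out)]  # drop empty pieces
}

segment_sketch("start  middle  end", "\\s{2,}", "before")
## [1] "middle" "end"
segment_sketch("start  middle  end", "\\s{2,}", "after")
## [1] "start"  "middle"
```

Under this reading, a plain delimiter such as whitespace fits neither position well, which is why extract_pattern = FALSE (or remove_pattern = FALSE for char_segment()) is the setting that keeps all three words.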

There are of course other ways to do this, prior to constructing the quanteda corpus object from your character vector, such as the following. (Wrap them in unlist() if you want a vector back.)

test <- "start  middle  end"

stringi::stri_split_regex(test, "\\p{Zs}{2,}")
## [[1]]
## [1] "start"  "middle" "end"

base::strsplit(test, "\\s{2,}")
## [[1]]
## [1] "start"  "middle" "end"
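
As a follow-up to the split-first approach, the pieces can be flattened and handed to corpus() directly. This is a minimal sketch: the document names (text1.1, text1.2, ...) are an assumption chosen here to resemble what corpus_segment() produces, and the corpus step is guarded so the snippet still runs where quanteda is not installed.

```r
test <- "start  middle  end"

# Split on runs of two or more whitespace characters, then flatten the list.
segments <- unlist(strsplit(test, "\\s{2,}"))

# Hypothetical naming scheme, chosen to resemble corpus_segment() output.
names(segments) <- paste0("text1.", seq_along(segments))

# Build the corpus only if quanteda is available; corpus() takes the names
# of a character vector as document names.
if (requireNamespace("quanteda", quietly = TRUE)) {
  corp <- quanteda::corpus(segments)
  print(quanteda::ndoc(corp))
  ## [1] 3
}
```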
