segment corpus in quanteda

Question

I have a single text file that contains many speeches. The file has two variables, speech_id and the text of the speech, separated by a pipe (|). I’m trying to use the corpus_segment function in quanteda to break the text into smaller documents.

The .txt file looks like this:

Speech_id|speech1140000001|This is the first speech.1140000002|The second 
speech starts here.1140000003|This is the third speech.1140000004|The fourth 
speaker says this.

I’ve tried various iterations, but can’t seem to get it to work. I’ve also tried reading the file in with the readtext function from the readtext package, but had no luck. Any help is greatly appreciated.

Ken Benoit · Accepted Answer · 2018-03-27T06:44:27

corpus_segment() should work fine. (This is based on quanteda >= 1.0.0.) Here, I am assuming that every speech ID is exactly 10 digits followed by the | character. Note that readtext would also have read this .txt file, but the result would have been a single "document" in one row, so it would still need to be segmented.

library("quanteda")

txt <- "Speech_id|speech1140000001|This is the first speech.1140000002|The second 
speech starts here.1140000003|This is the third speech.1140000004|The fourth 
speaker says this."

corp <- corpus(txt)

corpseg <- corpus_segment(corp, pattern = "\\d{10}\\|", valuetype = "regex")
# texts() is deprecated in quanteda >= 3.0; as.character(corpseg) replaces it there
texts(corpseg)
##                     text1.1                            text1.2 
## "This is the first speech." "The second \nspeech starts here." 
##                     text1.3                            text1.4 
## "This is the third speech."  "The fourth \nspeaker says this."
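
As a sanity check on the assumption that every ID is exactly 10 digits followed by |, the same regular expression can be tried in base R before segmenting. This is only a sketch using base functions (regmatches, gregexpr, strsplit), not a description of what corpus_segment() does internally; the variable names tags, pieces, ids and speeches are made up for illustration.

```r
# same sample text as above, with the line breaks written as \n
txt <- paste0(
  "Speech_id|speech1140000001|This is the first speech.1140000002|The second \n",
  "speech starts here.1140000003|This is the third speech.1140000004|The fourth \n",
  "speaker says this."
)

# assumed ID format: exactly 10 digits followed by a literal pipe
pattern <- "\\d{10}\\|"

# every substring the pattern matches (the candidate speech tags)
tags <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]

# the text between the tags; pieces[1] is whatever precedes the first tag
pieces <- strsplit(txt, pattern, perl = TRUE)[[1]]

# drop the trailing pipe from each tag, and drop the header piece
ids <- sub("|", "", tags, fixed = TRUE)
speeches <- trimws(pieces[-1])

length(tags)   # number of tags found
pieces[1]      # header text before the first tag
```

If the number of tags differs from the number of speeches you expect, the ID format assumption probably does not hold for the whole file. The header text before the first tag ends up in pieces[1]; it does not appear among the segmented documents in the output above.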

That works, but we can tidy it up a bit more by moving the extracted pattern into the docname.

# move the tag to docname after removing "|"
docnames(corpseg) <- 
    stringi::stri_replace_all_fixed(docvars(corpseg, "pattern"), "|", "")
# remove the pattern as a docvar
docvars(corpseg, "pattern") <- NULL

summary(corpseg)
## Corpus consisting of 4 documents:
##     
##       Text Types Tokens Sentences
## 1140000001     6      6         1
## 1140000002     6      6         1
## 1140000003     6      6         1
## 1140000004     6      6         1
## 
## Source: /Users/kbenoit/Dropbox (Personal)/tmp/ascharacter/* on x86_64 by kbenoit
## Created: Tue Mar 27 07:41:05 2018
## Notes: corpus_segment.corpus(corp, pattern = "\\d{10}\\|", valuetype = "regex")
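
The example above uses an inline string. To start from the file on disk instead, one option is to read it with base R and collapse the lines into a single string before calling corpus(). A minimal sketch, assuming a hypothetical file name speeches.txt (a temporary file with the sample content stands in for the real one here):

```r
# hypothetical path; replace with the location of the real .txt file
path <- file.path(tempdir(), "speeches.txt")

# write the sample content so that the sketch is self-contained
writeLines(
  c("Speech_id|speech1140000001|This is the first speech.1140000002|The second ",
    "speech starts here.1140000003|This is the third speech.1140000004|The fourth ",
    "speaker says this."),
  path
)

# read all lines and collapse them into one string, so that a speech
# wrapped across several lines stays in one piece
txt_from_file <- paste(readLines(path, warn = FALSE), collapse = "\n")

length(txt_from_file)   # a single character string

# from here the steps are the same as with the inline string:
# corp <- corpus(txt_from_file), followed by corpus_segment()
```

Collapsing with "\n" keeps the original line breaks inside each speech, which matches the "\n" visible in the segmented texts above.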
