corpus_segment() should work fine. (This is based on quanteda >= 1.0.0.) Here, I am assuming that all speech IDs are 10 digits followed by the | character. Note that readtext would also have worked to read this .txt file, but the result would have been a single "document" in one row.
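Before segmenting, the assumption about the ID format can be checked with base R alone. This is only a sketch: `sample_txt` is a made-up sample in the same layout (including one deliberately malformed 9-digit ID), and `id_pattern` is just a name for the regular expression used in the answer.

```r
# Hypothetical sample in the same layout as the question's file;
# the last ID has only 9 digits, so it should NOT be matched
sample_txt <- paste0(
  "Speech_id|speech\n",
  "1140000001|This is the first speech.\n",
  "1140000002|The second \nspeech starts here.\n",
  "114000003|Too short to be an ID."
)

# Same regular expression as in the answer: 10 digits, then a literal "|"
id_pattern <- "\\d{10}\\|"

# Extract every match of the pattern from the single string
tags <- regmatches(sample_txt, gregexpr(id_pattern, sample_txt, perl = TRUE))[[1]]

# Strip the trailing "|" to get the bare IDs
ids <- sub("|", "", tags, fixed = TRUE)
print(ids)
```

If the number of IDs found here is lower than the number of speeches you expect, some IDs probably do not follow the 10-digit format and the pattern needs adjusting.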
library("quanteda")
txt <- "Speech_id|speech1140000001|This is the first speech.1140000002|The second
speech starts here.1140000003|This is the third speech.1140000004|The fourth
speaker says this."
corp <- corpus(txt)
corpseg <- corpus_segment(corp, pattern = "\\d{10}\\|", valuetype = "regex")
# note: texts() was deprecated in quanteda 3.0; in newer versions use as.character(corpseg)
texts(corpseg)
##                            text1.1                            text1.2
##        "This is the first speech." "The second \nspeech starts here."
##                            text1.3                            text1.4
##        "This is the third speech."  "The fourth \nspeaker says this."
That works, but we can tidy it up a bit more by using the extracted pattern (the speech ID) as each document's name.
# move the tag to docname after removing "|"
docnames(corpseg) <-
    stringi::stri_replace_all_fixed(docvars(corpseg, "pattern"), "|", "")
# remove the pattern as a docvar
docvars(corpseg, "pattern") <- NULL
summary(corpseg)
## Corpus consisting of 4 documents:
##
##        Text Types Tokens Sentences
##  1140000001     6      6         1
##  1140000002     6      6         1
##  1140000003     6      6         1
##  1140000004     6      6         1
##
## Source: /Users/kbenoit/Dropbox (Personal)/tmp/ascharacter/* on x86_64 by kbenoit
## Created: Tue Mar 27 07:41:05 2018
## Notes: corpus_segment.corpus(corp, pattern = "\\d{10}\\|", valuetype = "regex")
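
To run the same steps on a file rather than an inline string, one option is to read the file with base R and collapse it into a single string first. This is a minimal sketch, not the only way to do it: the temporary file and its contents are stand-ins for your real .txt file, and `txt_from_file` is a hypothetical name.

```r
# Stand-in for the real file: write a small sample to a temporary path.
# (Path and contents are hypothetical; substitute your own .txt file.)
path <- tempfile(fileext = ".txt")
writeLines(c("Speech_id|speech",
             "1140000001|This is the first speech.",
             "1140000002|The second",
             "speech starts here."), path)

# Read all lines and collapse them into one string, keeping the line breaks
txt_from_file <- paste(readLines(path), collapse = "\n")

# txt_from_file is a single character string, so it can be passed to
# corpus() and then corpus_segment() in the same way as txt above
length(txt_from_file)
```

Collapsing with "\n" keeps line breaks that fall inside a speech, which is why they appear as \n in the segmented texts.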