What is the right way to make corpus with readtext and quanteda?

Question

I need some help. I'm trying to make some corpus samples using the quanteda package, but it doesn't work as expected.

library(quanteda)
library(readtext)

news <- corpus(readtext('./final/en_US/en_US.news.txt', dvsep = ' '))
#Yeah, it's from Coursera

And then I try to take a sample from the whole corpus:

set.seed(362)
newsSample <- corpus_sample(news, size = 5000)

RStudio tells me that it "Cannot take a sample larger than the population", but I'm sure the population is much larger than size: the file has about 77k lines. One more thing: after readtext I get a data frame with 1 obs. of 2 variables, and the second variable holds the whole text of the file.

What am I doing wrong?

It seems I've found one possible solution: not using the readtext package at all, even though most manuals I found today present it as the obvious partner for quanteda. I used readLines instead, and the corpus is just what I expected. I'd be glad if somebody has a solution with readtext. - Sukharkov

phiver · Accepted Answer · 2020-12-25T09:46:25

You only have 1 document in the corpus when using readtext to read in a single document. There might be 77k lines in the document, but it comes only from 1 document, not 77k documents. If you check the outcome of readtext you will see only 1 value in the column doc_id, and all the text would be in a single cell of the text column. See the differences in the example below.

library(readtext)
library(quanteda)
DATA_DIR <- system.file("extdata/", package = "readtext")

rt2 <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/EU_euro_2004_de_PSE.txt"),
                docvarsfrom = "filenames", 
                docvarnames = c("unit", "context", "year", "language", "party"),
                encoding = "LATIN1")
rt2
readtext object consisting of 1 document and 5 docvars.
# Description: df[,7] [1 x 7]
  doc_id                  text                unit  context  year language party
  <chr>                   <chr>               <chr> <chr>   <int> <chr>    <chr>
1 EU_euro_2004_de_PSE.txt "\"PES · PSE \"..." EU    euro     2004 de       PSE  

my_corp <- corpus(rt2)
my_corp
Corpus consisting of 1 document and 5 docvars.
EU_euro_2004_de_PSE.txt :
"PES · PSE · SPE European Parliament rue Wiertz B 1047 Brusse..."

and

rl1 <- readLines(paste0(DATA_DIR, "/txt/EU_manifestos/EU_euro_2004_de_PSE.txt"))
           
my_corp_rl1 <- corpus(rl1)
my_corp_rl1
Corpus consisting of 100 documents.
text1 :
"PES · PSE · SPE European Parliament rue Wiertz B 1047 Brusse..."

text2 :
""

text3 :
"GEMEINSAM WERDEN WIR STÄRKER Fünf Verpflichtungen für die nä..."

text4 :
"Manifest der Sozialdemokratischen Partei Europas für die Wah..."

text5 :
"PARTY OF EUROPEAN SOCIALISTS · Tel +32 2 284 29 76 · Fax +32..."

text6 :
""

[ reached max_ndoc ... 94 more documents ]

Calling readLines and then corpus creates a corpus with 100 documents, but these are just the lines that were read in, which is not a correct definition of a corpus.

corpus_sample samples the documents in the corpus. So if you have 100 documents in there, corpus_sample(my_corpus, 50) would sample 50 different documents.
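
To make the document-count point concrete, here is a minimal sketch that does not touch any file. The string one_text is a made-up stand-in for a file that was read in as a single text, and splitting it on newlines is only an assumption about what counts as a "document" in your data (it may fit a file with one news item per line, but not a manifesto).

```r
library(quanteda)

# Hypothetical stand-in for a whole file read in as one long string
one_text <- paste(sprintf("This is line number %d.", 1:20), collapse = "\n")

# One string in, one document out
corp_one <- corpus(one_text)
ndoc(corp_one)

# Sampling 5 documents from a 1-document corpus fails; capture the message
tryCatch(corpus_sample(corp_one, size = 5),
         error = function(e) conditionMessage(e))

# Split the single text into lines and drop empty ones,
# so that each line becomes its own document
line_texts <- strsplit(one_text, "\n", fixed = TRUE)[[1]]
line_texts <- line_texts[nzchar(line_texts)]
corp_lines <- corpus(line_texts)
ndoc(corp_lines)

# Now there are enough documents to sample from
set.seed(362)
corp_sample <- corpus_sample(corp_lines, size = 5)
ndoc(corp_sample)
```

The same split could be applied to the text column of a readtext result, assuming each line really is a separate unit of analysis.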

You need to decide what kind of sampling you need: documents or features. If features, you need to use dfm_sample with margin = "features". See the help in quanteda for more info. Also consider whether the sampling should happen after text cleaning, removing stopwords, and so on.
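
As a rough sketch of sampling after cleaning, the example below uses four made-up sentences, removes punctuation and English stopwords, builds a dfm, and then samples from it. Document sampling uses dfm_sample with its default behaviour. For features, the sketch falls back on plain column indexing with base sample(), because I am not certain the margin argument is available in every quanteda version. That is a swapped-in technique, not the call named above.

```r
library(quanteda)

# Made-up example texts; the names d1..d4 become the document names
texts <- c(d1 = "The cat sat on the mat.",
           d2 = "A dog chased the cat.",
           d3 = "Birds sing in the morning.",
           d4 = "The dog sleeps at night.")
corp <- corpus(texts)

# Clean first: tokenize, drop punctuation, drop English stopwords
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
mat <- dfm(toks)

set.seed(362)

# Sample 2 documents (rows) from the cleaned dfm
doc_sample <- dfm_sample(mat, size = 2)

# Sample 3 features (columns) by indexing with random column positions
feat_sample <- mat[, sample(nfeat(mat), 3)]

ndoc(doc_sample)
featnames(feat_sample)
```

Sampling before or after cleaning gives different populations to draw from, so it is worth fixing that order before comparing results.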
