I need some help. I'm trying to make some corpus samples using the quanteda package, but it doesn't work as expected.

library(quanteda)
library(readtext)

news <- corpus(readtext('./final/en_US/en_US.news.txt', dvsep = ' '))
#Yeah, it's from Coursera

And then I try to take a sample from the whole corpus:

set.seed(362)
newsSample <- corpus_sample(news, size = 5000)

RStudio tells me that it "Cannot take a sample larger than the population", but I'm sure the population is much larger than size: the file has about 77k lines. One more thing: after readtext I get a data frame with 1 obs. of 2 variables. The second variable is the whole text of the file.

What am I doing wrong?

It seems I've found one possible solution: not using the readtext package at all. But it is presented as the obvious partner for quanteda in most of the manuals I found today. I used readLines instead, and the corpus is just what I expected. I'd be glad if somebody has a solution with readtext. - Sukharkov

1 Answer

When you use readtext to read in a single file, you get a corpus with only 1 document. There may be 77k lines in the file, but they all belong to 1 document, not 77k documents. If you check the result of readtext, you will see only 1 value in the doc_id column, and all of the text sits in a single cell of the text column. See the differences in the example below.

library(readtext)
library(quanteda)
DATA_DIR <- system.file("extdata/", package = "readtext")

rt2 <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/EU_euro_2004_de_PSE.txt"),
                docvarsfrom = "filenames", 
                docvarnames = c("unit", "context", "year", "language", "party"),
                encoding = "LATIN1")
rt2
readtext object consisting of 1 document and 5 docvars.
# Description: df[,7] [1 x 7]
  doc_id                  text                unit  context  year language party
  <chr>                   <chr>               <chr> <chr>   <int> <chr>    <chr>
1 EU_euro_2004_de_PSE.txt "\"PES · PSE \"..." EU    euro     2004 de       PSE  

my_corp <- corpus(rt2)
my_corp
Corpus consisting of 1 document and 5 docvars.
EU_euro_2004_de_PSE.txt :
"PES · PSE · SPE European Parliament rue Wiertz B 1047 Brusse..."

and

rl1 <- readLines(paste0(DATA_DIR, "/txt/EU_manifestos/EU_euro_2004_de_PSE.txt"))
           
my_corp_rl1 <- corpus(rl1)
my_corp_rl1
Corpus consisting of 100 documents.
text1 :
"PES · PSE · SPE European Parliament rue Wiertz B 1047 Brusse..."

text2 :
""

text3 :
"GEMEINSAM WERDEN WIR STÄRKER Fünf Verpflichtungen für die nä..."

text4 :
"Manifest der Sozialdemokratischen Partei Europas für die Wah..."

text5 :
"PARTY OF EUROPEAN SOCIALISTS · Tel +32 2 284 29 76 · Fax +32..."

text6 :
""

[ reached max_ndoc ... 94 more documents ]

Using readLines and then corpus creates a corpus with 100 documents, but these are just the lines that were read in (including the empty ones), each treated as a separate document, and that is not a correct definition of a corpus.
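If one line per document really is the unit you want (for example, a file where every line is a separate item), it should also be possible to stay with readtext and split the single text on newlines yourself. This is only a sketch: the temporary file and its contents are made up, and it assumes readtext keeps the line breaks inside the text column.

```r
library(readtext)
library(quanteda)

# Made-up file with one "document" per line (illustration only)
tmp <- tempfile(fileext = ".txt")
writeLines(c("first item", "second item", "third item", "fourth item"), tmp)

rt <- readtext(tmp)        # 1 row: the whole file in a single text cell
nrow(rt)

# Split that single text on newlines; assumes "\n" is preserved in rt$text
lines <- strsplit(rt$text, "\n", fixed = TRUE)[[1]]

corp_lines <- corpus(lines)  # one document per line
ndoc(corp_lines)
```

Empty lines become empty documents here, so you may want to drop them (for example with `lines[nzchar(lines)]`) before building the corpus.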

corpus_sample samples the documents in the corpus. So if you have 100 documents in there, corpus_sample(my_corp_rl1, size = 50) would sample 50 different documents. With only 1 document in the corpus, asking for size = 5000 fails with the error you saw.
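A minimal sketch of that behaviour, using a small made-up character vector as a stand-in for the output of readLines (the `lines` vector and the object names are assumptions for illustration):

```r
library(quanteda)

# Toy stand-in for readLines() output: one element per line (made-up data)
lines <- c("first news item", "second news item", "third news item",
           "fourth news item", "fifth news item")

corp <- corpus(lines)   # each element becomes its own document
ndoc(corp)              # number of documents available for sampling

set.seed(362)
samp <- corpus_sample(corp, size = 3)   # 3 documents, without replacement
ndoc(samp)

# Asking for more documents than the corpus holds fails without replacement
too_many <- tryCatch(corpus_sample(corp, size = 10), error = function(e) e)
inherits(too_many, "error")
```

Checking ndoc() before sampling is a quick way to confirm that the corpus holds as many documents as you think it does.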

You need to decide what kind of sampling you need: documents or features. For features, use dfm_sample with margin = "features"; see the help in quanteda for more info. Also decide whether the sampling should happen after text cleaning, removing stopwords, and so on.
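If it is features you want to sample, one way that does not depend on the exact arguments of dfm_sample is to index the columns of the dfm directly. This is a sketch on made-up text, and it swaps in base sample() on the column indices rather than calling dfm_sample:

```r
library(quanteda)

# Made-up documents, for illustration only
txt <- c(d1 = "the cat sat on the mat", d2 = "the dog chased the cat")

dfmat <- dfm(tokens(txt))               # document-feature matrix
nfeat(dfmat)                            # number of features (columns)

set.seed(362)
keep <- sample(nfeat(dfmat), size = 3)  # pick 3 feature (column) indices
dfmat_feat <- dfmat[, keep]             # all documents, 3 sampled features
dim(dfmat_feat)
```

Any cleaning step you want reflected in the sample (for example tokens_remove(stopwords("en"))) would go between tokens() and dfm().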