When you use readtext to read in a single file, the corpus contains only 1 document. The file may have 77k lines, but it is still 1 document, not 77k documents. If you check the output of readtext, you will see only 1 value in the doc_id column, and all the text sits in a single cell of the text column. See the difference between the two examples below.
library(readtext)
library(quanteda)
DATA_DIR <- system.file("extdata/", package = "readtext")
rt2 <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/EU_euro_2004_de_PSE.txt"),
                docvarsfrom = "filenames",
                docvarnames = c("unit", "context", "year", "language", "party"),
                encoding = "LATIN1")
rt2
readtext object consisting of 1 document and 5 docvars.
# Description: df[,7] [1 x 7]
doc_id text unit context year language party
<chr> <chr> <chr> <chr> <int> <chr> <chr>
1 EU_euro_2004_de_PSE.txt "\"PES · PSE \"..." EU euro 2004 de PSE
my_corp <- corpus(rt2)
my_corp
Corpus consisting of 1 document and 5 docvars.
EU_euro_2004_de_PSE.txt :
"PES · PSE · SPE European Parliament rue Wiertz B 1047 Brusse..."
Compare this with reading the same file with readLines:
rl1 <- readLines(paste0(DATA_DIR, "/txt/EU_manifestos/EU_euro_2004_de_PSE.txt"))
my_corp_rl1 <- corpus(rl1)
my_corp_rl1
Corpus consisting of 100 documents.
text1 :
"PES · PSE · SPE European Parliament rue Wiertz B 1047 Brusse..."
text2 :
""
text3 :
"GEMEINSAM WERDEN WIR STÄRKER Fünf Verpflichtungen für die nä..."
text4 :
"Manifest der Sozialdemokratischen Partei Europas für die Wah..."
text5 :
"PARTY OF EUROPEAN SOCIALISTS · Tel +32 2 284 29 76 · Fax +32..."
text6 :
""
[ reached max_ndoc ... 94 more documents ]
Using readLines and then corpus creates a corpus with 100 documents, but these are just the individual lines that were read in, and a line is not a meaningful definition of a document in a corpus.
corpus_sample samples the documents in the corpus. So if you have 100 documents in there, corpus_sample(my_corpus, 50) draws 50 different documents at random (without replacement by default).
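As a minimal sketch of document-level sampling, the snippet below builds a small toy corpus and draws 3 of its 5 documents. The corpus contents and the names toy_corp and toy_sample are made up for illustration; they are not from the question.

```r
library(quanteda)

# Toy corpus with five short documents (hypothetical data, for illustration only)
toy_corp <- corpus(c(d1 = "apples and oranges",
                     d2 = "oranges and pears",
                     d3 = "pears and plums",
                     d4 = "plums and cherries",
                     d5 = "cherries and apples"))

set.seed(123)  # make the random draw reproducible

# Draw 3 whole documents, without replacement (the default)
toy_sample <- corpus_sample(toy_corp, size = 3)

ndoc(toy_sample)      # number of sampled documents
docnames(toy_sample)  # which documents were drawn
```

The unit being sampled is always a whole document, which is why a corpus built from a single file has nothing useful to sample from.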
You need to decide which kind of sampling you need: documents or features. For features, use dfm_sample with margin = "features"; see the quanteda help for more info. You also need to decide whether the sampling should happen after text cleaning (removing stopwords and so on).
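A rough sketch of feature-level sampling after cleaning might look like the following. It assumes a toy two-document corpus (hypothetical data) and, because the margin argument of dfm_sample may not be available in every quanteda version, it samples feature columns by plain column indexing instead; treat that as one possible workaround rather than the canonical approach.

```r
library(quanteda)

# Toy corpus (hypothetical data, for illustration only)
toy_corp <- corpus(c(d1 = "The apples and the oranges are ripe",
                     d2 = "The pears and the plums are not ripe"))

# Clean first: tokenize, drop punctuation and English stopwords
toy_toks <- tokens(toy_corp, remove_punct = TRUE)
toy_toks <- tokens_remove(toy_toks, stopwords("en"))
toy_dfm  <- dfm(toy_toks)

set.seed(123)  # make the random draw reproducible

# Sample 3 feature columns by position, keeping all documents
keep <- sample(nfeat(toy_dfm), size = 3)
toy_dfm_feat <- toy_dfm[, keep]

featnames(toy_dfm_feat)  # the sampled features
```

Sampling after cleaning means stopwords and punctuation can never be drawn; sampling before cleaning would give a different set of candidate features.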