Read the first two lines of each document in a corpus in R

Question

I am having trouble figuring out how to read the first two lines of each document in a corpus in R. The first two lines contain headlines from news articles that I want to analyze. I want to search the headlines (not the rest of each text) for the word 'abortion.'

Here is my code for creating the corpus:

myCorp <- corpus(readtext(file='~/R/win-library/3.3/quanteda/Abortion/1972/*'))

I have tried using readLines in a for loop:

for (mycorp in myCorp) {
titles <- readLines(mycorp, n = 2)
write.table(mycorp, "1972_text_P.txt", sep="\n\n", append=TRUE)
write.table(titles, "1972_text_P.txt", append=TRUE)
}

Error in readLines(mycorp, n = 2) : 'con' is not a connection

I have intentionally not created a DFM because I want to keep the 465 files as single documents in the corpus. How can I get the headlines from the article texts? Or, ideally, how would I search only the first two lines of each document for a keyword (abortion) and create a file that contains only those headlines with the keyword in them? Thanks for any and all help with this.

Ken Benoit · Accepted Answer · 2017-04-02T15:52:00

I'd suggest two options:

regex substitution to keep just first 2 lines

If the first two lines contain everything you need, extract them with a regular expression that captures those two lines and discards the rest.

@rconradin's solution works, but as you will note in ?corpus, we strongly discourage directly accessing a corpus object's internals (since they will change soon). Avoiding the loop is also faster.

# test corpus for demonstration
testcorp <- corpus(c(
    d1 = "This is doc1, line 1.\nDoc1, Line 2.\nLine 3 of doc1.",
    d2 = "This is doc2, line 1.\nDoc2, Line 2.\nLine 3 of doc2."
))

summary(testcorp)
## Corpus consisting of 2 documents.
## 
##  Text Types Tokens Sentences
##    d1    12     17         3
##    d2    12     17         3

Now overwrite the texts with just the first two lines. (This also discards the second newline; if you wish to keep it, just move it into the first capture group.)

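# `.` does not match a newline, so [\\s\\S]* is used to consume every
# remaining line (the original trailing .* only removed a single third line)
texts(testcorp) <- 
    stringi::stri_replace_all_regex(texts(testcorp), "^(.*\\n.*)(\\n)[\\s\\S]*", "$1")
summary(testcorp)
## Corpus consisting of 2 documents.
## 
##  Text Types Tokens Sentences
##    d1    10     12         2
##    d2    10     12         2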

texts(testcorp)
##                                     d1                                     d2 
## "This is doc1, line 1.\nDoc1, Line 2." "This is doc2, line 1.\nDoc2, Line 2."
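To go one step further and keep only the headlines that mention the keyword, something along these lines may work. It is a minimal base-R sketch rather than quanteda functionality: `txts` is a made-up stand-in for the named character vector that texts(myCorp) would return, `first_lines()` is a helper invented for this example, and the output goes to a temporary file instead of the 1972_text_P.txt from the question.

```r
# Stand-in for texts(myCorp): a named character vector, one element per document.
# (Hypothetical data; in practice this would come from the corpus.)
txts <- c(
    d1 = "Court Hears Abortion Case\nJustices weigh state law\nBody text of article one.",
    d2 = "City Budget Approved\nCouncil votes 5-2\nThis body mentions abortion, but the headline does not."
)

# Hypothetical helper: keep the first n lines of each text.
first_lines <- function(x, n = 2) {
    vapply(strsplit(x, "\n", fixed = TRUE),
           function(lines) paste(head(lines, n), collapse = "\n"),
           character(1))
}

headlines <- first_lines(txts)

# Case-insensitive search restricted to the headlines, not the article bodies
hits <- headlines[grepl("abortion", headlines, ignore.case = TRUE)]

# Write only the matching headlines, one block per document
outfile <- tempfile(fileext = ".txt")
writeLines(paste0(names(hits), ":\n", hits), outfile, sep = "\n\n")
```

Because the search runs on `headlines` only, a document such as d2 that mentions the keyword in its body but not in its first two lines is left out.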

using `corpus_segment()`:

Another option is to split each document into one segment per line with corpus_segment(), then keep only the first two segments of each document:

testcorp2 <- corpus_segment(testcorp, what = "other", delimiter = "\\n", 
                            valuetype = "regex")
summary(testcorp2)
## Corpus consisting of 6 documents.
## 
##  Text Types Tokens Sentences
##  d1.1     7      7         1
##  d1.2     5      5         1
##  d1.3     5      5         1
##  d2.1     7      7         1
##  d2.2     5      5         1
##  d2.3     5      5         1

# get the serial number from each docname
docvars(testcorp2, "sentenceno") <- 
    as.integer(gsub(".*\\.(\\d+)", "\\1", docnames(testcorp2)))
summary(testcorp2)
## Corpus consisting of 6 documents.
## 
##  Text Types Tokens Sentences sentenceno
##  d1.1     7      7         1          1
##  d1.2     5      5         1          2
##  d1.3     5      5         1          3
##  d2.1     7      7         1          1
##  d2.2     5      5         1          2
##  d2.3     5      5         1          3

testcorp3 <- corpus_subset(testcorp2, sentenceno <= 2)
texts(testcorp3)
##                    d1.1                    d1.2                    d2.1                    d2.2 
## "This is doc1, line 1."         "Doc1, Line 2." "This is doc2, line 1."         "Doc2, Line 2."
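With one segment per headline line, the keyword search becomes a filter over the segments. The following is a rough base-R sketch under stated assumptions: `segs` is hypothetical data standing in for texts(testcorp3), and it assumes segment names follow the "<docname>.<number>" pattern shown in the output above.

```r
# Stand-in for texts(testcorp3): segment texts named "<doc>.<line number>".
# (Hypothetical data shaped like the output above.)
segs <- c(
    d1.1 = "Abortion Ruling Expected", d1.2 = "Court to decide Monday",
    d2.1 = "Transit Fares Rise",       d2.2 = "Riders protest increase"
)

# Segments whose text mentions the keyword, ignoring case
hit <- grepl("abortion", segs, ignore.case = TRUE)

# Strip the trailing ".<n>" to recover the parent document name
parent <- sub("\\.\\d+$", "", names(segs))

# Documents with at least one matching headline line
docs_with_hit <- unique(parent[hit])

# All headline lines belonging to those documents
matched <- segs[parent %in% docs_with_hit]
matched
```

Matching at the segment level and then mapping back to the parent document keeps both headline lines of an article together, even when only one of them contains the keyword.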
