Quanteda - creating a corpus from a dataframe with multiple documents

Question

First question here, so apologies for any faux pas. I have a dataframe in R of 657 observations with 4 variables. Each observation is a speech or interview by the Australian Prime Minister. The variables are:

date
title
URL
txt (full text of the speech/interview).

I'm trying to turn that into a corpus in Quanteda

I first tried corp <- corpus(all_content) but that gave me an error message

Error in corpus.data.frame(all_content) : 
  text_field column not found or invalid

This worked though: corp <- corpus(paste(all_content))

Then summary(corp) which gave me

Corpus consisting of 4 documents, showing 4 documents:

  Text Types  Tokens Sentences
 text1   243    1316         1
 text2  1095    6523         3
 text3   661    2630         1
 text4 25243 1867648     62572

My understanding is that this has effectively turned each column into a document, rather than each row?

If it matters, the txt variable is saved as a list. The code used to create each row is

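```{r new_function}
library(rvest)      # read_html(), html_nodes(), html_text(), %>%
library(lubridate)  # dmy()
library(tibble)     # tibble()

scrape_speech <- function(url) {
  speech_page <- read_html(url)

  date <- speech_page %>% html_nodes(".date-display-single") %>% html_text() %>% dmy()
  title <- speech_page %>% html_nodes(".pagetitle") %>% html_text()
  # list() wraps the vector of paragraphs so that each page yields a single row
  txt <- speech_page %>% html_nodes("#block-system-main p") %>% html_text() %>% list()

  tibble(date = date, title = title, URL = url, txt = txt)
}
```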

I then used the map_dfr function to go through and scrape the 657 separate URLs.

Someone has suggested that this happens because txt is saved as a list. I've tried without the list() in the function and I get 21,904 observations, as each paragraph of the full text becomes a separate observation. I can turn that into a corpus with corp <- corpus(paste(all_content_not_list)) (once again, without the paste I get the same error as above). That similarly gives me 4 documents in the corpus! summary(corp) gives me

Corpus consisting of 4 documents, showing 4 documents:

  Text Types  Tokens Sentences
 text1   243   43810         1
 text2  1092  214970        25
 text3   657   87618         1
 text4 25243 1865687     62626

Thanks in advance, Daniel

Ken Benoit · Accepted Answer · 2021-04-08T08:48:56

It's hard to address this problem exactly, because there is no reproducible example of your data.frame object, but if the structure contains the variables you list, then this should do it:

corpus(all_content, text_field = "txt")

See ?corpus.data.frame for details. If that does not do it, then try adding the output to your question of

str(all_content)

so that we can see in more detail what is in your all_content object.
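As a side note on why corpus(paste(all_content)) reported four documents: paste() applied to a data frame coerces each column to a single string, so the result has one element per column rather than one per row. A minimal base R sketch, where the data frame toy and its contents are made up for illustration:

```r
# Hypothetical toy data standing in for all_content: 3 rows, 2 columns
toy <- data.frame(
  title = c("Speech A", "Speech B", "Speech C"),
  txt   = c("First text.", "Second text.", "Third text."),
  stringsAsFactors = FALSE
)

# paste() treats the data frame as a list of columns and
# deparses each column into one long string
pasted <- paste(toy)

length(pasted)  # 2: one element per column
nrow(toy)       # 3: the number of documents you actually want
```

With four columns in all_content, that coercion would account for the four "documents" in the summary.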

Edited following OP's addition of new data:

OK, so txt in your tibble is a list of character vectors. You need to combine each element into a single character string in order to use it as input to corpus.data.frame(). Here's how:

library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

dframe <- structure(list(
  date = structure(18620, class = "Date"),
  title = " Prime Minister's Christmas Message to the ADF",
  URL = "https://www.pm.gov.au/media/prime-ministers-christmas-message-adf",
  txt = list(c(
    "G'day and Merry Christmas to everyone in our Australian Defence Force.",
    "You know, throughout our history, successive Australian governments... And this year was no different.",
    "God bless."
  ))
),
row.names = c(NA, -1L),
class = c("tbl_df", "tbl", "data.frame")
)

dframe$txt <- vapply(dframe$txt, paste, character(1), collapse = " ")

corp <- corpus(dframe, text_field = "txt")
print(corp, max_nchar = -1)
## Corpus consisting of 1 document and 3 docvars.
## text1 :
## "G'day and Merry Christmas to everyone in our Australian Defence Force. You know, throughout our history, successive Australian governments... And this year was no different. God bless."
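The same step should carry over to a data frame with more than one row, giving one document per row. The following is a sketch under assumptions: the two-row data frame toy is invented for illustration, and the quanteda calls only run if the package is installed.

```r
# Hypothetical two-row data frame with a list column of paragraphs
toy <- data.frame(title = c("Speech A", "Speech B"), stringsAsFactors = FALSE)
toy$txt <- list(c("Para one.", "Para two."), "Only para.")

# Collapse each list element into one string, so txt becomes
# a plain character column with one value per row
toy$txt <- vapply(toy$txt, paste, character(1), collapse = " ")
str(toy$txt)

# Build the corpus only when quanteda is available
if (requireNamespace("quanteda", quietly = TRUE)) {
  corp <- quanteda::corpus(toy, text_field = "txt")
  print(quanteda::ndoc(corp))  # number of documents in the corpus
}
```

vapply() is used rather than sapply() because it fails loudly if any element does not collapse to exactly one string.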

Created on 2021-04-08 by the reprex package (v1.0.0)
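An alternative, not part of the answer above, would be to collapse the paragraphs at scrape time, replacing list() with paste(collapse = " ") so that txt is already a single string per page. A minimal sketch, assuming rvest is installed; the inline HTML is a made-up stand-in for a real speech page:

```r
library(rvest)  # assumed installed; provides read_html(), html_nodes(), html_text(), %>%

# Hypothetical inline HTML in place of a page fetched from a URL
page <- read_html(
  '<div id="block-system-main"><p>First paragraph.</p><p>Second paragraph.</p></div>'
)

txt <- page %>%
  html_nodes("#block-system-main p") %>%
  html_text() %>%
  paste(collapse = " ")  # one string per page instead of a list of paragraphs

txt
```

With txt built this way inside scrape_speech(), each URL would still produce one row, and corpus(all_content, text_field = "txt") should work without the later vapply() step.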
