First question here, so apologies for any faux pas. I have a data frame in R with 657 observations of 4 variables. Each observation is a speech or interview by the Australian Prime Minister. The variables are:

  • date
  • title
  • URL
  • txt (full text of the speech/interview).

I'm trying to turn that into a corpus in quanteda.

I first tried corp <- corpus(all_content), but that gave me an error message:

Error in corpus.data.frame(all_content) : 
  text_field column not found or invalid

This worked though: corp <- corpus(paste(all_content))

I then ran summary(corp), which gave me:

Corpus consisting of 4 documents, showing 4 documents:

  Text Types  Tokens Sentences
 text1   243    1316         1
 text2  1095    6523         3
 text3   661    2630         1
 text4 25243 1867648     62572

My understanding is that this has effectively turned each column into a document, rather than each row?

If it matters, the txt variable is saved as a list. The code used to create each row is:

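```{r new_function}
library(rvest)      # read_html(), html_nodes(), html_text(), %>%
library(lubridate)  # dmy()
library(tibble)     # tibble()

scrape_speech <- function(url) {
  speech_page <- read_html(url)

  date  <- speech_page %>% html_nodes(".date-display-single") %>% html_text() %>% dmy()
  title <- speech_page %>% html_nodes(".pagetitle") %>% html_text()
  # list() keeps all paragraphs of one speech together in a single row
  txt   <- speech_page %>% html_nodes("#block-system-main p") %>% html_text() %>% list()

  tibble(date = date, title = title, URL = url, txt = txt)
}
```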

I then used the map_dfr function to go through and scrape the 657 separate URLs.

Someone suggested this happens because txt is saved as a list. Without the list() in the function I get 21,904 observations, because each paragraph of a speech becomes a separate observation. I can turn that into a corpus with corp <- corpus(paste(all_content_not_list)) (once again, without the paste() I get the same error as above), but that also gives me 4 documents in the corpus. summary(corp) gives me:

Corpus consisting of 4 documents, showing 4 documents:

  Text Types  Tokens Sentences
 text1   243   43810         1
 text2  1092  214970        25
 text3   657   87618         1
 text4 25243 1865687     62626

Thanks in advance, Daniel

1 Answer

It's hard to address this problem exactly, because there is no reproducible example of your data.frame object, but if the structure contains the variables you list, then this should do it:

corpus(all_content, text_field = "txt")

See ?corpus.data.frame for details. If that does not do it, then try adding to your question the output of

str(all_content)

so that we can see in more detail what is in your all_content object.
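
On why the paste() workaround gave four documents: a minimal base R sketch, using a made-up toy data frame (toy and its values are assumptions for illustration, not your data), suggests that paste() treats a data frame as a list of columns and turns each column into one long string. corpus() then sees one text per column, which would match the 4 documents you got from 4 variables.

```r
# Toy stand-in for all_content (hypothetical values: two columns, three rows)
toy <- data.frame(
  title = c("Speech A", "Speech B", "Speech C"),
  txt   = c("First text.", "Second text.", "Third text."),
  stringsAsFactors = FALSE
)

# paste() coerces the data frame with as.character(), which yields
# one deparsed string per column rather than one string per row
pasted <- paste(toy)

length(pasted)  # equals the number of columns
nrow(toy)       # the number of rows, which is what you wanted as documents
```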

Edited following OP's addition of new data:

OK, so txt in your tibble is a list of character vectors. You need to combine each of these into a single character string in order to use the column as an input to corpus.data.frame(). Here's how:

library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

dframe <- structure(list(
  date = structure(18620, class = "Date"),
  title = " Prime Minister's Christmas Message to the ADF",
  URL = "https://www.pm.gov.au/media/prime-ministers-christmas-message-adf",
  txt = list(c(
    "G'day and Merry Christmas to everyone in our Australian Defence Force.",
    "You know, throughout our history, successive Australian governments... And this year was no different.",
    "God bless."
  ))
),
row.names = c(NA, -1L),
class = c("tbl_df", "tbl", "data.frame")
)

dframe$txt <- vapply(dframe$txt, paste, character(1), collapse = " ")

corp <- corpus(dframe, text_field = "txt")
print(corp, max_nchar = -1)
## Corpus consisting of 1 document and 3 docvars.
## text1 :
## "G'day and Merry Christmas to everyone in our Australian Defence Force. You know, throughout our history, successive Australian governments... And this year was no different. God bless."

Created on 2021-04-08 by the reprex package (v1.0.0)
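
If you would rather keep the version of the scraper without list(), where each paragraph becomes its own row, a similar idea should apply: collapse the paragraphs back into one string per speech before calling corpus(). A minimal sketch in base R, assuming URL uniquely identifies a speech; paras and its contents are made up for illustration and stand in for all_content_not_list:

```r
# Toy stand-in for all_content_not_list: one row per paragraph (hypothetical data)
paras <- data.frame(
  URL = c("https://example.org/a", "https://example.org/a", "https://example.org/b"),
  txt = c("G'day.", "God bless.", "Thank you."),
  stringsAsFactors = FALSE
)

# Group the paragraphs by URL, then join each group into a single string
by_speech <- split(paras$txt, paras$URL)
collapsed <- data.frame(
  URL = names(by_speech),
  txt = vapply(by_speech, paste, character(1), collapse = " "),
  row.names = NULL,
  stringsAsFactors = FALSE
)

nrow(collapsed)  # one row per speech instead of one row per paragraph
```

The result could then be passed to corpus(collapsed, text_field = "txt") in the same way as above. Note that this sketch keeps only URL and txt; date and title would have to be carried along separately (for example by merging on URL) if you want them as docvars.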