First question here, so apologies for any faux pas. I have a data frame in R of 657 observations with 4 variables. Each observation is a speech or interview by the Australian Prime Minister. The variables are:
- date
- title
- URL
- txt (full text of the speech/interview).
I'm trying to turn that into a corpus in Quanteda.
I first tried corp <- corpus(all_content)
but that gave me an error message
Error in corpus.data.frame(all_content) :
text_field column not found or invalid
This worked though: corp <- corpus(paste(all_content))
Then `summary(corp)`, which gave me:
Corpus consisting of 4 documents, showing 4 documents:
Text Types Tokens Sentences
text1 243 1316 1
text2 1095 6523 3
text3 661 2630 1
text4 25243 1867648 62572
My understanding is that this has effectively turned each column into a document, rather than each row?
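A minimal sketch that seems consistent with this reading, using a made-up toy data frame (`toy` and `collapsed` are hypothetical names, not my real data). `paste()` applies `as.character()` to the data frame as if it were a list, so it should return one string per column:

```r
# Hypothetical toy data frame: 3 rows, 2 columns
toy <- data.frame(
  title = c("Speech A", "Speech B", "Speech C"),
  txt   = c("Good morning.", "Thank you all.", "Any questions?"),
  stringsAsFactors = FALSE
)

collapsed <- paste(toy)  # as.character() is applied to the data frame as a list

length(collapsed)  # one string per column, not one per row
collapsed[1]       # deparsed form of the whole title column
```

If the same thing happens with `all_content`, the 4 "documents" would simply be the 4 columns, each collapsed into a single string.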
If it matters, the `txt` variable is saved as a list. The code used to create each row is:
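```{r new_function}
library(rvest)      # read_html(), html_nodes(), html_text(), %>%
library(lubridate)  # dmy()
library(tibble)     # tibble()
scrape_speech <- function(url) {
  speech_page <- read_html(url)
  date <- speech_page %>% html_nodes(".date-display-single") %>% html_text() %>% dmy()
  title <- speech_page %>% html_nodes(".pagetitle") %>% html_text()
  # list() keeps all paragraphs of one speech together in a single list-column cell
  txt <- speech_page %>% html_nodes("#block-system-main p") %>% html_text() %>% list()
  tibble(date = date, title = title, URL = url, txt = txt)
}
```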
I then used the `map_dfr` function to go through and scrape the 657 separate URLs.
Someone has suggested to me that it is because `txt` is saved as a list. I've tried without the `list()` in the function, and I get 21,904 observations, as each paragraph of each speech becomes a separate observation. I can turn that into a corpus with `corp <- corpus(paste(all_content_not_list))`. (Once again, without the `paste` I get the same error as above.) That similarly gives me 4 documents in the corpus!
`summary(corp)` gives me:
Corpus consisting of 4 documents, showing 4 documents:
Text Types Tokens Sentences
text1 243 43810 1
text2 1092 214970 25
text3 657 87618 1
text4 25243 1865687 62626
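For reference, the error message mentions `text_field`, which in `corpus()` for data frames defaults to `"text"` and needs to point at a character column. A minimal sketch of collapsing a list column into one string per speech and naming the column explicitly, assuming a made-up toy data frame (`toy` and `corp_toy` are hypothetical, and the `"\n\n"` separator is an arbitrary choice):

```r
# Hypothetical stand-in for all_content: two speeches, txt is a list column
toy <- data.frame(title = c("Speech A", "Speech B"), stringsAsFactors = FALSE)
toy$txt <- list(c("First paragraph.", "Second paragraph."), "Only paragraph.")

# Collapse each speech's paragraphs into a single string
toy$txt <- vapply(toy$txt, paste, character(1), collapse = "\n\n")

is.character(toy$txt)  # the column is now a plain character vector
nrow(toy)              # still one row per speech

# Only runs if quanteda is installed: name the text column explicitly
if (requireNamespace("quanteda", quietly = TRUE)) {
  corp_toy <- quanteda::corpus(toy, text_field = "txt")
  print(quanteda::ndoc(corp_toy))
}
```

Whether this carries over to the real `all_content` is an assumption; it depends on every `txt` cell being a character vector of paragraphs.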
Thanks in advance, Daniel