I'm trying to scrape a load of debates using rvest. The debates are on different webpages and I collect the urls for these webpages from a search result. There are over 1000 pages of search results, with 20,000 pages of debates (i.e. 20,000 urls).
My current approach successfully scrapes the data I need from the debate pages; however, for anything more than 20 pages of search results (i.e. only 400 of the 20,000 urls) the processing takes an extremely long time.
I'm currently using a for loop which iterates through my list of urls and scrapes 5 html nodes with the content I need (see below). This creates a vector for each node of content I'm scraping, which I then combine into a data frame for analysis. I think this approach means that I call each webpage 5 separate times for the different html nodes I need.
Is there any way to scrape this more efficiently? I'm sure there is a way to scrape all 5 nodes with a single call to each url rather than calling it 5 times. Would it also be possible to populate the data frame dynamically in the for loop rather than storing 5 separate vectors? And maybe I could use parallel processing to scrape multiple urls at the same time? I'm pretty much stumped.
library(rvest)

#create empty vectors to hold the scraped content
speakerid <- c()
parties <- c()
contributions <- c()
titles <- c()
debatedates <- c()

#for loop to scrape relevant content
for (i in debate_urls$url) {
  debate_urls <- read_html(i)
  speaker <- debate_urls %>% html_nodes(".debate-speech__speaker__name") %>% html_text()
  speakerid <- append(speakerid, speaker)

  debate_urls <- read_html(i)
  party <- debate_urls %>% html_nodes(".debate-speech__speaker__position") %>% html_text()
  parties <- append(parties, party)

  debate_urls <- read_html(i)
  contribution <- debate_urls %>% html_nodes(".debate-speech__speaker+ .debate-speech__content") %>% html_text()
  contributions <- append(contributions, contribution)

  debate_urls <- read_html(i)
  title <- debate_urls %>% html_node(".full-page__unit h1") %>% html_text()
  titles <- append(titles, rep(title, each = length(contribution)))

  debate_urls <- read_html(i)
  debatedate <- debate_urls %>% html_node(".time") %>% html_text()
  debatedates <- append(debatedates, rep(debatedate, each = length(contribution)))
}

debatedata <- data.frame(Title = titles, Date = debatedates, Speaker = speakerid, Party = parties, Utterance = contributions)
Note: debate_urls is a list of the urls of debate pages.
Any help on how to do any part of this more efficiently would be much appreciated!
You repeat
debate_urls <- read_html(i)
within each iteration of the loop. Each call to read_html requires reaching out to the internet and waiting for a response. Call it once at the start of the loop and then use "debate_urls" as a constant for the remaining part of the loop. Also, see the answer below concerning growing a vector within the loop. - Dave2e
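For reference, a minimal sketch of what the single-fetch approach suggested above could look like: each url is read once, all five nodes are extracted from the same parsed page, and one small data frame per page is built and bound together at the end instead of growing vectors inside the loop. The selectors are taken from the question; the helper name scrape_debate and the use of lapply/do.call are only illustrative assumptions, not the original code.

library(rvest)

#hypothetical helper: read each debate page once and pull all five nodes
#from the same parsed document (selectors copied from the question)
scrape_debate <- function(url) {
  page <- read_html(url)  #single network call per url

  contribution <- page %>%
    html_nodes(".debate-speech__speaker+ .debate-speech__content") %>%
    html_text()

  #assumes speaker, party and utterance nodes line up one-to-one,
  #as the original code already does; the length-1 title and date
  #are recycled across all rows for the page
  data.frame(
    Title     = page %>% html_node(".full-page__unit h1") %>% html_text(),
    Date      = page %>% html_node(".time") %>% html_text(),
    Speaker   = page %>% html_nodes(".debate-speech__speaker__name") %>% html_text(),
    Party     = page %>% html_nodes(".debate-speech__speaker__position") %>% html_text(),
    Utterance = contribution,
    stringsAsFactors = FALSE
  )
}

#one data frame per page, bound together once at the end
pages <- lapply(debate_urls$url, scrape_debate)
debatedata <- do.call(rbind, pages)

If network latency is still the bottleneck after that, the same per-url function could in principle be run in parallel (e.g. with future.apply or furrr), though be mindful of how many simultaneous requests the site will tolerate.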