1 vote

I'm trying to scrape a load of debates using rvest. The debates are on different webpages and I collect the urls for these webpages from a search result. There are over 1000 pages of search results, with 20,000 pages of debates (i.e. 20,000 urls).

My current approach successfully scrapes the data I need from the debate pages; however, for anything more than 20 pages of search results (i.e. only 400 of the 20,000 urls) the processing takes an extremely long time.

I'm currently using a for loop which iterates through my list of urls and scrapes 5 html nodes with the content I need (see below). This creates a vector for each node of content I'm scraping, which I then combine into a data frame for analysis. I think this approach means that I call each webpage 5 separate times for the different html nodes I need.

Is there any way to scrape this more efficiently? I'm sure there is a way I could do it so that it would scrape all 5 nodes with one call to each url rather than iterating 5 times. Also, would it be possible to populate the data frame dynamically in the for loop rather than storing 5 different vectors? Finally, could I use parallel processing to scrape multiple urls at the same time? I'm pretty much stumped.

library(rvest)

#create empty vectors to hold the scraped content
speakerid <- c()
parties <- c()
contributions <- c()
titles <- c()
debatedates <- c()

#for loop to scrape relevant content
for(i in debate_urls$url) { 

  debate_urls <- read_html(i)
  speaker <- debate_urls %>% html_nodes(".debate-speech__speaker__name") %>% html_text()
  speakerid <- append(speakerid, speaker)

  debate_urls <- read_html(i)
  party <- debate_urls %>% html_nodes(".debate-speech__speaker__position") %>% html_text()
  parties <- append(parties, party)

  debate_urls <- read_html(i)
  contribution <- debate_urls %>% html_nodes(".debate-speech__speaker+ .debate-speech__content") %>% html_text()
  contributions <- append(contributions, contribution)

  debate_urls <- read_html(i)
  title <- debate_urls %>%
    html_node(".full-page__unit h1") %>%
    html_text()
  titles <- append(titles, rep(title, each = length(contribution)))

  debate_urls <- read_html(i)
  debatedate <- debate_urls %>%
    html_node(".time") %>%
    html_text()
  debatedates <- append(debatedates, rep(debatedate, each = length(contribution)))
  }

debatedata <- data.frame(Title = titles, Date = debatedates, Speaker = speakerid, Party = parties, Utterance = contributions)

Note: debate_urls is a data frame whose url column holds the urls of the debate pages.

Any help on how to do any part of this more efficiently would be much appreciated!

One issue slowing down the process is the repeated call to debate_urls <- read_html(i) within each iteration of the loop. Each call to read_html requires reaching out to the internet and waiting for a response. Call it once at the start of the loop body and then reuse debate_urls for the rest of that iteration. Also, see the answer below concerning growing a vector within a loop. - Dave2e
Ah yes, that was silly, thanks for highlighting it. I've fixed it now. It probably helps a bit, but I'm not sure quite how much. - MCC89

1 Answer

0 votes

One thing that is definitely inefficient in there is continually growing the vectors. You know how many pages there are (length(debate_urls$url)), so you can set up containers of that length in advance; because each page returns several matches, lists (one slot per page) are the safer choice:

n <- length(debate_urls$url)
speakerid <- vector("list", n)
parties <- vector("list", n)
contributions <- vector("list", n)
titles <- vector("list", n)
debatedates <- vector("list", n)

Then your for loop does this (parsing into a new variable, page, so you don't overwrite debate_urls, which you still need for the indexing):

for(idx in seq_along(debate_urls$url)){
    i <- debate_urls$url[idx]

    page <- read_html(i)
    speaker <- page %>% html_nodes(".debate-speech__speaker__name") %>% html_text()
    speakerid[[idx]] <- speaker
    ...
}
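
Once the loop has finished, flatten each list with unlist() and build the data frame in one go (this assumes the elided part of the loop still repeats the title and date per contribution with rep(), as in your original code):

debatedata <- data.frame(
    Title = unlist(titles),
    Date = unlist(debatedates),
    Speaker = unlist(speakerid),
    Party = unlist(parties),
    Utterance = unlist(contributions)
)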

What I'm much less sure about is whether this has much of an effect compared to the scraping time.
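
On your other two questions: you can avoid juggling five separate vectors entirely by parsing each page once, building a small data frame for that page, and binding the pieces together at the end. The sketch below uses a hypothetical helper, scrape_debate(), and assumes your CSS selectors are correct and that each contribution has exactly one matching speaker and party node; dplyr::bind_rows() does the combining (do.call(rbind, ...) would work too).

library(rvest)
library(dplyr)

# hypothetical helper: download a single debate page once and return one data frame for it
scrape_debate <- function(url) {
    page <- read_html(url)

    speaker      <- page %>% html_nodes(".debate-speech__speaker__name") %>% html_text()
    party        <- page %>% html_nodes(".debate-speech__speaker__position") %>% html_text()
    contribution <- page %>% html_nodes(".debate-speech__speaker+ .debate-speech__content") %>% html_text()
    title        <- page %>% html_node(".full-page__unit h1") %>% html_text()
    debatedate   <- page %>% html_node(".time") %>% html_text()

    data.frame(
        Title     = rep(title, length(contribution)),
        Date      = rep(debatedate, length(contribution)),
        Speaker   = speaker,
        Party     = party,
        Utterance = contribution,
        stringsAsFactors = FALSE
    )
}

# one read_html() call per url; bind the per-page data frames into one
debatedata <- bind_rows(lapply(debate_urls$url, scrape_debate))

Once the per-page work is wrapped in a function like that, parallelising is mostly a matter of swapping lapply() for a parallel equivalent, for example parallel::mclapply() (which forks, so Unix/macOS only; on Windows you would need parLapply() with a cluster):

library(parallel)

pages <- mclapply(debate_urls$url, scrape_debate, mc.cores = 4)
debatedata <- bind_rows(pages)

How much the parallelism helps will depend on your connection and on whether the site throttles rapid requests, so it is worth timing both versions on a small batch of urls first.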