3 votes

I'm scraping this website using the "rvest" package. When I iterate my function too many times, I get "Error in open.connection(x, "rb") : Timeout was reached". I have searched for similar questions, but the answers seem to lead to dead ends. I suspect the problem is server-side and that the website has a built-in restriction on how many times I can visit the page. How do I investigate this hypothesis?
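
One way I imagine checking this (just a sketch using the httr package, which my scraping code does not otherwise use) is to look at the raw HTTP response and see whether the server signals throttling; whether this site actually returns a 429 status or a Retry-After header is an assumption on my part:

library(httr)

# Sketch: inspect the raw response for signs of rate limiting.
resp <- GET("http://karakterstatistik.stads.ku.dk/Histogram/NMAK13032E/Winter-2013/B2")
status_code(resp)                # 429 ("Too Many Requests") would suggest throttling
headers(resp)[["retry-after"]]   # NULL unless the server sets this header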

The code: I have the links to the underlying web pages and want to construct a data frame with the information extracted from the associated pages. I have simplified my scraping function a bit, as the problem still occurs with the simpler function:

library(rvest)     # read_html(), html_nodes(), html_text()
library(stringr)   # str_split(), str_replace_all(), str_trim()
library(dplyr)     # the %>% pipe
library(purrr)     # map_df(), used below

scrape_test <- function(link) {

  # The course id and semester are the 5th and 6th elements of the URL path
  slit <- str_split(link, "/") %>%
    unlist()
  id  <- slit[5]
  sem <- slit[6]

  # Download the page and extract the course name from the <h2> heading
  name <- link %>%
    read_html(encoding = "UTF-8") %>%
    html_nodes("h2") %>%
    html_text() %>%
    str_replace_all("\r\n", "") %>%
    str_trim()

  return(data.frame(id, sem, name))
}

I use map_df() from the purrr package to iterate the function:

test.data <- links %>%
  map_df(scrape_test)

Now, if I iterate the function over only 50 links, I receive no error. But when I increase the number of links, I encounter the aforementioned error. Furthermore, I get the following warnings:

  • "In bind_rows_(x, .id) : Unequal factor levels: coercing to character"
  • "closing unused connection 4 (link)"

EDIT: The following code, which creates an object containing the links, can be used to reproduce my results:

links <- c(rep("http://karakterstatistik.stads.ku.dk/Histogram/NMAK13032E/Winter-2013/B2", 100))
I forgot a "link %>%" when I simplified my function directly on Stack Overflow, sorry about that. The problem lies in the repeated iterations: the error suddenly appears once too many links have been processed. – ScrapeGoat
Maybe try inserting a Sys.sleep(x) (with an x of your choice) into your function to pause between requests? – Weihuang Wong
Yes, that would definitely be a possible solution. How do I know what x should be? Are there any tricks, or is it trial and error? However, as I want to iterate over 1000 links to start with, and possibly 32000, I probably need another solution, since setting x = 2 would take a long while to run. – ScrapeGoat
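
To make the Sys.sleep() suggestion from the comments concrete, a minimal sketch could wrap the existing function in a short randomized pause (the scrape_test_polite name and the 1-2 second range are arbitrary choices, not values the site documents):

# Sketch: pause briefly before every request; the 1-2 second range is a guess.
scrape_test_polite <- function(link) {
  Sys.sleep(runif(1, min = 1, max = 2))
  scrape_test(link)
}

test.data <- links %>%
  map_df(scrape_test_polite)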

1 Answer

11 votes

With large scraping tasks I usually use a for-loop, which helps with troubleshooting. Start by creating an empty list for your output:

d <- vector("list", length(links))

Here I use a for-loop with a tryCatch block, so that if the output is an error, we wait a couple of seconds and try again. We also include a counter that moves on to the next link if we are still getting an error after five attempts. In addition, the if (!(links[i] %in% names(d))) check means that if we have to break the loop, we can skip the links we have already scraped when we restart it.

for (i in seq_along(links)) {
  # Skip links that are already in the output (useful after a restart)
  if (!(links[i] %in% names(d))) {
    cat(paste("Doing", links[i], "..."))
    ok <- FALSE
    counter <- 0
    # Retry each link up to five times before moving on
    while (ok == FALSE & counter < 5) {
      counter <- counter + 1
      out <- tryCatch({
                  scrape_test(links[i])
                },
                error = function(e) {
                  # On error, pause before the next attempt and return the
                  # error object so we can inspect it below
                  Sys.sleep(2)
                  e
                }
              )
      if ("error" %in% class(out)) {
        cat(".")
      } else {
        ok <- TRUE
        cat(" Done.")
      }
    }
    cat("\n")
    # Store the result (a data frame, or the last error) under the link's URL
    d[[i]] <- out
    names(d)[i] <- links[i]
  }
}
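
Once the loop finishes, d holds one entry per link, named by its URL. As a final step, here is a sketch of how d could be collapsed into a single data frame, skipping any links that still ended in an error (the failed helper vector is just an illustration):

# Sketch: keep only the successful results (data frames) and bind them together.
failed <- vapply(d, inherits, logical(1), what = "error")
test.data <- dplyr::bind_rows(d[!failed])
links[failed]   # any links that never succeeded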