I'm scraping this website using the "rvest" package. When I iterate my function too many times I get "Error in open.connection(x, "rb") : Timeout was reached". I have searched for similar questions, but the answers seem to lead to dead ends. I suspect that the problem is server side and that the website has a built-in restriction on how many times I can visit the page. How do I investigate this hypothesis?
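The closest I have come to checking this myself is the sketch below, which uses httr (not part of my scraping code) to log the raw HTTP status code and any Retry-After header the server sends back; a 429/503 status or a Retry-After value would point to server-side throttling. The 60 repetitions are an arbitrary choice on my part.

library(httr)
library(purrr)

check_status <- function(link) {
  # Request the page and record the status code plus any Retry-After header
  resp <- GET(link, timeout(10))
  ra <- headers(resp)[["retry-after"]]
  data.frame(
    status      = status_code(resp),
    retry_after = if (is.null(ra)) NA else ra,
    stringsAsFactors = FALSE
  )
}

status.log <- rep("http://karakterstatistik.stads.ku.dk/Histogram/NMAK13032E/Winter-2013/B2", 60) %>%
  map_df(check_status)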
The code: I have the links to the underlying web pages and want to construct a data frame with the information extracted from the associated pages. I have simplified my scraping function a bit, as the problem still occurs with the simpler function:
library(rvest)
library(stringr)

scrape_test = function(link) {
  # Split the URL on "/" and pull the course id and semester out of its path
  slit <- str_split(link, "/") %>%
    unlist()
  id <- slit[5]
  sem <- slit[6]
  # Read the page and extract the course name from the <h2> heading
  name <- link %>%
    read_html(encoding = "UTF-8") %>%
    html_nodes("h2") %>%
    html_text() %>%
    str_replace_all("\r\n", "") %>%
    str_trim()
  return(data.frame(id, sem, name))
}
I use map_df() from the purrr package to iterate the function over the links:
library(purrr)

test.data = links %>%
  map_df(scrape_test)
Now, if I iterate the function over only 50 links, I receive no error. But when I increase the number of links, I encounter the aforementioned error. Furthermore, I get the following warnings:
- "In bind_rows_(x, .id) : Unequal factor levels: coercing to character"
- "closing unused connection 4 (link)"
EDIT: The following code, which creates a links object, can be used to reproduce my results:
links <- c(rep("http://karakterstatistik.stads.ku.dk/Histogram/NMAK13032E/Winter-2013/B2", 100))
Have you tried adding Sys.sleep(x) (insert your x of choice) into your function to pause in between requests? – Weihuang Wong