When I do some web scraping (using a for loop to scrape multiple pages), sometimes, after scraping the 35th of 40 pages, I get the following error:

“Error in open.connection(x, "rb") : Timeout was reached”

Sometimes I also receive this message:

“In addition: Warning message: closing unused connection 3”

Below is a list of things I would like to clarify:

1) I have read that I might need to define the user agent explicitly. I tried that with:

read_html(curl('www.link.com', handle = curl::new_handle("useragent" = "Mozilla/5.0")))

but it did not change anything.

2) I noticed that when I turn on a VPN and change location, my scraping sometimes works without any error. I would like to understand why.

3) I have also read that it might depend on the proxy. I would like to understand how and why.

4) In addition to the error, I would like to understand this warning, as it might be a clue to the cause of the error:

Warning message: closing unused connection 3

Does that mean that when I am web scraping I should call a function at the end to close the connection?

I have already read the following posts on Stack Overflow, but none of them has a clear resolution:

Iterating rvest scrape function gives: "Error in open.connection(x, "rb") : Timeout was reached"

rvest Error in open.connection(x, "rb") : Timeout was reached

Error in open.connection(x, "rb") : Couldn't connect to server


1 Answer

Did you try this?

https://stackoverflow.com/a/38463559

library(rvest)
url <- "http://google.com"
# Save the page to a local file first, then parse the file instead of a live connection
download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
content <- read_html("scrapedpage.html")
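
If the timeouts are intermittent, it may also help to retry a failed page after a pause. Below is a minimal sketch, not a definitive fix: `fetch_page` and `fetch_with_retry` are hypothetical helper names, and the retry count and wait times are assumptions you would need to tune. It reuses the `download.file` approach, so `read_html` only ever sees a local file and never an open connection object. That should also avoid the "closing unused connection" warning, which R emits when it garbage-collects a connection that was opened but never explicitly closed.

```r
# Hypothetical helper: download one page to a temporary file, then parse it.
# Requires the rvest package; it is only loaded when this function is called.
fetch_page <- function(url) {
  tmp <- tempfile(fileext = ".html")
  on.exit(unlink(tmp), add = TRUE)  # remove the temporary file afterwards
  download.file(url, destfile = tmp, quiet = TRUE)
  rvest::read_html(tmp)
}

# Hypothetical helper: call `fetch(url)` up to `max_tries` times,
# waiting a little longer after each failure (wait, 2 * wait, ...).
fetch_with_retry <- function(url, fetch = fetch_page, max_tries = 3, wait = 2) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch(url), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message(sprintf("Attempt %d of %d failed for %s: %s",
                    attempt, max_tries, url, conditionMessage(result)))
    if (attempt < max_tries) {
      Sys.sleep(wait * attempt)
    }
  }
  stop("All ", max_tries, " attempts failed for ", url, call. = FALSE)
}
```

In your loop you would call `fetch_with_retry(page_url)` in place of `read_html(...)`, ideally with a short `Sys.sleep()` between pages so the server is less likely to throttle you. `download.file` also honours `options(timeout = ...)` (in seconds) if the pages are simply slow to respond.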