When I scrape multiple pages in a for loop, I sometimes get the following error partway through, for example after scraping the 35th of 40 pages:
“Error in open.connection(x, "rb") : Timeout was reached”
Sometimes I also get this warning:
“In addition: Warning message: closing unused connection 3”
Below is a list of things I would like to clarify:
1) I have read that I might need to set the user agent explicitly. I tried that with:
read_html(curl('www.link.com', handle = curl::new_handle("useragent" = "Mozilla/5.0")))
but it did not change anything.
2) I noticed that when I turn on a VPN and change location, my scraping sometimes works without any error. Why is that?
3) I have also read that it might depend on the proxy. I would like to understand how and why.
4) In addition to the error, I would like to understand this warning, as it might be a clue to the cause of the error:
Warning message: closing unused connection 3
Does that mean that when I am web scraping, I should call some function at the end to close the connection?
I have already read the following posts on Stack Overflow, but none of them has a clear resolution:
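To make points 1) and 4) concrete, here is a minimal sketch of what explicitly closing the connection and retrying a failed page might look like. It is not code I currently run. The helper names `read_page` and `fetch_with_retry`, the 30-second timeout, and the retry count and wait are all assumptions chosen for illustration, and I do not know whether this addresses the underlying cause.

```r
# Hypothetical helper: read one page through an explicit curl connection
# and close that connection on exit, even when the read fails.
read_page <- function(url, timeout = 30) {
  # Same user agent as in the question, plus an assumed 30-second timeout.
  h <- curl::new_handle(useragent = "Mozilla/5.0", timeout = timeout)
  con <- curl::curl(url, handle = h)
  # read_html() may already have closed the connection, hence the try().
  on.exit(try(close(con), silent = TRUE), add = TRUE)
  xml2::read_html(con)
}

# Hypothetical helper: call `reader` up to `tries` times, waiting `wait`
# seconds between attempts, and stop with an error if every attempt fails.
fetch_with_retry <- function(url, tries = 3, wait = 2, reader = read_page) {
  for (i in seq_len(tries)) {
    result <- tryCatch(reader(url), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", i, " failed for ", url, ": ", conditionMessage(result))
    if (i < tries) {
      Sys.sleep(wait)
    }
  }
  stop("All ", tries, " attempts failed for ", url)
}
```

In the loop, each page would then be fetched with something like `page <- fetch_with_retry(urls[i])`, where `urls` stands in for the vector of page links.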
Iterating rvest scrape function gives: "Error in open.connection(x, "rb") : Timeout was reached"
rvest Error in open.connection(x, "rb") : Timeout was reached
Error in open.connection(x, "rb") : Couldn't connect to server