16
votes

I'm trying to scrape the content from http://google.com, but I get an error message.

library(rvest)  
html("http://google.com")

Error in open.connection(x, "rb") : Timeout was reached
In addition: Warning message:
'html' is deprecated.
Use 'read_html' instead.
See help("Deprecated")

Since I'm on a company network, this may be caused by a firewall or proxy. I tried using set_config, but it didn't work.

5
Have you also tried the read_html command, since the error message says html is deprecated? This might not solve your problem, but maybe the output is more helpful. - drmariod
Yes, the message is: Error in open.connection(x, "rb") : Timeout was reached. In addition: Warning message: closing unused connection 3 (google.com) - user3267649
Actually, this code works fine on my home network, but when I run it on the company network, the error comes up. - user3267649
This doesn't seem reproducible as a code issue; it returns a result for me. If you figured out what was going on with the network and how to work around it, you could post that as an answer. - Sam Firke
Same issue for me. Apparently, on the network I am using, Google asks for proof of not being a bot, and the page of course times out when the scraper runs. - Dambo

5 Answers

33
votes

I encountered the same Error in open.connection(x, "rb") : Timeout was reached issue when working behind a proxy on the office network.

Here's what worked for me:

library(rvest)
url <- "http://google.com"
# Download the page to a local file first, then parse the local copy
download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
content <- read_html("scrapedpage.html")

Credit : https://stackoverflow.com/a/38463559
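
If you need this for more than one page, the same workaround can be wrapped in a small function. This is only a sketch: the name read_html_via_download is made up for illustration, and it assumes download.file can reach the URL on your network (which is the whole point of the workaround).

```r
library(rvest)

# Hypothetical helper: download to a temporary file, then parse the local copy.
read_html_via_download <- function(url) {
  tmp <- tempfile(fileext = ".html")
  on.exit(unlink(tmp))  # remove the temporary file when the function exits
  download.file(url, destfile = tmp, quiet = TRUE)
  read_html(tmp)
}

# Usage (needs network access):
# page <- read_html_via_download("http://google.com")
```

Using a temporary file avoids leaving scrapedpage.html behind in the working directory.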

7
votes

This is probably an issue with your call to read_html (or html in your case) not properly identifying itself to the server it's trying to retrieve content from, which is the default behaviour. Using curl, set a user agent on the handle passed to curl(), then give that connection to read_html so your scraper identifies itself.

library(rvest)
library(curl)
read_html(curl('http://google.com', handle = curl::new_handle("useragent" = "Mozilla/5.0")))
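
Since the question mentions a company proxy, here is a rough sketch of the same idea using httr, which lets you set a user agent, a longer timeout, and a proxy in one request. The function name fetch_page and its arguments are hypothetical, and the proxy host and port are placeholders you would have to get from your network administrator; I can't confirm this gets through any particular firewall.

```r
library(httr)
library(rvest)

# Hypothetical helper: fetch a page with a user agent, a timeout,
# and (optionally) a proxy, then parse the response body.
fetch_page <- function(url, proxy_host = NULL, proxy_port = NULL,
                       ua = "Mozilla/5.0", timeout_sec = 30) {
  # Use an empty config when no proxy is given
  proxy_cfg <- if (is.null(proxy_host)) config() else use_proxy(proxy_host, proxy_port)
  resp <- GET(url, user_agent(ua), timeout(timeout_sec), proxy_cfg)
  stop_for_status(resp)  # turn HTTP errors into R errors
  read_html(content(resp, as = "text", encoding = "UTF-8"))
}

# Usage (placeholder proxy values, needs network access):
# page <- fetch_page("http://google.com", proxy_host = "proxy.example.com", proxy_port = 8080)
```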

1
votes

I ran into this issue because my VPN was switched on. Immediately after turning it off, I re-tried, and it resolved the issue.

0
votes

I was facing a similar problem and a small hack solved it. There were 2 characters in the hyperlink that were causing the problem for me, so I replaced "è" with "e" and "é" with "e", and it worked. Just make sure the hyperlink is still valid after the replacement.
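
A minimal sketch of that replacement, assuming only those two characters are the problem. The helper name replace_accented_e and the example.com URL are made up for illustration.

```r
# Hypothetical helper: swap "è" (\u00e8) and "é" (\u00e9) for a plain "e".
replace_accented_e <- function(url) {
  gsub("[\u00e8\u00e9]", "e", url)
}

fixed_url <- replace_accented_e("http://example.com/caf\u00e9")

# Alternative (an assumption, not what I originally did): percent-encode
# the URL instead of changing its characters.
encoded_url <- utils::URLencode("http://example.com/caf\u00e9")
```

Percent-encoding may be the safer choice when the server actually expects the accented path, since replacing characters changes the address being requested.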

0
votes

I got the error message when my laptop was connected to my router over wifi, but my ISP was having some sort of outage:

read_html(brand_url)
Error in open.connection(x, "rb") : 
  Timeout was reached: [somewebsite.com.au] Operation timed out after 10024 milliseconds with 0 out of 0 bytes received

In the above case, my wifi was still connected to the modem, but pages wouldn't load via rvest (nor in a browser). It was temporary and lasted ~2 minutes.

May also be worth noting that a different error message is received when wifi is turned off entirely:

brand_page <- read_html(brand_url)
Error in open.connection(x, "rb") : 
  Could not resolve host: somewebsite.com.au
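
If you want a script to tell these two cases apart, one option is to catch the error and look at its message. This is a sketch based only on the two messages shown above; the function names are hypothetical and other libcurl versions may word the messages differently.

```r
# Hypothetical helper: map an error message to a rough cause.
classify_fetch_error <- function(msg) {
  if (grepl("Timeout was reached", msg, fixed = TRUE)) {
    "timeout"   # connected locally, but no response came back in time
  } else if (grepl("Could not resolve host", msg, fixed = TRUE)) {
    "dns"       # host name lookup failed, e.g. wifi turned off
  } else {
    "other"
  }
}

# Hypothetical wrapper: return NULL instead of stopping on a fetch error.
safe_read_html <- function(url) {
  tryCatch(
    rvest::read_html(url),
    error = function(e) {
      message("Fetch failed (", classify_fetch_error(conditionMessage(e)), "): ",
              conditionMessage(e))
      NULL
    }
  )
}
```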