I'm trying to debug a web scraper and am running into an encoding issue with Hadley's rvest
package.
As a reproducible example, consider the following two links:
library(rvest)
## This works:
read_html("http://clasificadosonline.com/UDRealEstateDetail.asp?ID=4234361")
## This gives me an error:
read_html("http://clasificadosonline.com/UDRealEstateDetail.asp?ID=4252734")
First link:
{xml_document}
<html>
[1] <head>\n<script type="text/javascript">\r\n\r\n\t\r\nif (screen.width <= 480) {\r\n\tdocument.location = "http://www.clasificado ...
[2] <body>\n<br><link href="StylesClas.css" rel="stylesheet" type="text/css">\n<!-- Google Tag Manager --><noscript><iframe src="//w ...
Second link:
> read_html("http://clasificadosonline.com/UDRealEstateDetail.asp?ID=4252734")
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, :
Input is not proper UTF-8, indicate encoding !
Bytes: 0xDA 0x4C 0x54 0x49 [9]
Inspecting the HTML for BOTH pages, I see the following:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
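For what it's worth, the four bytes quoted in the error message seem consistent with that charset. Here is a small offline check using only base R; the variable names are mine, and it only looks at those four bytes, not the full page:

```r
# The bytes reported in the error message: 0xDA 0x4C 0x54 0x49
bytes <- as.raw(c(0xDA, 0x4C, 0x54, 0x49))
raw_string <- rawToChar(bytes)

# 0xDA followed by an ASCII byte is not a valid UTF-8 sequence
is_utf8 <- validUTF8(raw_string)

# Interpreted as ISO-8859-1 (latin1), the same bytes are readable text
decoded <- iconv(raw_string, from = "latin1", to = "UTF-8")

print(is_utf8)   # FALSE
print(decoded)   # "ÚLTI"
```

So the page content itself looks like plain ISO-8859-1, as the meta tag declares, and the parser appears to be treating it as UTF-8.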
Why does one work, but the other doesn't?
I have tried wrapping x in read_html() with iconv(), as suggested in related questions, but it did not work.
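To make the attempt concrete, this is roughly the pattern I mean: a minimal sketch that uses a hypothetical in-memory stand-in for the page body (`page_raw`, made up for illustration) instead of the live URL. It shows two variants: re-encoding with iconv() before parsing, and passing the `encoding` argument that read_html() accepts.

```r
library(xml2)

# Hypothetical stand-in for the page body: ISO-8859-1 bytes,
# with 0xDA encoding the character "Ú" (not the real page content)
page_raw <- c(
  charToRaw("<html><body><p>"),
  as.raw(c(0xDA, 0x4C, 0x54, 0x49, 0x4D, 0x4F)),
  charToRaw("</p></body></html>")
)

# Variant 1: convert the text to UTF-8 with iconv(), then parse the string
page_txt <- iconv(rawToChar(page_raw), from = "ISO-8859-1", to = "UTF-8")
doc_iconv <- read_html(page_txt)

# Variant 2: parse the raw bytes and state the encoding explicitly
doc_enc <- read_html(page_raw, encoding = "ISO-8859-1")

text_iconv <- xml_text(xml_find_first(doc_iconv, "//p"))
text_enc <- xml_text(xml_find_first(doc_enc, "//p"))
```

For the live pages, the raw bytes would presumably come from something like httr::content(httr::GET(url), as = "raw") rather than a hand-built vector; that part is an assumption and not shown here.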
EDIT:
I am using the following packages:
rvest_0.3.2
xml2_1.2.0
httr_1.3.1
Any ideas?? Thanks!!