I'm trying to debug a web scraper and am running into an encoding issue with Hadley's rvest package.

As a reproducible example, consider the following two links:

library(rvest)

## This works:
read_html("http://clasificadosonline.com/UDRealEstateDetail.asp?ID=4234361")

## This gives me an error:
read_html("http://clasificadosonline.com/UDRealEstateDetail.asp?ID=4252734")

First link:

{xml_document}
<html>
[1] <head>\n<script type="text/javascript">\r\n\r\n\t\r\nif (screen.width <= 480) {\r\n\tdocument.location = "http://www.clasificado ...
[2] <body>\n<br><link href="StylesClas.css" rel="stylesheet" type="text/css">\n<!-- Google Tag Manager --><noscript><iframe src="//w ...

Second link:

> read_html("http://clasificadosonline.com/UDRealEstateDetail.asp?ID=4252734")
Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  Input is not proper UTF-8, indicate encoding !
Bytes: 0xDA 0x4C 0x54 0x49 [9]

Inspecting the HTML for BOTH pages, I see the following:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Why does one work, but the other doesn't?

I have tried wrapping the x argument of read_html() in iconv(), as shown in the following related questions, but it did not work:

  1. R: rvest - is not proper UTF-8, indicate encoding?
  2. encoding error with read_html

EDIT:

I am using the following packages:

  • rvest_0.3.2
  • xml2_1.2.0
  • httr_1.3.1

Any ideas?? Thanks!!

1 Answer

Use

read_html("http://clasificadosonline.com/UDRealEstateDetail.asp?ID=4252734",
    encoding="iso-8859-1")

That is the encoding the document itself declares. The problem with declaring the encoding in a meta tag is that R has to start reading the file before it can see that tag, and if it starts out assuming the wrong encoding, it can't read the file.
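
To see why the parser complains, it may help to look at the four bytes quoted in the error message (0xDA 0x4C 0x54 0x49) on their own. The following is only a minimal base-R sketch, not a description of what rvest or xml2 do internally; the variable names are made up for illustration. It checks whether those bytes form valid UTF-8 and then reinterprets them as ISO-8859-1:

```r
# The four bytes reported in the error message.
bytes <- as.raw(c(0xDA, 0x4C, 0x54, 0x49))

# Treat the bytes as a string without declaring any encoding.
x <- rawToChar(bytes)

# 0xDA starts a two-byte UTF-8 sequence, but 0x4C ("L") is not a valid
# continuation byte, so the string is not valid UTF-8.
is_utf8 <- validUTF8(x)

# Reinterpreting the same bytes as ISO-8859-1 and converting to UTF-8
# succeeds: 0xDA is "Ú" (U+00DA) in that encoding.
fixed <- iconv(x, from = "ISO-8859-1", to = "UTF-8")

print(is_utf8)
print(validUTF8(fixed))
print(utf8ToInt(fixed))  # code points of the converted string
```

The same bytes are fine once the right encoding is named, which is what passing encoding="iso-8859-1" to read_html() does for the whole page. Presumably the first link works because its content happens not to contain a byte sequence like this one, but that is a guess rather than something verified here.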