HTMLCLEANER handle Spanish characters

Question

I am using the HtmlCleaner library to parse/convert HTML files in Java.

It seems that it is not able to handle Spanish characters like 'ÁáÉéÍíÑñÓóÚúÜü'.

Is there any property which I can set in HtmlCleaner for handling this or any other solution? Here's the code I'm using to invoke it:

```
CleanerProperties props = new CleanerProperties();
props.setRecognizeUnicodeChars(true);
java.io.File file = new java.io.File("C:\\example.html");
TagNode tagNode = new HtmlCleaner(props).clean(file);
```

I'm using UTF-8 when writing to a file. new PrettyHtmlSerializer(props).writeToFile(tagNode, filePath, "utf-8"); — choop
How's it being read back? Where do you actually see the errors? Can you verify that HtmlCleaner is actually reading the file as UTF-8? — Rup

Rup · Accepted Answer · 2012-04-25T15:00:27

HtmlCleaner uses the JVM's default character set unless you specify one. On Windows this will typically be Cp1252 (windows-1252), not UTF-8, which is probably where it's going wrong.
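
To make the symptom concrete, here is a minimal sketch of that kind of mismatch using only the JDK (no HtmlCleaner involved). It encodes a character as UTF-8 and then decodes the bytes as windows-1252, which is roughly what happens when a UTF-8 file is read with the wrong default. The class name `CharsetMismatchDemo` and the method `misread` are hypothetical names chosen for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of HtmlCleaner.
class CharsetMismatchDemo {

    // Encodes the text as UTF-8, then decodes those bytes with another charset,
    // mimicking a reader that assumes the wrong encoding for a UTF-8 file.
    static String misread(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    public static void main(String[] args) {
        // The charset the JVM falls back to when none is specified.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // "\u00f1" is n-with-tilde, written as an escape so the encoding of
        // this source file cannot affect the demo.
        String garbled = misread("\u00f1", Charset.forName("windows-1252"));

        // One input character comes back as two: the UTF-8 bytes C3 B1 are
        // each decoded as a separate windows-1252 character.
        System.out.println(garbled.length()); // prints 2
    }
}
```

If the first line printed is not UTF-8 on your machine, any read that relies on the default will garble accented characters in a UTF-8 file in this way.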

You can do any of the following:

- Specify -Dfile.encoding=UTF-8 on your JVM start line.
- Use the HtmlCleaner.clean() overload that accepts a character set name:
```
TagNode tagNode = new HtmlCleaner(props).clean(file, "UTF-8");
```
(if you've got Google Guava in the project you can use Charsets.UTF_8.name() instead of the string literal; on Java 7 and later, StandardCharsets.UTF_8.name() does the same without Guava)
- Use the HtmlCleaner.clean() overload that accepts a Reader, passing an InputStreamReader which you've already constructed with the correct character set.
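
The Reader-based option can be sketched as follows. This is a minimal example that uses only the JDK; the class name `Utf8ReaderSketch`, its helper methods, and the temporary file are hypothetical. The hand-off to HtmlCleaner appears only as a comment, since the sketch is written to run without the library on the classpath.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper class; not part of HtmlCleaner.
class Utf8ReaderSketch {

    // Opens the file with an explicit UTF-8 decoder, so the JVM default
    // charset plays no part in how the bytes are interpreted.
    static Reader openUtf8(File file) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8));
    }

    // Drains the reader into a String; used here only to check the decoding.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            sb.append(buffer, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 test file (escapes stand for accented letters).
        String html = "<p>\u00d1and\u00fa</p>";
        File file = File.createTempFile("example", ".html");
        file.deleteOnExit();
        Files.write(file.toPath(), html.getBytes(StandardCharsets.UTF_8));

        try (Reader reader = openUtf8(file)) {
            // With HtmlCleaner available, the reader would be handed over here:
            // TagNode tagNode = new HtmlCleaner(props).clean(reader);
            System.out.println(readAll(reader).equals(html)); // prints true
        }
    }
}
```

Building the reader yourself keeps the encoding decision in one visible place, which can be easier to audit than a JVM-wide flag.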
