6
votes

Info:

I've a program which generates XML sitemaps for Google Webmaster Tools (among other things).
GWTs is giving me errors for some sitemaps because the URLs contain character sequences like ã¾, ã‹, ã€, etc. **

GWTs says:

We require your Sitemap file to be UTF-8 encoded (you can generally do this when you save the file). As with all XML files, any data values (including URLs) must use entity escape codes for the characters: &, ', ", <, >.

The special characters are escaped in the XML files (with HTML entities).
XML file snippet:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://domain/folder/listing-&#227;&#129;.shtml</loc>
        ...

Are my URLs UTF-8 encoded?

If not, how do I do this in Java?
The following is the line in my program where I add the URL to the sitemap:

    siteMap.addUrl(StringEscapeUtils.escapeXml(countryName+"/"+twoCharFile.getRelativeFileName().toLowerCase()));

** = I'm not sure which ones are causing the error, probably the first two examples.

I apologize for all the editing.

4
I don't really understand your question. It seems as though you haven't HTML-escaped your data (regardless of using UTF-8). Are you escaping or not? - Assaf Lavie
I edited the question a lot. - Adam Lynch
Open your sitemap XML files in an editor that supports UTF-8 encoding (like Notepad++) for a quick test to determine whether your files are saved in the correct encoding. - Vineet Reynolds
@Vineet Done. Not certain where to look to see if the URLs are correctly UTF-8 encoded. I've supplied a snippet of the XML file. It looks like the characters have been escaped (with HTML entities). - Adam Lynch
The Encoding menu in Notepad++ shows the encoding currently in use. You could change the encoding of the file, but that is not the point; use the suggested approach to specify the encoding for the URL. Also make sure that you write the sitemap file using UTF-8 encoding (when you use the FileOutputStream class or a different class). - Vineet Reynolds

4 Answers

17
votes

Try using URLEncoder.encode(stringToBeEncoded, "UTF-8") to encode the URL.
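
A minimal sketch of that call, under a few assumptions. URLEncoder targets form encoding (application/x-www-form-urlencoded), so it turns spaces into "+" and would also encode "/". The hypothetical helper encodePath below therefore encodes each path segment separately and rewrites "+" as "%20". The class name, the helper names, and the sample path are illustrative and do not come from the question's program.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SitemapPathEncoder {

    // Percent-encodes a relative path segment by segment so that the
    // "/" separators survive. (Hypothetical helper, not from the question.)
    static String encodePath(String path) {
        String[] segments = path.split("/", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < segments.length; i++) {
            if (i > 0) {
                out.append('/');
            }
            out.append(encodeSegment(segments[i]));
        }
        return out.toString();
    }

    // URLEncoder emits "+" for a space; a literal "+" in the input has
    // already become "%2B", so replacing "+" here only affects spaces.
    static String encodeSegment(String segment) {
        try {
            return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u307e is a non-ASCII sample character (hiragana "ma").
        System.out.println(encodePath("japan/listing-\u307e.shtml"));
        // prints japan/listing-%E3%81%BE.shtml
    }
}
```

The result still has to be XML-escaped before it is written into the sitemap, as the question's code already does with StringEscapeUtils.escapeXml.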

2
votes

URLs must be percent-encoded as per the URI spec.

For example, the code point U+00e3 (ã) would become the encoded sequence %C3%A3.
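
To make the mechanics concrete, here is a small sketch that percent-encodes a string byte by byte from its UTF-8 form. It is an illustration only: the hypothetical method percentEncodeAllBytes escapes every byte, including ASCII ones that a real encoder would leave alone.

```java
import java.nio.charset.StandardCharsets;

class PercentEncodeDemo {

    // Illustration only: writes each UTF-8 byte as %XX, even bytes that
    // would not need escaping in a real URL.
    static String percentEncodeAllBytes(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            out.append(String.format("%%%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+00E3 is two bytes in UTF-8: 0xC3 0xA3.
        System.out.println(percentEncodeAllBytes("\u00e3")); // prints %C3%A3
    }
}
```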

When a URI is emitted in an XML document, it must conform to the markup requirements for XML.

For example, the URI http://foo/bar?a=b&x=%C3%A3 becomes http://foo/bar?a=b&amp;x=%C3%A3. The ampersand is an escape character in XML.
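
The two steps can be sketched together as follows. This is one possible approach, not the only one: it assumes java.net.URI is acceptable for the percent-encoding step (its multi-argument constructor plus toASCIIString() encodes non-ASCII characters as UTF-8 percent escapes) and uses a small hand-written XML escaper. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

class SitemapLocDemo {

    // Step 1: build the URI from its parts and percent-encode non-ASCII
    // characters. The path must start with "/" when an authority is given.
    static String toAsciiUrl(String scheme, String authority, String path, String query) {
        try {
            return new URI(scheme, authority, path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Step 2: escape the five XML special characters. "&" goes first so
    // the entities produced by the later replacements are not re-escaped.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Both steps, in the order the text above describes.
    static String toLoc(String scheme, String authority, String path, String query) {
        return escapeXml(toAsciiUrl(scheme, authority, path, query));
    }

    public static void main(String[] args) {
        System.out.println(toLoc("http", "foo", "/bar", "a=b&x=\u00e3"));
        // prints http://foo/bar?a=b&amp;x=%C3%A3
    }
}
```

The order matters: percent-encoding produces "%" sequences that XML leaves alone, whereas escaping first would put "&amp;" into the URL before it is encoded.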

You can find a detailed discussion of URI encoding here.

2
votes

Don't confuse percent-encoding of non-ASCII characters in URLs with XML entity escapes of characters in URLs. You need to do both when creating XML sitemaps.

Honestly, from reading your original post, something funky seems to be going on: the characters you mention look like the result of a failed encoding conversion :)

Are you sure those characters truly are part of your URLs when using UTF-8?
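
One way such sequences can arise, offered as a possible explanation rather than a diagnosis of the asker's program: text is encoded as UTF-8 and the bytes are then decoded with a single-byte charset such as ISO-8859-1. The sketch below uses a sample hiragana character (an assumption, the real file names are not shown in the question) and hypothetical helper names. The first two entities it produces, &#227;&#129;, are the same ones that appear in the question's XML snippet.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {

    // Encodes as UTF-8, then wrongly decodes each byte as one ISO-8859-1
    // character. This is the failed conversion being illustrated.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Renders every character as a decimal numeric entity so that
    // invisible control characters such as U+0081 can be seen.
    static String numericEntities(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            out.append("&#").append((int) s.charAt(i)).append(';');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+307E is the bytes E3 81 BE in UTF-8.
        String broken = misdecode("\u307e");
        System.out.println(numericEntities(broken)); // prints &#227;&#129;&#190;
    }
}
```

If this is what is happening, the fix belongs wherever the file names are read or converted, before any URL or XML encoding is applied.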

1
votes

All non-ASCII characters in a URL have to be percent-encoded ('x-url-encoding').

Here is the wiki link that explains it: http://en.wikipedia.org/wiki/Percent-encoding.

In addition, all XML special characters (&, >, <, etc.) have to be escaped.

Jai's answer shows the correct method to x-url-encode an arbitrary string. Note, however, that it does not do XML escaping.