jsoup and character encoding

Question

I have a bunch of questions relating to jsoup's charset support, most of which are supported by quotes from the API docs:

jsoup.Jsoup:

public static Document parse(File in, String charsetName) ...
Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 ...

Does this mean the 'charset' meta-tag isn't used to detect the encoding?

jsoup.nodes.Document:

public void charset(Charset charset)
... This method is equivalent to OutputSettings.charset(Charset) but in addition ...

public Charset charset()
... This method is equivalent to Document.OutputSettings.charset().

Does this mean there isn't an "input charset" and "output charset", and that they are indeed the same setting?

jsoup.nodes.Document:

public void charset(Charset charset) ... Obsolete charset / encoding definitions are removed!

Will this remove the 'http-equiv' meta-tag in favor of the 'charset' meta-tag? For backwards compatibility, is there any way to keep both?

jsoup.nodes.Document.OutputSettings:

public Charset charset() Where possible (when parsing from a URL or File), the document's output charset is automatically set to the input charset. Otherwise, it defaults to UTF-8.

I need to know if the document hasn't specified an encoding*. Does this mean jsoup can't provide this information?

* instead of defaulting to UTF-8, I will run juniversalchardet.

JonasCz · Accepted Answer · 2016-01-30T09:37:37

The docs are out of date or incomplete. jsoup uses both the charset meta tag and the http-equiv meta tag to detect the charset. In the source, the method you quoted looks like this:

public static Document parse(File in, String charsetName) throws IOException {
    return DataUtil.load(in, charsetName, in.getAbsolutePath());
}

DataUtil.load in turn calls parseByteData(...), which looks like this:

// reads bytes first into a buffer, then decodes with the appropriate charset. done this way to support
// switching the charset midstream when a meta http-equiv tag defines the charset.
// todo - this is getting gnarly. needs a rewrite.
static Document parseByteData(ByteBuffer byteData, String charsetName, String baseUri, Parser parser) {
    String docData;
    Document doc = null;

    if (charsetName == null) { // determine from meta. safe parse as UTF-8
        // look for <meta http-equiv="Content-Type" content="text/html;charset=gb2312"> or HTML5 <meta charset="gb2312">
        docData = Charset.forName(defaultCharset).decode(byteData).toString();
        doc = parser.parseInput(docData, baseUri);
        Element meta = doc.select("meta[http-equiv=content-type], meta[charset]").first();
        if (meta != null) { // if not found, will keep utf-8 as best attempt
            String foundCharset = null;
            if (meta.hasAttr("http-equiv")) {
                foundCharset = getCharsetFromContentType(meta.attr("content"));
            }
            if (foundCharset == null && meta.hasAttr("charset")) {
                try {
                    if (Charset.isSupported(meta.attr("charset"))) {
                        foundCharset = meta.attr("charset");
                    }
                } catch (IllegalCharsetNameException e) {
                    foundCharset = null;
                }
            }

            (Snip...)

The following line from the snippet above shows that jsoup does use either meta[http-equiv=content-type] or meta[charset] to detect the encoding, and otherwise falls back to UTF-8:

Element meta = doc.select("meta[http-equiv=content-type], meta[charset]").first();

I'm not quite sure what you mean here, but no, they are not the same setting. The output charset controls which characters are escaped when the document's HTML / XML is printed to a string, whereas the input charset determines how the file is read.
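To make the two roles concrete, here is a minimal sketch that uses only the JDK, not jsoup. The class and method names (CharsetRoles, decode, escapeFor) are made up for illustration, and the escaping is a simplification of the idea, not jsoup's actual implementation: the input charset turns bytes into characters, and the output charset only decides which characters have to be written as escapes.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jsoup: it only illustrates the two roles.
class CharsetRoles {
    /** Input side: the charset decides how raw bytes become characters. */
    static String decode(byte[] bytes, Charset inputCharset) {
        return new String(bytes, inputCharset);
    }

    /** Output side: characters the output charset cannot encode become numeric references. */
    static String escapeFor(String text, Charset outputCharset) {
        CharsetEncoder encoder = outputCharset.newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s); // representable in the output charset, write as-is
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // "caf" followed by e-acute, stored as ISO-8859-1 bytes
        byte[] bytes = {'c', 'a', 'f', (byte) 0xE9};
        String text = decode(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(escapeFor(text, StandardCharsets.UTF_8));    // nothing needs escaping
        System.out.println(escapeFor(text, StandardCharsets.US_ASCII)); // prints caf&#xe9;
    }
}
```

The same decoded text produces different output depending only on the output charset, which is why changing the output setting does not change how the input was read.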

No, it will only ever remove meta[name=charset] elements. This is the method in the source that updates / removes the charset definition in the document:

private void ensureMetaCharsetElement() {
    if (updateMetaCharset) {
        OutputSettings.Syntax syntax = outputSettings().syntax();

        if (syntax == OutputSettings.Syntax.html) {
            Element metaCharset = select("meta[charset]").first();

            if (metaCharset != null) {
                metaCharset.attr("charset", charset().displayName());
            } else {
                Element head = head();

                if (head != null) {
                    head.appendElement("meta").attr("charset", charset().displayName());
                }
            }

            // Remove obsolete elements
            select("meta[name=charset]").remove();
        } else if (syntax == OutputSettings.Syntax.xml) {
            (Snip..)

Essentially, if you call charset(...) and the document does not have a charset meta tag, jsoup adds one; otherwise it updates the existing one. It does not touch the http-equiv tag, so both can coexist.

If you want to find out whether the document specifies an encoding, just look for a meta http-equiv Content-Type tag or a meta charset tag. If there are no such tags, the document does not specify an encoding.
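If you want to make that check on the raw bytes before deciding whether to fall back to a detector, something like the following might work. It is a rough, JDK-only sketch that does not use jsoup: DeclaredCharset, find and META_CHARSET are names I made up, and the regex is a crude approximation that can also match inside comments or scripts and will not see tags in a UTF-16 file. With jsoup itself you would run the selector quoted above on the parsed document instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
class DeclaredCharset {
    // Matches <meta ... charset=NAME ...>. This covers the HTML5 form and the
    // http-equiv form, where charset= sits inside the content attribute.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?\\s*([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in a meta tag, or empty if none is declared. */
    static Optional<String> find(byte[] html) {
        // ISO-8859-1 maps every byte to exactly one char, so decoding cannot fail
        // and ASCII markup survives whatever the real encoding is.
        String text = new String(html, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><title>x</title></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        // An empty result is the case where you would run your own detection.
        System.out.println(find(page).isPresent());
    }
}
```

If find returns empty, that is the point where you would hand the bytes to juniversalchardet; otherwise you can pass the declared name (or null) to Jsoup.parse as usual.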

jsoup is open source, so you can read the code yourself to see exactly how it works: https://github.com/jhy/jsoup/ (You can also modify it to do exactly what you want!)

I'll update this answer with further details when I have time. Let me know if you have any other questions.
