The docs are out of date or incomplete. Jsoup does use the meta charset tag, as well as the http-equiv Content-Type tag, to detect the charset. In the source, Jsoup.parse(File, String) looks like this:
public static Document parse(File in, String charsetName) throws IOException {
    return DataUtil.load(in, charsetName, in.getAbsolutePath());
}
DataUtil.load in turn calls parseByteData(...), which looks like this: (Source, scroll down)
//reads bytes first into a buffer, then decodes with the appropriate charset. done this way to support
// switching the chartset midstream when a meta http-equiv tag defines the charset.
// todo - this is getting gnarly. needs a rewrite.
static Document parseByteData(ByteBuffer byteData, String charsetName, String baseUri, Parser parser) {
    String docData;
    Document doc = null;
    if (charsetName == null) { // determine from meta. safe parse as UTF-8
        // look for <meta http-equiv="Content-Type" content="text/html;charset=gb2312"> or HTML5 <meta charset="gb2312">
        docData = Charset.forName(defaultCharset).decode(byteData).toString();
        doc = parser.parseInput(docData, baseUri);
        Element meta = doc.select("meta[http-equiv=content-type], meta[charset]").first();
        if (meta != null) { // if not found, will keep utf-8 as best attempt
            String foundCharset = null;
            if (meta.hasAttr("http-equiv")) {
                foundCharset = getCharsetFromContentType(meta.attr("content"));
            }
            if (foundCharset == null && meta.hasAttr("charset")) {
                try {
                    if (Charset.isSupported(meta.attr("charset"))) {
                        foundCharset = meta.attr("charset");
                    }
                } catch (IllegalCharsetNameException e) {
                    foundCharset = null;
                }
            }
(Snip...)
The following line from the snippet above shows that it does use either meta[http-equiv=content-type] or meta[charset] to detect the encoding, and otherwise falls back to UTF-8:
Element meta = doc.select("meta[http-equiv=content-type], meta[charset]").first();
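To make the "decode provisionally, look for a meta tag, then re-decode" idea concrete, here is a minimal sketch that uses only the JDK. It is not Jsoup's code: MetaCharsetSniffer, detect and decode are names I made up for illustration, and a regex stands in for the real HTML parse that Jsoup performs, so treat it as a rough approximation of the flow rather than the actual implementation.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup: a simplified stand-in for the
// two-pass approach described above.
final class MetaCharsetSniffer {
    // Assumption: a regex is good enough for a sketch. It matches both
    // <meta charset="x"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=x">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset declared in a meta tag, or UTF-8 if none is usable.
    static Charset detect(byte[] data) {
        // First pass: decode as UTF-8. The meta tag itself is normally plain
        // ASCII, so it survives even if the rest of the text is mis-decoded.
        String provisional = new String(data, StandardCharsets.UTF_8);
        Matcher m = META_CHARSET.matcher(provisional);
        if (m.find()) {
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalCharsetNameException e) {
                // Malformed charset name: ignore it and keep the fallback.
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Second pass: decode the same bytes again with the detected charset.
    static String decode(byte[] data) {
        return new String(data, detect(data));
    }

    public static void main(String[] args) {
        byte[] bytes = "<html><head><meta charset=\"ISO-8859-1\"></head><body>caf\u00e9</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect(bytes));
        System.out.println(decode(bytes));
    }
}
```

The point to take away is that the bytes are kept around so they can be decoded a second time once the declared charset is known, which is what the comment at the top of parseByteData describes.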
I'm not quite sure what you mean here, but no: the output charset setting controls which characters are escaped when the document's HTML / XML is printed to a string, whereas the input charset determines how the file is read.
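As a rough illustration of what "controls which characters are escaped" means, the JDK-only sketch below escapes any character the chosen output charset cannot encode. OutputEscaper and escapeFor are hypothetical names, the numeric-entity format is my own choice, and this is an assumption about the general idea, not Jsoup's actual escaping code (which, among other things, also handles characters such as < and &).

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Jsoup: shows how an output charset can
// decide which characters need to be written as entities.
final class OutputEscaper {
    static String escapeFor(String text, Charset outputCharset) {
        CharsetEncoder encoder = outputCharset.newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s); // representable in the output charset: keep as-is
            } else {
                // not representable: emit a numeric character reference
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeFor("caf\u00e9", StandardCharsets.US_ASCII));
        System.out.println(escapeFor("caf\u00e9", StandardCharsets.UTF_8));
    }
}
```

With an ASCII output charset the accented character has to be escaped, while with UTF-8 it can be written directly. Neither choice has any effect on how the input bytes were decoded.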
It will only ever remove meta[name=charset] items. Here is the method from the source that updates or removes the charset definition in the document: (Source, again scroll down)
private void ensureMetaCharsetElement() {
    if (updateMetaCharset) {
        OutputSettings.Syntax syntax = outputSettings().syntax();
        if (syntax == OutputSettings.Syntax.html) {
            Element metaCharset = select("meta[charset]").first();
            if (metaCharset != null) {
                metaCharset.attr("charset", charset().displayName());
            } else {
                Element head = head();
                if (head != null) {
                    head.appendElement("meta").attr("charset", charset().displayName());
                }
            }
            // Remove obsolete elements
            select("meta[name=charset]").remove();
        } else if (syntax == OutputSettings.Syntax.xml) {
(Snip..)
Essentially, when updateMetaCharset is enabled and you call charset(...) on an HTML document, it updates the existing meta charset tag if there is one, and otherwise adds one to the head. It does not touch the http-equiv tag.
If you want to find out whether the document specifies an encoding, just look for a meta http-equiv Content-Type tag or a meta charset tag. If neither is present, the document does not specify an encoding.
Jsoup is open source, so you can read the code yourself to see exactly how it works: https://github.com/jhy/jsoup/ (You can also modify it to do exactly what you want!)
I'll update this answer with further details when I have time. Let me know if you have any other questions.