We recently ran into trouble determining the correct encoding for a page. We encountered a page with the following setup:

Response header:

Content-Type:text/html; charset=GBK

Meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The actual content is in GBK, and modern browsers are smart enough to use the right encoding for this page.

But for a crawler (using curl), we are forced to pick one charset value over the other. So my question is: Is taking header charset over meta charset the normal thing to do?

(Most content-based encoding detection algorithms we have tried are shaky at best. As long as one declared charset is more reliable than the other, we would rather use a declared charset than anything from our own encoding detection.)

1 Answer

Is taking header charset over meta charset the normal thing to do?

Yes. See the specification.

HTTP headers are checked at step 4. Meta isn't examined until step 5 (if it appears soon enough in the file) or step 9 (otherwise).
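
For a crawler, that precedence could be sketched roughly as below. This is a minimal sketch, not the specification's full algorithm: the function names (`charset_from_header`, `charset_from_meta`, `pick_charset`) are made up for illustration, the regexes are a simplification of the real meta prescan, the 1024-byte scan limit and the `utf-8` fallback are assumptions, and BOM handling is left out entirely.

```python
import re
from typing import Optional

# Simplified patterns (assumption): real-world parsing has more edge cases.
_HEADER_CHARSET = re.compile(r'charset\s*=\s*["\']?([^\s;"\']+)', re.IGNORECASE)
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    match = _HEADER_CHARSET.search(content_type)
    return match.group(1).lower() if match else None


def charset_from_meta(body: bytes, limit: int = 1024) -> Optional[str]:
    """Look for a meta charset declaration near the start of the raw bytes."""
    match = _META_CHARSET.search(body[:limit])
    return match.group(1).decode("ascii").lower() if match else None


def pick_charset(content_type: Optional[str], body: bytes,
                 default: str = "utf-8") -> str:
    """Prefer the HTTP header, then the meta tag, then a fallback."""
    return charset_from_header(content_type) or charset_from_meta(body) or default


# The page from the question: header says GBK, meta says utf-8.
header = "text/html; charset=GBK"
page = (b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
        + "中文".encode("gbk"))

charset = pick_charset(header, page)
text = page.decode(charset, errors="replace")
```

With the page from the question, the header wins and the body is decoded as GBK. The meta tag is only consulted when the header carries no charset parameter.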