We recently ran into some trouble when trying to determine the correct encoding for a page. We encountered a page with the following setup:
header response:
Content-Type:text/html; charset=GBK
meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
The actual content is in GBK, and modern browsers are smart enough to use the right encoding for this page.
But for a crawler (using curl), we are forced to pick one charset value over the other. So my question is: is taking the header charset over the meta charset the normal thing to do?
(Most content-based encoding detection algorithms we have tried are shaky at best; as long as one declared charset is more reliable than the other, we prefer using the declared charset over anything from our own encoding detection.)
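To make the precedence concrete, here is a minimal sketch (in Python, using only the standard library) of the header-over-meta rule we are considering; the function name and fallback are just illustrative assumptions, not a claim about what crawlers should do:

```python
import re

def detect_charset(headers: dict, body_bytes: bytes) -> str:
    """Pick a charset, preferring the HTTP response header over the meta tag."""
    # 1. Look at the Content-Type response header, e.g. "text/html; charset=GBK"
    content_type = headers.get("Content-Type", "")
    m = re.search(r"charset=([\w-]+)", content_type, re.IGNORECASE)
    if m:
        return m.group(1)

    # 2. Fall back to a <meta http-equiv="Content-Type" ...> tag near the
    #    top of the body; only the first few KB need to be scanned.
    head = body_bytes[:4096].decode("ascii", errors="ignore")
    m = re.search(r'charset=["\']?([\w-]+)', head, re.IGNORECASE)
    if m:
        return m.group(1)

    # 3. Last-resort default when neither source declares a charset.
    return "utf-8"

# The page described above: header says GBK, meta says utf-8.
headers = {"Content-Type": "text/html; charset=GBK"}
body = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
print(detect_charset(headers, body))  # header wins: GBK
```

With this rule, the page above decodes as GBK (matching its actual content), but only because the header happens to be the correct one here; that is exactly the assumption I am asking about.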