2
votes

I am trying to detect the encoding of a given string so that I can later convert it to UTF-8 using iconv. I want to restrict the set of source encodings to UTF-8, ISO-8859-1, Windows-1251 and CP437.

//...
$acceptedEncodings = array('utf-8',
    'iso-8859-1',
    'windows-1251'
);

$srcEncoding = mb_detect_encoding($content, $acceptedEncodings, true);

if($srcEncoding)
{
    $content = iconv($srcEncoding, 'UTF-8', $content);
}
//...

The problem is that mb_detect_encoding does not seem to accept CP437 as a supported encoding. When I give it a CP437-encoded string, the string is classified as iso-8859-1, which causes iconv to ignore characters like ü.

My question is: Is there a way to detect CP437 encoding earlier? The conversion from CP437 to UTF-8 using iconv works fine but I just cannot find the proper way to detect CP437.

Thank you very much.

1
Where does the string come from? If it comes from a web server then you can read the (optional) charset portion of the Content-Type header. - dotancohen
It's the content of an uploaded .txt file - dncolomer
Then you need an out-of-band method to specify (not detect) the encoding. The only 'automatic' method would be heuristics, and if the file is using any of the non-ASCII codes then it is hopeless. If the file is using only the ASCII codes, then you could just parse it as ASCII. - dotancohen

1 Answer

5
votes

As has been discussed countless times before: it is fundamentally impossible to distinguish any single-byte encoding from any other single-byte encoding. What you get is a bunch of bytes. In encoding A the byte 0x42 may map to character X, and in encoding B the same byte may map to character Y. But nothing about the blob of bytes you have tells you which one applies, because you only have the bytes. They can mean anything. They're equally valid in all encodings.

It's possible to identify more complex multi-byte encodings like UTF-8, since they need to follow stricter internal rules. So it's possible to say definitively "This is not valid UTF-8". However, it is impossible to say with 100% certainty "This is definitely UTF-8, not ISO-8859".
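To make this concrete, here is a minimal sketch using the byte 0x81. It assumes the iconv and mbstring extensions are available and that the underlying iconv implementation knows the name CP437 (as it does in the question). The variable names are only for illustration.

```php
<?php
// One byte, no metadata. 0x81 is "ü" in CP437,
// but a C1 control character (U+0081) in ISO-8859-1.
$bytes = "\x81";

// Both conversions succeed; neither encoding rejects the byte.
$asCp437  = iconv('CP437', 'UTF-8', $bytes);
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $bytes);

echo bin2hex($asCp437), "\n";   // c3bc (UTF-8 for U+00FC, "ü")
echo bin2hex($asLatin1), "\n";  // c281 (UTF-8 for U+0081, a control character)

// UTF-8 has structural rules: a lone 0x81 is a continuation byte
// without a lead byte, so it can be ruled out with certainty.
$isValidUtf8 = mb_check_encoding($bytes, 'UTF-8');
var_dump($isValidUtf8);         // bool(false)
```

The same input yields two different but equally "successful" results for the single-byte encodings, while the UTF-8 check can at least give a definite "no".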

You need metadata about the content you receive that tells you what encoding the content is in. It's not practical to identify the encoding after the fact. You'd need to employ actual content analysis to figure out which encoding a piece of text makes the most sense in.
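If you want to try content analysis anyway, the following is a rough sketch of what it could look like, not a reliable detector. The function name guessEncoding and the scoring rule (reward letters, penalise control characters) are assumptions made up for this illustration. Ties and wrong guesses are entirely possible, for example when the same byte happens to be a letter in several candidates.

```php
<?php
/**
 * Hypothetical helper: guess an encoding by scoring how "text-like"
 * the bytes look under each candidate. This is a heuristic only.
 */
function guessEncoding(string $bytes, array $singleByteCandidates): ?string
{
    // UTF-8 has strict structural rules, so validity is a strong signal.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }

    $best = null;
    $bestScore = PHP_INT_MIN;
    foreach ($singleByteCandidates as $encoding) {
        $decoded = @iconv($encoding, 'UTF-8', $bytes);
        if ($decoded === false) {
            continue; // not convertible under this candidate
        }
        // Count letters, and control characters other than tab/CR/LF.
        $letters    = preg_match_all('/\p{L}/u', $decoded);
        $controls   = preg_match_all('/\p{Cc}/u', $decoded);
        $whitespace = preg_match_all('/[\t\r\n]/', $decoded);
        $score = $letters - 10 * ($controls - $whitespace);

        // On a tie, the earlier candidate wins.
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $encoding;
        }
    }
    return $best;
}

// "für" with ü stored as the CP437 byte 0x81.
$guess = guessEncoding("f\x81r", ['ISO-8859-1', 'CP437']);
echo $guess, "\n"; // CP437
```

Here ISO-8859-1 loses only because 0x81 decodes to a control character. With other bytes, or with more candidates in the list, several encodings can score the same, which is why out-of-band metadata remains the only dependable answer.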