Firstly, Windows-1252 is not a subset of UTF-8. You could argue that ASCII is a subset of UTF-8, but that is usually more of an ideological debate.
Secondly, it is impossible to handle strings with both CP1252 and UTF-8 "characters" in them (really for CP1252 it's a byte and for Unicode it's a code point). Either you try to read it as CP1252, and see all the Unicode characters as single bytes, or you read it as UTF-8 and it cuts out any invalid byte sequences (or creates random characters if the CP1252 characters match a Unicode code point). You are not removing the test character with $c = mb_strcut($c, 1);
, you are removing a question mark created by mb_convert_encoding because it could not convert that Unicode character into a CP1252 character.
Thirdly, you should never convert a String, and then after the fact try to determine the encoding. After you converted your second test string, it was ?COD?
. There is no reason to check if a Unicode character exists in it, because you converted it to CP1252. There can't be Unicode characters in it. As the programmer, you have to know what the output is.
The only solution is to check if the string is CP1252, convert the offending characters to placeholders, and then convert that string to Unicode:
function convert_cp1252_to_utf8($input, $default = '', $replace = array()) {
if ($input === null || $input == '') {
return $default;
}
// https://en.wikipedia.org/wiki/UTF-8
// https://en.wikipedia.org/wiki/ISO/IEC_8859-1
// https://en.wikipedia.org/wiki/Windows-1252
// http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
$encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
/*
* Use the search/replace arrays if a character needs to be replaced with
* something other than its Unicode equivalent.
*/
/*$replace = array(
128 => "€", // http://www.fileformat.info/info/unicode/char/20AC/index.htm EURO SIGN
129 => "", // UNDEFINED
130 => "‚", // http://www.fileformat.info/info/unicode/char/201A/index.htm SINGLE LOW-9 QUOTATION MARK
131 => "ƒ", // http://www.fileformat.info/info/unicode/char/0192/index.htm LATIN SMALL LETTER F WITH HOOK
132 => "„", // http://www.fileformat.info/info/unicode/char/201e/index.htm DOUBLE LOW-9 QUOTATION MARK
133 => "…", // http://www.fileformat.info/info/unicode/char/2026/index.htm HORIZONTAL ELLIPSIS
134 => "†", // http://www.fileformat.info/info/unicode/char/2020/index.htm DAGGER
135 => "‡", // http://www.fileformat.info/info/unicode/char/2021/index.htm DOUBLE DAGGER
136 => "ˆ", // http://www.fileformat.info/info/unicode/char/02c6/index.htm MODIFIER LETTER CIRCUMFLEX ACCENT
137 => "‰", // http://www.fileformat.info/info/unicode/char/2030/index.htm PER MILLE SIGN
138 => "Š", // http://www.fileformat.info/info/unicode/char/0160/index.htm LATIN CAPITAL LETTER S WITH CARON
139 => "‹", // http://www.fileformat.info/info/unicode/char/2039/index.htm SINGLE LEFT-POINTING ANGLE QUOTATION MARK
140 => "Œ", // http://www.fileformat.info/info/unicode/char/0152/index.htm LATIN CAPITAL LIGATURE OE
141 => "", // UNDEFINED
142 => "Ž", // http://www.fileformat.info/info/unicode/char/017d/index.htm LATIN CAPITAL LETTER Z WITH CARON
143 => "", // UNDEFINED
144 => "", // UNDEFINED
145 => "‘", // http://www.fileformat.info/info/unicode/char/2018/index.htm LEFT SINGLE QUOTATION MARK
146 => "’", // http://www.fileformat.info/info/unicode/char/2019/index.htm RIGHT SINGLE QUOTATION MARK
147 => "“", // http://www.fileformat.info/info/unicode/char/201c/index.htm LEFT DOUBLE QUOTATION MARK
148 => "”", // http://www.fileformat.info/info/unicode/char/201d/index.htm RIGHT DOUBLE QUOTATION MARK
149 => "•", // http://www.fileformat.info/info/unicode/char/2022/index.htm BULLET
150 => "–", // http://www.fileformat.info/info/unicode/char/2013/index.htm EN DASH
151 => "—", // http://www.fileformat.info/info/unicode/char/2014/index.htm EM DASH
152 => "˜", // http://www.fileformat.info/info/unicode/char/02DC/index.htm SMALL TILDE
153 => "™", // http://www.fileformat.info/info/unicode/char/2122/index.htm TRADE MARK SIGN
154 => "š", // http://www.fileformat.info/info/unicode/char/0161/index.htm LATIN SMALL LETTER S WITH CARON
155 => "›", // http://www.fileformat.info/info/unicode/char/203A/index.htm SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
156 => "œ", // http://www.fileformat.info/info/unicode/char/0153/index.htm LATIN SMALL LIGATURE OE
157 => "", // UNDEFINED
158 => "ž", // http://www.fileformat.info/info/unicode/char/017E/index.htm LATIN SMALL LETTER Z WITH CARON
159 => "Ÿ", // http://www.fileformat.info/info/unicode/char/0178/index.htm LATIN CAPITAL LETTER Y WITH DIAERESIS
);*/
if (count($replace) != 0) {
$find = array();
foreach (array_keys($replace) as $key) {
$find[] = chr($key);
}
$input = str_replace($find, array_values($replace), $input);
}
/*
* Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
* and control characters, always convert from Windows-1252 to UTF-8.
*/
$input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
if (count($replace) != 0) {
$input = html_entity_decode($input);
}
}
return $input;
}
The trick is that you have to check for both ISO-8859-1
and CP1252
because they are so similar. I found this out the hard way after hours of playing around with this function, only to have this answer save me. If you found this function helpful, go +1 that answer.
Basically, this function replaces all those bad CP1252 bytes with HTML entities representing the Unicode characters. We then convert the string from ISO-8859-1
/CP1252
to UTF-8
, while none of our new Unicode characters are mangled because they are simple ASCII characters. Finally, we decode the HTML entities and finally have a 100% Unicode string.