2
votes

I need to save the user agent of a visiting user.

This is my snippet:

// User Agent
$ua = $_SERVER['HTTP_USER_AGENT']??'';
$ua_md5 = md5($ua);
// Search if the UA already exists in the user_agents table
$ua_id = $db->query("SELECT id FROM user_agents WHERE md5='".$ua_md5."';")->fetchColumn();

if(!$ua_id) {
    // If it doesn't exist, insert it and get its id
    $db->query("INSERT INTO user_agents (md5, user_agent) VALUES ('$ua_md5', ".$db->quote($ua).")");
    $ua_id = $db->lastInsertId();
}

I use PDO:quote instead of prepared statements only for performance reasons (it's faster and this script runs thousands of times per second).

It happens that some users has this user agent:

Mozilla/5.0 (Linux; Android 5.0; \xD6wn Smart Build/LRX21M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/37.0.0.0 Mobile Safari/537.36

And the insert fails for this error:

"PHP message: SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xD6wn Sm...' for column 'user_agent' at row 1" while reading response header from upstream

What is the reason and how can it be fixed?


Edit: more debugging found out the $ua value is:

Mozilla/5.0 (Linux; Android 5.0; ÖWN S1 Build/LRX21M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/37.0.0.0 Mobile Safari/537.36

1
Your table is in the wrong character encoding it sounds like. Also using parameterized queries is a better approach.chris85
You're already using PDO, why not make the extra step and move to prepared statements?Alon Eitan
In PHP, "\xD6" is a escape sequence so you don't have a literal backslash anywhere. You have a D6 byte that, in UTF-8, is not a valid character—by pure chance, MySQL converts it to hexadecimal in order to display it..Álvaro González
OK, that's different than your sample in the question then. Then let's see some debugging, preferably a var_dump of the query you're trying to execute with the supposedly quoted value.deceze
Ah wait. May be it's indeed a malformed utf. Well you can try to fix the string using iconvYour Common Sense

1 Answers

3
votes

First of all, when it comes to encoding issues it's always necessary to inspect raw bytes and never rely of the output of some process that renders or interpret the input. In the case of PHP, var_dump() is always a good starting point but you also need to dump to hex or even resort to an hexadecimal editor:

<?php
var_dump(bin2hex($_SERVER['HTTP_USER_AGENT']));

My educated guess so far (and I don't believe it's too far from truth) is:

  1. Some Android browser is sending an HTTP header that includes Öwn Smart Build encoded in ISO-8859-1, where Ö aka 'LATIN CAPITAL LETTER O WITH DIAERESIS' (U+00D6) is encoded as D6.

  2. Your app stack is configured for UTF-8 (the sensible choice), where Ö would be encoded as C396.

  3. PHP doesn't know/care, because PHP strings aren't encoding-aware (they're just byte streams).

  4. MySQL is handled D6, being told it's UTF-8 (but it isn't).

  5. Had D6 been a valid UTF-8 (different) character, or part of a (different) multi-byte sequence, insertion would have completed with minor data loss (the original Öwn text would have been lost and replaced by something else). For good or bad, it isn't valid UTF-8, thus MySQL aborts the insertion with the error message you describe.

Why can't MySQL cope with it? Let's check the UTF-8 definition:

Nr of
Bytes    Byte 1   Byte 2       Byte 3     Byte 4
1      0xxxxxxx       
2      110xxxxx   10xxxxxx     
3      1110xxxx   10xxxxxx   10xxxxxx   
4      11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

Original Latin-1 Öw text is encoded as D6 77 which, translated to binary, is:

11010110 01001101
^^^      ^^

In UTF-8, 110… means "Start of 2-byte character". Second byte should then start with 10… but we have 01… instead. Oops!

How can you solve this? It's trickier than it seems. If you could know for sure that input is ISO-8859-1, it's just a straightforward conversion:

<?php
$input = "\xD6wn";
$output = mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
var_dump(bin2hex($input), bin2hex($output));
string(6) "d6776e"
string(8) "c396776e"

(Online demo)

But, how can you know? I'm not sure if the User-Agent header allows MIME Encoded-Words and, even there, the browser may just send invalid data anyway. Perhaps you can catch the error (MySQL error code 1366 aka ER_TRUNCATED_WRONG_VALUE_FOR_FIELD looks reasonably precise) and try again assuming ISO-8859-1. And it's probably also a good idea to verify that input is valid UTF-8, though it can make processing kind of bothersome:

<?php
$latin1 = "\xD6wn";
$utf8 = "\xc3\x96wn";
var_dump(mb_check_encoding($latin1, 'UTF-8'));
var_dump(mb_check_encoding($utf8, 'UTF-8'));
bool(false)
bool(true)

(Online demo)