1
votes

I have a CCK type for storing mentions (Social Media search mentions). Some of the mentions I believe are ASCII (My knowledge of this stuff is little).

I retrieve data from API's, which I then using node_save to save to Drupal.

My question is, what should I use to safely convert whatever I am getting into a format Drupal and MySQL are happy with?

The particular db_query error I get is unhelpfull "Warning in test1\includes\common.inc on line 3538". Nice. I have traced it to be encoding, as I used the following code to make the input safe, but it is not working with all input.

$node->title = htmlentities($item['title'], ENT_COMPAT, 'UTF-8');

It worked well for some ASCII characters, like those square ones [] etc, but not for this "行けなくてもずっとユーミンは聴きつづけます".

I'm really stuck. :(

UPDATE: The EXACT error I get from PHP is "Warning in D:\sites\test1\includes\common.inc on line 3538", and the line reads "if (db_query($query, $values)) {".

UPDATE 2: I've confirmed that the encoding of the data I am receiving is UTF8. This really doesn't make sense now, and I've confirmed that the collation in the db is utf8_general_ci.

UPDATE 3: One of the title's is: How Much Does A Facebook Fan Cost?� $1.07

The output of:

var_export(array_map('ord', str_split($node->title))

gave me the character 160 for the funny question mark (which is a square like [] in eclipse).

UPDATE 4: MySQL version is 5.1.41, and the collation on the columns is utf8_general_ci.

UPDATE 5: I managed to get Drupal to print the query with db_queryd. Funny thing is now I get the exact error message and not "Warning in", but Drupal still doesn't have this error in the log! WTF. So the exact sql is:

INSERT INTO node (vid, type, language, title, uid, status, created, changed, comment, promote, moderate, sticky, tnid, translate) VALUES (0, 'sm_mention', '', 'How Much Does A Facebook Fan Cost?� $1.07 (Geoffrey A. Fowler/Digits)', 1, 1, 1298395302, 1298395302, 0, 0, 0, 0, 0, 0)

And the error given is: Incorrect string value: '\xA0 $1.0...' for column 'title' at row 1

This honestly sounds like something doesn't like extended ascii characters.

UPDATE 6:

 SHOW CREATE TABLE node: 

   CREATE TABLE `node` (
  `nid` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `vid` int(10) unsigned NOT NULL DEFAULT '0',
  `type` varchar(32) NOT NULL DEFAULT '',
  `language` varchar(12) NOT NULL DEFAULT '',
  `title` varchar(255) NOT NULL DEFAULT '',
  `uid` int(11) NOT NULL DEFAULT '0',
  `status` int(11) NOT NULL DEFAULT '1',
  `created` int(11) NOT NULL DEFAULT '0',
  `changed` int(11) NOT NULL DEFAULT '0',
  `comment` int(11) NOT NULL DEFAULT '0',
  `promote` int(11) NOT NULL DEFAULT '0',
  `moderate` int(11) NOT NULL DEFAULT '0',
  `sticky` int(11) NOT NULL DEFAULT '0',
  `tnid` int(10) unsigned NOT NULL DEFAULT '0',
  `translate` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`nid`),
  UNIQUE KEY `vid` (`vid`),
  KEY `node_changed` (`changed`),
  KEY `node_created` (`created`),
  KEY `node_moderate` (`moderate`),
  KEY `node_promote_status` (`promote`,`status`),
  KEY `node_status_type` (`status`,`type`,`nid`),
  KEY `node_title_type` (`title`,`type`(4)),
  KEY `node_type` (`type`(4)),
  KEY `uid` (`uid`),
  KEY `tnid` (`tnid`),
  KEY `translate` (`translate`)
) ENGINE=InnoDB AUTO_INCREMENT=1700 DEFAULT CHARSET=utf8
1
In order to answer this question you need to post the exact error and also whatever is on line 3538 'cos it's hard to guess... it's somewhere in drupal_write_record, isn't it.chx
yeah... that is the EXACT error. I was shocked myself. The line on 3538 reads: if (db_query($query, $values)) {lordg
file_put_contents('/tmp/log', var_export(array_map('ord', str_split($node->title)), TRUE)); or something like that. copy-paste is very unlikely to work... having the characters themselves would help. Then we can begin to look for the problem. Also give us MySQL version and check your tables and columns to have some utf8 collation.chx
Thanks chx, I've updated the question.lordg
Do you have a virtual machine with ubuntu, or some other modern Linux version around for testing? I have seen similar issues arraise on windows machines several times. Without finding a proper solution, to be fair; the only solution then being "probably some windows incompatability, but we deploy on Linux, so let's not spend too much time debugging MS windows" :)berkes

1 Answers

3
votes

\xA0 is not a valid start of a UTF8 sequence.

The character known as NO-BREAK SPACE having the Unicode codepoint 0x00A0 should be encoded as 0xC2A0 in UTF8.

Thus said, your input string is broken, it's not a valid UTF8.