I am putting UTF-8 encoded data into a database table configured to use the utf8 charset, but when I then do a fulltext search it is not matching the word before non-breaking spaces.
For example, for formatting issues, we have a non-breaking space in hepatitis B. This string is not matched when searching for hepatitis.
CREATE TABLE `search` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`title` text COLLATE FULLTEXT KEY `title` (`title`),
PRIMARY KEY (`id`),
FULLTEXT KEY `title` (`title`),
) ENGINE=MyISAM AUTO_INCREMENT=202337 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
This query returns nothing:
SELECT
title,
MATCH(title) AGAINST ('hepatitis') AS `titleScore`
FROM
`search`
WHERE
MATCH(title) AGAINST ("hepatitis")
ORDER BY
`titleScore` DESC LIMIT 10;
But this query returns the following:
SELECT
title
FROM
search
WHERE
title LIKE "%hepatitis%";
+-------------------------------------------------------------------------+
| title |
+-------------------------------------------------------------------------+
| Comparison of drugs for chronic HBeAg-positive hepatitis B |
| Antivirals in chronic hepatitis C |
| Chronic hepatitis C |
| Antivirals for hepatitis C |
| Antivirals for hepatitis B |
| Other antivirals for hepatitis C |
| Chronic hepatitis B |
| Hepatitis A vaccine |
| Hepatitis B vaccine |
| Hepatitis B immunoglobulin |
| Hepatitis C virus protease inhibitors, see HCV-protease inhibitors |
+-------------------------------------------------------------------------+
According to "The Full-Text Stuff That We Didn't Put In The Manual" at http://ftp.nchu.edu.tw/MySQL/tech-resources/articles/full-text-revealed.html#breaking full-text should only match alphanumerics as word elements, and therefore break at the non-breaking space (although it doesn't explicitly state the non-breaking space character itself).
I did find a comment on the MySQL manual - http://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html
To make FULLTEXT MATCH work with Japanese UTF-8 text, be careful that words from your Japanese text be separated by the ASCII space character, not Japanese UTF-8 (or other) spacing characters. (when using phpMyAdmin to manage data/ write a SQL query, you must switch away from your Japanese IME to insert a space char...)
I've created a new collation, following the MySQL manual, with the following rules:
<charset name="utf8">
...
<collation name="utf8_custom" id="1001">
<rules>
<reset>\u0020</reset> <!-- ascii space character -->
<i>\u00A0</i> <!-- non-breaking space -->
<reset>A</reset> <!-- test -->
<i>B</i>
</rules>
</collation>
</charset>
I restarted the server and then confirmed the collation was available with show collation like 'utf8_custom';
I then altered the table to use the new collation and rebuilt the indices using repair table for good measure.
SELECT title FROM search WHERE "Hepatitis A vaccine";
still does not return results
SELECT title FROM search WHERE "Hepatitis A vaccine";
does return results - two, in fact:
+------------------------+
| title |
+------------------------+
| Hepatitis A vaccine |
| Hepatitis B vaccine |
+------------------------+
This shows that the rule in the collation making B identical to A is being honoured, but the non-breaking space is not.
The Â
is bothering me - my table is utf8, and my client is utf8, and the source data is utf8. I'm not sure I should be seeing this character.