3
votes

I am putting UTF-8 encoded data into a database table configured to use the utf8 charset, but when I then do a fulltext search it is not matching the word before non-breaking spaces.

For example, for formatting issues, we have a non-breaking space in hepatitis B. This string is not matched when searching for hepatitis.

CREATE TABLE `search` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `title` text COLLATE FULLTEXT KEY `title` (`title`),
  PRIMARY KEY (`id`),
  FULLTEXT KEY `title` (`title`),
) ENGINE=MyISAM AUTO_INCREMENT=202337 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

This query returns nothing:

SELECT 
  title, 
  MATCH(title) AGAINST ('hepatitis') AS `titleScore` 
FROM 
  `search` 
WHERE 
  MATCH(title) AGAINST ("hepatitis")
ORDER BY 
  `titleScore` DESC LIMIT 10;

But this query returns the following:

SELECT
  title
FROM
  search
WHERE
  title LIKE "%hepatitis%";

+-------------------------------------------------------------------------+
| title                                                                   |
+-------------------------------------------------------------------------+
| Comparison of drugs for chronic HBeAg-positive hepatitis B             |
| Antivirals in chronic hepatitis C                                      |
| Chronic hepatitis C                                                    |
| Antivirals for hepatitis C                                             |
| Antivirals for hepatitis B                                             |
| Other antivirals for hepatitis C                                       |
| Chronic hepatitis B                                                    |
| Hepatitis A vaccine                                                    |
| Hepatitis B vaccine                                                    |
| Hepatitis B immunoglobulin                                             |
| Hepatitis C virus protease inhibitors, see  HCV-protease inhibitors    |
+-------------------------------------------------------------------------+

According to "The Full-Text Stuff That We Didn't Put In The Manual" at http://ftp.nchu.edu.tw/MySQL/tech-resources/articles/full-text-revealed.html#breaking full-text should only match alphanumerics as word elements, and therefore break at the non-breaking space (although it doesn't explicitly state the non-breaking space character itself).

I did find a comment on the MySQL manual - http://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html

To make FULLTEXT MATCH work with Japanese UTF-8 text, be careful that words from your Japanese text be separated by the ASCII space character, not Japanese UTF-8 (or other) spacing characters. (when using phpMyAdmin to manage data/ write a SQL query, you must switch away from your Japanese IME to insert a space char...)

I've created a new collation, following the MySQL manual, with the following rules:

<charset name="utf8">
  ...
  <collation name="utf8_custom" id="1001">
    <rules>
      <reset>\u0020</reset> <!-- ascii space character -->
      <i>\u00A0</i>         <!-- non-breaking space -->
      <reset>A</reset>      <!-- test -->
      <i>B</i>
    </rules>
  </collation>
</charset>

I restarted the server and then confirmed the collation was available with show collation like 'utf8_custom';

I then altered the table to use the new collation and rebuilt the indices using repair table for good measure.

SELECT title FROM search WHERE "Hepatitis A vaccine"; still does not return results

SELECT title FROM search WHERE "Hepatitis A vaccine"; does return results - two, in fact:

 +------------------------+
 | title                  |
 +------------------------+
 | Hepatitis A vaccine   |
 | Hepatitis B vaccine   |
 +------------------------+

This shows that the rule in the collation making B identical to A is being honoured, but the non-breaking space is not.

The  is bothering me - my table is utf8, and my client is utf8, and the source data is utf8. I'm not sure I should be seeing this character.

1
possible duplicate of MySQL Full-Text search for hashtags (including the # symbol in index) (the other question asked how to treat a word separator as a regular character, which is the exact opposite of your question).RandomSeed
nice, thanks - I think creating a new collation is probably the way we'll go but with a looming deadline I've just filtered non-ascii characters from the search corpusHorusKol

1 Answers

1
votes

The problem is the step where the search data is being written to the database - I had to issue a SET NAMES "utf8" (or the Zend/PDO equivalent) to make sure that the utf8 string being sent to the utf8 table was transported as utf8.

Adding a parameter to my database config in my Zend application.ini for charset = 'utf8' solved this.