3
votes

I am building a search engine for a site using Zend_Search_Lucene.

I am currently using Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive, which works fine except for one thing: it distinguishes between accented and unaccented characters.

In Google (and other search engines), when you search for "χιονι" it returns results for all variations of it, such as "χιόνι", which is the correctly accented form in Greek (χιόνι = snow, btw). In Lucene (in general, not only Zend_Search_Lucene) this is not the default or even a bundled behavior, from what I've seen.

My first attempt at a solution was to do what Lucene does for case-insensitive search: in the analyzer, remove accents from letters the same way case-insensitive analyzers simply lowercase everything during indexing and searching (i.e. $str = strtr($str, 'ό', 'ο')).

The only reason this failed is that PHP does not have an mb_strtr, and strtr called this way does not work for multibyte characters like these; preg_replace just didn't work either.

Is there a way to make Lucene search in an "accent-insensitive" mode (an analyzer, probably?), or alternatively a way to unaccent multibyte characters in PHP? (I did search for this too, with no results.)

Note that what I want to search for is not Western European accented characters, for which there are some unaccent solutions for PHP on the web.


1 Answer

2
votes

Have you tried normalizer_normalize to remove diacritics from the text? See: How to remove diacritics from text?
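
As a rough illustration of the normalizer_normalize idea, the sketch below decomposes the string (NFD) so each accented letter becomes a base letter plus combining marks, then strips the marks with a Unicode-aware regex. The function name strip_accents is made up for this example, and it assumes the intl extension is loaded and the input is valid UTF-8.

```php
<?php
// Hypothetical helper (not part of PHP or Zend Framework).
// Assumes the intl extension, which provides normalizer_normalize().
function strip_accents($text)
{
    // NFD splits e.g. "ό" into "ο" + U+0301 (combining acute accent).
    $decomposed = normalizer_normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // normalization failed (e.g. invalid UTF-8): leave input as-is
    }

    // Remove all combining marks (Unicode category Mn); the /u flag enables UTF-8 mode.
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);

    return ($stripped === null) ? $text : $stripped;
}

echo strip_accents('χιόνι'), "\n"; // prints "χιονι"
```

Note that this removes every combining mark, so a Greek dialytika (as in "ϊ") is dropped along with the tonos; whether that is desirable depends on how loose the matching should be.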

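You can also use $str = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $str); but note that the target there is ASCII, so the result cannot contain Greek letters. That variant is only useful for Latin-script text.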

You can then create a token filter (by extending Zend_Search_Lucene_Analysis_TokenFilter) to normalize your keywords.
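
A token filter along those lines might look like the sketch below. It is modeled on how I recall the bundled lowercase filter being written, so check the method names against your installed Zend Framework 1 version. The class name My_TokenFilter_StripAccents is hypothetical, and the code assumes ZF1 is on the include path and the intl extension is loaded.

```php
<?php
// Assumes Zend Framework 1 is on the include path.
require_once 'Zend/Search/Lucene/Analysis/Token.php';
require_once 'Zend/Search/Lucene/Analysis/TokenFilter.php';
require_once 'Zend/Search/Lucene/Analysis/Analyzer/Common/Utf8/CaseInsensitive.php';

// Hypothetical filter class: strips combining accents from each token.
class My_TokenFilter_StripAccents extends Zend_Search_Lucene_Analysis_TokenFilter
{
    public function normalize(Zend_Search_Lucene_Analysis_Token $srcToken)
    {
        $text = $srcToken->getTermText();

        // Decompose (NFD), then drop combining marks; fall back to the
        // original text if either step fails.
        $decomposed = normalizer_normalize($text, Normalizer::FORM_D);
        if ($decomposed !== false) {
            $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);
            if ($stripped !== null) {
                $text = $stripped;
            }
        }

        // Build a new token with the same offsets and position increment.
        $newToken = new Zend_Search_Lucene_Analysis_Token(
            $text,
            $srcToken->getStartOffset(),
            $srcToken->getEndOffset()
        );
        $newToken->setPositionIncrement($srcToken->getPositionIncrement());

        return $newToken;
    }
}

// Attach the filter to the analyzer already in use and make it the default.
$analyzer = new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive();
$analyzer->addFilter(new My_TokenFilter_StripAccents());
Zend_Search_Lucene_Analysis_Analyzer::setDefault($analyzer);
```

The same analyzer has to be the default both when indexing and when searching, so an existing index would need to be rebuilt after adding the filter.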

I don't know if it works for your encoding.