Ignoring diacritics when sorting facet values in Solr 4

Question

Tl;dr: How can I get Solr 4 to ignore diacritics when sorting facet values?

I've added the following four documents to the "collection1" Solr core in the default Solr example:

<doc>
  <field name="id">1</field>
  <field name="cat">manuka</field>
  <field name="cat">mystery</field>
</doc>
<doc>
  <field name="id">2</field>
  <field name="cat">mānuka</field>
  <field name="cat">stuff</field>
</doc>
<doc>
  <field name="id">3</field>
  <field name="cat">management</field>
  <field name="cat">stuff</field>
</doc>
<doc>
  <field name="id">4</field>
  <field name="cat">abc</field>
  <field name="cat">stuff</field>
</doc>

The "cat" field is defined as:

<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>

and the "string" type is defined as:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" />

When I do a facet query on the "cat" field, sorted by value (http://localhost:8983/solr/collection1/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=cat&facet.sort=index), I get:

....
"facet_fields":{
  "cat":[
    "abc",1,
    "management",1,
    "manuka",1,
    "mystery",1,
    "mānuka",1,
    "stuff",3]},
....

Note that mānuka comes after mystery. I'd like to have mānuka come after manuka and before stuff, that is, I'd like the sort to ignore diacritics including the macron.
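
To make the two orderings concrete, here is a minimal Java sketch. It uses the JDK's `java.text.Collator` as a stand-in for whatever collation Solr would apply; it is not Solr code. The class and method names are made up for illustration. The first method reproduces the order I currently get, the second the order I would like.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not part of Solr.
final class FacetOrderDemo {
    // "m\u0101nuka" is "mānuka"; the escape keeps the source ASCII-only.
    static final List<String> VALUES = List.of(
            "stuff", "mystery", "manuka", "m\u0101nuka", "management", "abc");

    // Natural String order compares UTF-16 code units. For these values that
    // matches the order a plain string field returns: U+0101 sorts after 'y'.
    static List<String> byCodeUnit(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    // Collation order: letters are compared first, and diacritics only break
    // ties between otherwise identical words.
    static List<String> byCollation(List<String> values) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(Collator.TERTIARY);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(collator);
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(byCodeUnit(VALUES));
        System.out.println(byCollation(VALUES));
    }
}
```

With the collator, mānuka lands directly after manuka and before mystery, which is the facet order I am after.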

If this were a non-facet search, it looks like I could achieve what I want by setting up Collation for a separate copy field and sorting by that (I can't set up collation for the field itself because the stored data will be a binary representation of the collation key). However, it looks like this approach isn't possible for facet queries, since they can only be sorted by index or count.

Am I overlooking something? Is there some trick to get this working in an environment where I do need to display the value of the "cat" field?

Karsten R. · Accepted Answer · 2016-02-29T17:03:33

The question is about customizing the index-order of a facet.

Your suggestion is to use Collation. You can do this, and the order of your facets will be correct. The problem is that neither CollationField nor ICUCollationField overrides the indexedToReadable method.

The two classes cannot override indexedToReadable because, in general, the mapping from word to term is not invertible. In your case, though, you may be able to implement a subclass of ICUCollationField that overrides indexedToReadable in a sensible way.
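
The non-invertibility can be illustrated with a small sketch. It assumes the JDK's `java.text.Collator` as a stand-in for the ICU collator that ICUCollationField uses, and the class and method names are hypothetical. At primary strength, two different words can produce the same collation key, so the key alone cannot tell you which word it came from.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class; uses the JDK collator, not Solr or ICU4J.
final class CollationKeyDemo {
    // Build the primary-strength collation key for a value.
    static byte[] primaryKey(String value) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(Collator.PRIMARY); // ignore accents and case
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.getCollationKey(value).toByteArray();
    }

    // True when both values collapse to the same key bytes.
    static boolean sameKey(String a, String b) {
        return Arrays.equals(primaryKey(a), primaryKey(b));
    }

    public static void main(String[] args) {
        // "m\u0101nuka" is "mānuka"; the escape keeps the source ASCII-only.
        System.out.println(sameKey("manuka", "m\u0101nuka"));
        System.out.println(sameKey("manuka", "mystery"));
    }
}
```

Any indexedToReadable override would therefore need extra information beyond the key, for example the original text carried alongside it, to return a readable value.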

Your starting point could be TestICUCollationField with

    <fieldType name="sort_fr_t" class="solr.ICUCollationField" locale="fr" strength="primary"/>
    ...
    <field name="sort_fr" type="sort_fr_t" indexed="true" stored="true" docValues="true" multiValued="true"/>

As you will see, in this case the facet values come back in a form that is very hard to read.
