4
votes

I'm adding search to my Django project using django-haystack. The problem is that some fields in my models contain French accents, and I would like to find the entries that contain the query whether it is typed with or without accents.

I think the best idea is to create a SearchIndex that holds both the field with its accents and the same field without them.

Any ideas or hints on this?

Here is some code

Imagine the following model:

from django.db import models

class Cars(models.Model):
    name = models.CharField(max_length=255)

and the following Haystack Index:

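from haystack import indexes

class CarsIndex(indexes.SearchIndex):
    name = indexes.CharField(model_attr='name')
    # Same source field, indexed a second time with the accents removed.
    cleaned_name = indexes.CharField(model_attr='name')

    def prepare_cleaned_name(self, obj):
        # strip_accents is a helper I still have to write.
        return strip_accents(obj.name)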

Now, in my index template, I put both fields:

{{ object.cleaned_name }}
{{ object.name }}

That's only pseudocode and I don't know if it works, but if you have any ideas on this, let me know!

3
I'm not sure, but if you are using Solr as the backend, you can put a '~' at the end of the query; this gives you fuzzy results that don't care about the accents. - diegueus9
I would prefer a solution that does not depend on the backend. Thanks anyway. - dzen
I think what you're after is called "Character Folding", and although it has different setup depending on the backend, the setup is very simple. I've explained how to set it up for solr and whoosh here: gregbrown.co.nz/code/haystack-character-folding - Greg

3 Answers

4
votes

I found a way to index both values from the same field in my model.

First, write a method on your model that returns the ASCII version of the field:

class Car(models.Model):
    name = models.CharField(max_length=255)

    def ascii_name(self):
        # strip_accents is a helper that removes diacritics ("é" -> "e").
        return strip_accents(self.name)

Then, in the template used to generate the index, you can do this:

{{ object.name }}
{{ object.ascii_name }}

Then you just have to rebuild your indexes!
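The `strip_accents` helper is never defined in this thread. A minimal sketch, assuming it is a plain function built on the standard library's `unicodedata` module (the implementation below is an assumption, not the author's code), could look like this:

```python
import unicodedata


def strip_accents(text):
    """Return text with combining diacritical marks removed.

    Hypothetical helper: the thread uses this name but never shows its body.
    """
    # NFD splits an accented character into its base letter followed by
    # one or more combining marks, e.g. "é" -> "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Citroën"))
```

Note that this only removes marks that decompose from a base letter; ligatures such as "œ" are left untouched and would need their own mapping.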

3
votes

Yes, you're on the right track here. Sometimes you do want to store fields multiple times, with different transformations applied.

An example of this in my application is that I have two title fields: one for searching, which gets stemmed (the process by which test ~= tests ~= tester), and another for sorting, which is left alone (the stemming interferes with the sort order).

This is an analogous case.

In my schema.xml this is handled by:

<field name="title" type="text" indexed="true" stored="true" multiValued="false" />
<field name="title_sort" type="string" indexed="true" stored="true" multiValued="false" />

The type "string" is responsible for storing the "as-is" version of the title.

By the way, if you're stripping accents just to make words easier to search for, this might be worth looking into: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ISOLatin1AccentFilterFactory

1
votes

You can do something like the following:

class CarsIndex(indexes.SearchIndex):
    name = indexes.CharField(model_attr='name')

    def prepare(self, obj):
        self.prepared_data = super(CarsIndex, self).prepare(obj)
        # Append an accent-free copy so both spellings end up in the index.
        self.prepared_data['name'] += '\n' + strip_accents(self.prepared_data['name'])
        return self.prepared_data

I don't like this solution, though. I would like to know a way to configure my search backend to do it for me. I use Whoosh.
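To see what the `prepare` override puts into the index, here is a standalone sketch that applies the same transformation to a plain dict, without Haystack. The names `add_folded_copy` and `strip_accents` are hypothetical, and the accent stripping is assumed to be done with `unicodedata`:

```python
import unicodedata


def strip_accents(text):
    # Hypothetical helper: decompose, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def add_folded_copy(prepared_data, field="name"):
    """Mimic the prepare() override on a plain dict.

    Returns a copy in which the field holds the original text plus an
    accent-free version on a new line.
    """
    result = dict(prepared_data)
    result[field] = result[field] + "\n" + strip_accents(result[field])
    return result


data = add_folded_copy({"name": "Citroën"})
print(data["name"].splitlines())
```

Because the indexed text contains both spellings, a query typed with or without the accent can match the same document, whatever the backend.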