Solr for Arabic

Question

I'm using Solr to index documents in three languages (Arabic, French and English), and I have used this fieldType:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

Everything worked well, but in Arabic, when I search for a word like حقل, Solr doesn't find it. When I enter the word with its letters reversed, لقح, Solr finds it and returns results.

How can I get results for Arabic words typed normally?

I don't know of any mechanism that could reverse the order of RTL text in Solr. Generally, folks find that they want some sort of lemmatization in Arabic to deal with all the inflected forms. What are you using to build the UI that you are typing the search terms into? — bmargulies
I'm using a web page; in my tests I also call Solr directly from Eclipse through the SolrJ API. — khaled Mabrouk
Are you by any chance extracting your text from PDF files? If so, there seems to be a known problem with Tika: issues.apache.org/jira/browse/… — Daniel Rikowski
Thank you Daniel and bmargulies. Yes, I'm using Tika to extract text from PDF files, and the extracted text came out reversed. Is there another way to extract data from PDF files? — khaled Mabrouk
We submitted patches to PDFBox that cause it to extract Arabic text correctly. I wonder if Tika has a current copy of PDFBox? In any case, please submit a JIRA at Apache Tika. — bmargulies

bmargulies · Accepted Answer · 2011-10-20T12:54:36

I'm going to turn Daniel's clever analysis here into an answer for the record. Don't vote for this, just go find something of his to vote for :-)

There are two ways to get a directionality mismatch with RTL text: you can be indexing it backwards, or you can be querying it backwards. A simple HTML form querying Solr will never mess up directionality. In this case, khaled was extracting text from a PDF using a library that falls victim to the tendency of PDFs to contain 'visual-order' text rather than 'logical-order' text, so the index was full of backwards Arabic. To fix this, he will have to find a library that extracts text from PDFs correctly.

Forcing Apache Tika to use the latest Apache PDFBox might help. Or his PDF may be so quirky that even the latest PDFBox can't handle it, in which case he has a hard problem.
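
Before swapping libraries, it may be worth confirming that the extracted text really is in visual order. The following is a minimal sketch in Python, not part of Solr, Tika or PDFBox: `detect_order` is a hypothetical helper that looks for a word you know is in the PDF, first as typed and then with its characters reversed. Plain reversal only approximates visual order for a single run of Arabic letters; mixed text with digits or Latin words needs a real bidi algorithm.

```python
def detect_order(extracted_text: str, known_word: str) -> str:
    """Classify how known_word appears in text extracted from a PDF.

    Returns "logical" if the word is present as typed, "visual" if only
    its character-reversed form is present, and "absent" otherwise.
    Assumption: known_word is a single all-Arabic word, so reversing its
    characters approximates what a visual-order extractor would emit.
    """
    if known_word in extracted_text:
        return "logical"
    if known_word[::-1] in extracted_text:
        return "visual"
    return "absent"


# The word from the question, written with escapes so that editor bidi
# rendering cannot confuse the character order: HAH, QAF, LAM.
WORD = "\u062d\u0642\u0644"

# Hypothetical extractor output containing the word backwards: LAM, QAF, HAH.
sample_extracted = "\u0644\u0642\u062d"

if __name__ == "__main__":
    print(detect_order(sample_extracted, WORD))
```

If the result is "visual" for text coming out of the extraction step, the problem lies upstream of Solr, and no change to the fieldType will fix it.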
