0 votes

I am trying to migrate a Lucene tokenizer to Apache Solr. In Lucene I have already written a TokenizerFactory for each field type (title, body, etc.). Lucene lets you attach a TokenStream directly to a field in a document, whereas in Solr you have to write a custom Tokenizer/Filter. This is where I am stuck; the many blogs and books I have researched do not solve my problem, because most of them feed a plain string or int straight into the field type.

I have built a custom TokenFilterFactory for Apache Solr and declared it in my schema.xml like the following:

<fieldType name="text_reversed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="analyzer.TextWithMarkUpTokenizerFactory"/>
    <filter class="analyzer.ReverseFilterFactory"/>
  </analyzer>
</fieldType>
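
For context, the title field is bound to this field type with a declaration roughly like the following (the indexed/stored flags here are illustrative, not copied from my schema):

<!-- illustrative field declaration binding title to the custom type -->
<field name="title" type="text_reversed" indexed="true" stored="true"/>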

When I try to index a document in Solr:

 TextWithMarkUp textWithMarkUp = //get from method
 SolrInputDocument solrInputDocument = new SolrInputDocument();
 solrInputDocument.addField("id", new Random().nextDouble());
 solrInputDocument.addField("title", textWithMarkUp);

On the Apache Solr admin panel the result looks like the following:

{
    "id":"0.4470506508669744",
    "title":"com.xyz.data:[text = Several disparities are highlighted in the new report:\n\n74 percent of white male students said they felt like they belonged at school., tokens.size = 24], tokens = [Several] [disparities] [are] [highlighted] [in] [the] [new] [report] [:] [74] [percent] [of] [white] [male] [students] [said] [they] [felt] [like] [they] [belonged] [at] [school] [.] ",
    "_version_":1607597126134530048
}

I am not able to get the textWithMarkUp instance inside my custom TokenStream, which blocks me from flattening the object the way I used to with Lucene, where I would set the textWithMarkUp instance right after creating my custom TokenStream instance. Below is the JSON version of a textWithMarkUp instance:

{
"text": "The law, which was passed by the Louisiana Legislature and signed by Gov.",
"tokens": [
    {
        "category": "Determiner",
        "canonical": "The",
        "ids": null,
        "start": 0,
        "length": 3,
        "text": "The",
        "order": 0
    },
    //tokenized/stemmed/tagged all the words
],
"abbreviations": [],
"essentialTokenNumber": 12
}

The following code is what I am trying to do:

import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class TextWithMarkUpTokenizer extends Tokenizer {
    private final PositionIncrementAttribute posIncAtt;
    protected int tokenIndex = -1; // index of the current token in the collection of metaQTokens
    protected List<MetaQToken> metaQTokens;
    protected TokenStream tokenTokenizer;
    private List<Token> markup; // tokens supplied via setTextWithMarkUp

    public TextWithMarkUpTokenizer() {
        tokenTokenizer = new MetaQTokenTokenizer();
        posIncAtt = addAttribute(PositionIncrementAttribute.class);
    }

    public void setTextWithMarkUp(TextWithMarkUp text) {
        this.markup = text == null ? null : text.getTokens();
    }

    @Override
    public final boolean incrementToken() throws IOException {
        // need to get the TextWithMarkUp instance here
        return false;
    }

    private void setCurrentToken(Token token) {
        ((IMetaQTokenAware) tokenTokenizer).setToken(token);
    }
}

I have implemented the TextWithMarkUpTokenizerFactory class as well, but once the jar is loaded from Solr's lib folder, Solr has full control over the factory: it instantiates it itself, so my indexing code never gets a reference through which to set the textWithMarkUp instance.
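
For completeness, the factory is just the standard wiring (sketched here with only the essentials), which is why there is no hook for handing over the instance:

import java.util.Map;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

public class TextWithMarkUpTokenizerFactory extends TokenizerFactory {
    public TextWithMarkUpTokenizerFactory(Map<String, String> args) {
        super(args);
    }

    @Override
    public Tokenizer create(AttributeFactory factory) {
        // Solr calls this itself during analysis; nothing outside the
        // factory can reach the Tokenizer instance created here
        return new TextWithMarkUpTokenizer();
    }
}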

So is there any way to set the given instance at indexing time in Solr? I have researched Update Request Processors. Could they be a solution to my problem?

But you're submitting the string representation of the TextWithMarkup class - which probably is just the text part. Solr doesn't know anything about the "TextWithMarkup" class, and neither does the SolrJ client. You could try to serialize the content as JSON in the field type, then de-serialize it on the other side, or submit the content as it's given to the TextWithMarkup class instead, and then do the TextWithMarkup processing as part of your filter? – MatsLindh

@MatsLindh can you please tell me more about how to deserialize it on the other side? – Bibek Shakya

I'm unable to retrieve the textWithMarkUp instance in my custom filter. – Bibek Shakya

The best solution is to send the content used to create your TextWithMarkup instance to Solr, and then only create the instance there. The other option is to serialize it with JSON or a Java serializer, then unserialize it on the other side. Is there any reason why you can't send the content and then create the TextWithMarkup instance on the Solr side? – MatsLindh

In my system I have delegated all the NLP tasks to an ETL pipeline; indexing starts after the ETL operation finishes. Can you give me more information on where I would be able to deserialize the JSON in Solr? – Bibek Shakya

1 Answer

1 vote

Solr search results are exactly what the indexing system received: the original input, after it has been processed by all update processors. The update processor chain that Solr uses by default does not change the input.
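
For reference, an update processor chain is declared in solrconfig.xml along these lines (a sketch approximating Solr's implicit default chain; the name is arbitrary):

<!-- sketch: these three processors approximate Solr's implicit default chain -->
<updateRequestProcessorChain name="custom" default="true">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

A custom processor inserted before RunUpdateProcessorFactory is the one place where you could rewrite the field value before it is indexed and stored.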

Analysis chains defined in your schema have absolutely no effect on the content returned in search results - they only affect which tokens are generated at index time and query time. Stored data is unaffected by analysis.

When you call addField with your custom object, chances are that the following SolrJ code is what gets called to figure out what to send to Solr (val is the input object):

writeVal(val.getClass().getName() + ':' + val.toString());

This creates a string with the name of the class followed by the string representation of the class. As MatsLindh said in a comment, SolrJ doesn't know anything about your custom object, so the data is not going to arrive at Solr as your custom object type.
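
If you need the structured data to survive the trip, one option from the comments is to serialize the object to JSON yourself before adding the field. A minimal sketch using Jackson (the library choice is an assumption; any JSON serializer works):

import com.fasterxml.jackson.databind.ObjectMapper;

// send a JSON string instead of the class-name-plus-toString()
// fallback shown above; writeValueAsString throws
// JsonProcessingException, so handle or declare it
ObjectMapper mapper = new ObjectMapper();
solrInputDocument.addField("title", mapper.writeValueAsString(textWithMarkUp));

Your custom analysis component on the Solr side could then deserialize that JSON back into a TextWithMarkUp before tokenizing.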