I am trying to migrate a Lucene tokenizer to Apache Solr. I have already written a TokenizerFactory for each field type (title, body, etc.) in Lucene. In Lucene, there is a way to add a TokenStream to a field in a document. In Solr, we have to write a custom Tokenizer/Filter in order to work with Lucene. This is where I am stuck: I have searched many blogs and books, but none of them solved my problem. Most of them pass a plain String or int directly to the field type.
I have built a custom TokenFilterFactory for Apache Solr and placed it in my schema.xml as follows:
<fieldType name="text_reversed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="analyzer.TextWithMarkUpTokenizerFactory"/>
    <filter class="analyzer.ReverseFilterFactory" />
  </analyzer>
</fieldType>
When I try to index a document in Solr:
TextWithMarkUp textWithMarkUp = //get from method
SolrInputDocument solrInputDocument = new SolrInputDocument();
solrInputDocument.addField("id", new Random().nextDouble());
solrInputDocument.addField("title", textWithMarkUp);
In the Apache Solr admin panel, the result looks like this:
{
"id":"0.4470506508669744",
"title":"com.xyz.data:[text = Several disparities are highlighted in the new report:\n\n74 percent of white male students said they felt like they belonged at school., tokens.size = 24], tokens = [Several] [disparities] [are] [highlighted] [in] [the] [new] [report] [:] [74] [percent] [of] [white] [male] [students] [said] [they] [felt] [like] [they] [belonged] [at] [school] [.] ",
"_version_":1607597126134530048
}
I cannot get the textWithMarkUp instance inside my custom TokenStream, which blocks me from flattening the object the way I used to with Lucene. In Lucene, I set the textWithMarkUp instance on the custom TokenStream after creating it. Below is the JSON version of a textWithMarkUp instance:
{
"text": "The law, which was passed by the Louisiana Legislature and signed by Gov.",
"tokens": [
{
"category": "Determiner",
"canonical": "The",
"ids": null,
"start": 0,
"length": 3,
"text": "The",
"order": 0
},
//tokenized/stemmed/tagged all the words
],
"abbreviations": [],
"essentialTokenNumber": 12
}
The following code is what I am trying to do:
public class TextWithMarkUpTokenizer extends Tokenizer {
private final PositionIncrementAttribute posIncAtt;
protected int tokenIndex = -1; // index of the current token in the collection of metaQTokens
protected List<MetaQToken> metaQTokens;
protected TokenStream tokenTokenizer;
public TextWithMarkUpTokenizer() {
MetaQTokenTokenizer metaQTokenizer = new MetaQTokenTokenizer();
tokenTokenizer = metaQTokenizer;
posIncAtt = addAttribute(PositionIncrementAttribute.class);
}
public void setTextWithMarkUp(TextWithMarkUp text) {
this.metaQTokens = text == null ? null : text.getTokens();
}
@Override
public final boolean incrementToken() throws IOException {
//get instance of TextWithMarkUp here
}
private void setCurrentToken(Token token) {
((IMetaQTokenAware) tokenTokenizer).setToken(token);
}
}
I have implemented everything required for the TextWithMarkUpTokenizerFactory class, but Solr has full control over the factory class once the jar is loaded from the lib folder in Solr.
So is there any way to set the given instance at indexing time in Solr? I have looked into Update Request Processors. Could they be a solution to my problem?
TextWithMarkup class - which probably is just the text part. Solr doesn't know anything about the "TextWithMarkup" class, and neither does the SolrJ client. You could try to serialize the content as JSON in the field type, then de-serialize it on the other side, or submit the content as it's given to the TextWithMarkup class instead, and then do the TextWithMarkup processing as part of your filter? – MatsLindh
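The serialization route suggested in the comment above could look roughly like the following sketch. It is only an illustration under assumptions: MarkupToken and MarkupFieldCodec are hypothetical names that are not part of Solr, SolrJ, or the original code, and the format is a simple tab-separated layout rather than JSON so that the example needs no JSON library. It also assumes that category and canonical never contain a tab or a newline.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one entry of TextWithMarkUp.getTokens(). */
final class MarkupToken {
    final int start;
    final int length;
    final String category;
    final String canonical;

    MarkupToken(int start, int length, String category, String canonical) {
        this.start = start;
        this.length = length;
        this.category = category;
        this.canonical = canonical;
    }
}

/** Hypothetical codec: flattens text plus tokens into one String and back. */
final class MarkupFieldCodec {

    /** Client side: builds the String that is sent as the field value. */
    static String encode(String text, List<MarkupToken> tokens) {
        StringBuilder sb = new StringBuilder();
        sb.append(tokens.size()).append('\n');
        for (MarkupToken t : tokens) {
            // Assumption: category and canonical contain no tab or newline.
            sb.append(t.start).append('\t')
              .append(t.length).append('\t')
              .append(t.category).append('\t')
              .append(t.canonical).append('\n');
        }
        sb.append(text); // raw text goes last, so it may contain newlines
        return sb.toString();
    }

    /** Solr side: recovers the token list from the encoded field value. */
    static List<MarkupToken> decodeTokens(String encoded) {
        int pos = encoded.indexOf('\n');
        int count = Integer.parseInt(encoded.substring(0, pos));
        List<MarkupToken> tokens = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int end = encoded.indexOf('\n', pos + 1);
            String[] parts = encoded.substring(pos + 1, end).split("\t", -1);
            tokens.add(new MarkupToken(Integer.parseInt(parts[0]),
                    Integer.parseInt(parts[1]), parts[2], parts[3]));
            pos = end;
        }
        return tokens;
    }

    /** Solr side: recovers the raw text that follows the token lines. */
    static String decodeText(String encoded) {
        int pos = encoded.indexOf('\n');
        int count = Integer.parseInt(encoded.substring(0, pos));
        for (int i = 0; i < count; i++) {
            pos = encoded.indexOf('\n', pos + 1); // skip one token line
        }
        return encoded.substring(pos + 1);
    }
}
```

On the client, the encoded String would be passed to solrInputDocument.addField("title", ...) in place of the TextWithMarkUp object. On the Solr side, a custom Tokenizer could read its whole input Reader into a String in reset(), call decodeTokens, and then emit one token per entry from incrementToken(). A JSON library would serve the same purpose if one is available on both classpaths.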