I am using Apache SOLR to index markdown documents.
As you know, Markdown is basically plain text with special markers for formatting such as bold and italic.
The problem is: if a markdown document contains any formatting (bold, italic, headings, links, etc.), full-text search does not work on the formatted words. If the document has no formatting at all, i.e. the markdown is identical to the plain text, the search works fine.
I have concluded that I need to convert the markdown to plain text before indexing the documents. Only then will full-text search work as expected in all cases.
From some searching and reading on various online forums, I think I need to implement a custom analyzer that first converts the markdown to plain text and then indexes it.
I think this situation is similar to what Apache Tika does for Microsoft Office documents: it parses them and extracts the plain text. I need to do a similar thing for markdown documents - parse them and convert them to plain text.
I have already found a way to convert markdown to plaintext.
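For illustration, the conversion step I have in mind looks roughly like this (a minimal sketch assuming the commonmark-java library; the converter I actually found may differ):

```java
// Minimal sketch of markdown-to-plain-text conversion, assuming the
// commonmark-java library (org.commonmark:commonmark). Any converter
// that strips formatting would do just as well.
import org.commonmark.node.Node;
import org.commonmark.parser.Parser;
import org.commonmark.renderer.text.TextContentRenderer;

public class MarkdownToText {

    private static final Parser PARSER = Parser.builder().build();
    private static final TextContentRenderer RENDERER = TextContentRenderer.builder().build();

    /** Parses the markdown and renders only its textual content. */
    public static String toPlainText(String markdown) {
        Node document = PARSER.parse(markdown);
        return RENDERER.render(document);
    }

    public static void main(String[] args) {
        // The ** and * markers are gone in the output, so the indexed
        // tokens match what a user would actually search for.
        System.out.println(toPlainText("This is **bold** and *italic* text."));
    }
}
```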
However, I am not sure whether I really need to create a custom analyzer. I have read some code for custom analyzers, but all of them use TokenFilters. From my understanding, TokenFilters operate on the token stream on a token-by-token basis, whereas in my case the entire markdown content has to be converted to plain text before tokenization. So please suggest an approach for this.
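For concreteness, this is roughly what I have in mind: a minimal Lucene-level sketch (not something I have wired into Solr yet) of an analyzer that converts the whole document before tokenization, reusing the hypothetical toPlainText() helper from the snippet above:

```java
// Sketch only: an Analyzer that converts the whole markdown document to
// plain text before tokenization by overriding initReader(). Note that
// highlighting offsets would refer to the converted text, not the
// original markdown.
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MarkdownAnalyzer extends Analyzer {

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Runs before the tokenizer sees anything, so the whole document
        // can be transformed here (unlike a TokenFilter, which only sees
        // one token at a time).
        return new StringReader(MarkdownToText.toPlainText(readFully(reader)));
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        return new TokenStreamComponents(tokenizer);
    }

    // Drains the reader into a single String.
    private static String readFully(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[8192];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If I understand correctly, Solr could reference such a class through the analyzer class attribute of a fieldType in schema.xml, though perhaps packaging the conversion as a CharFilterFactory would be the more idiomatic route - this is exactly what I am unsure about.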
Another approach I have thought about is to first convert the markdown to plain text and then save the plain text alongside the markdown on disk. But I want to avoid this and handle it inside Solr: I expect Solr to convert the document to plain text and then index it.
- Should I be creating a custom analyzer for converting the markdown documents to plain text? Or is a custom query parser required?
- Can someone give a code example for the same (pseudocode is also fine)?
Please help.
The problem was with WhiteSpaceAnalyzer: it was just tokenizing based on whitespace but not on special chars like * or ## in markdown. I see that for my use case StandardTokenizerFactory is perfect, as that tokenizer breaks on whitespace as well as non-alphanumeric chars, as mentioned by you. I have made this change and now the search is working as expected. – Chetan Yewale