2 votes

I am using Apache SOLR to index markdown documents.
As you know, Markdown is basically plain text with special tags for formatting such as bold and italic. The problem is: if the markdown has bold or italic formatting, full-text search does not work. However, if the markdown document has no formatting elements (bold, italic, headings, links, etc.), full-text search works. To summarize: it works when the markdown document is identical to the plain text (i.e. no word carries any markdown formatting).

I have concluded that I need to convert the markdown to plain text before indexing the documents. Only then will full-text search work as expected in all cases.

I did some searching and reading on different online forums, and I think I need to implement a custom analyzer. The custom analyzer would convert the markdown to plain text first and then index it. This seems similar to what Apache Tika does for Microsoft Office documents: it parses them and extracts the plain text. I think I need to do the same for markdown documents: parse them and convert them to plain text. I have already found a way to convert markdown to plain text.

However, I am not sure whether I really need to create a custom analyzer. I have read some code for custom analyzers, but all of them use TokenFilters. From my understanding, TokenFilters operate on the stream on a token-by-token basis, whereas in my case the entire markdown document has to be converted to plain text. So please suggest an approach for this.

Another approach I have considered is to convert the markdown to plain text first and save the plain text alongside the markdown on disk. But I want to avoid this and handle it within Solr: I expect Solr to convert the document to plain text and then index it.

  1. Should I create a custom analyzer to convert the markdown documents to plain text, or is a custom query parser required?
  2. Can someone give a code example for this (pseudocode is also fine)?

Please help.

I can't see any reason why a StandardTokenizer shouldn't be able to give you proper Markdown-less tokens (as it'll split on and drop most non-alphanumeric characters). What part of the Markdown syntax does it barf on? - MatsLindh
@MatsLindh, I figured out the problem. I was using just the WhitespaceAnalyzer, which tokenizes only on whitespace, not on special characters like * or ## in markdown. I see that for my use case StandardTokenizerFactory is perfect, since the tokenizer breaks on whitespace as well as non-alphanumeric characters, as you mentioned. I have made this change and the search now works as expected. - Chetan Yewale
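
For reference, a minimal schema.xml sketch of the kind of whitespace-only field type that produces the behaviour described in the question (the field type name and the lowercase filter are illustrative, not taken from the question):

```xml
<!-- "Before": tokenizes only on whitespace, so tokens such as "**bold**"
     or "_italic_" keep their Markdown characters and don't match plain-word queries. -->
<fieldType name="text_markdown" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```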

1 Answer

1 vote

Use a StandardTokenizer - it'll split on most non-alphanumeric characters, which should be suitable for getting the Markdown content indexed as single terms instead of with the Markdown syntax kept intact.

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

  - Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names.
  - The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved as single tokens.

If you also want to split on periods between words, you can use a PatternReplaceCharFilterFactory to insert a space after a dot that sits directly between two words with no whitespace.
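
A hedged sketch of that variant; the regex and the field type name are my assumptions, the idea being to insert a space after a dot that sits directly between word characters so the tokenizer splits there:

```xml
<fieldType name="text_markdown_split_dots" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Turns "foo.bar" into "foo. bar" before tokenization,
         so "foo" and "bar" become separate tokens. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\w)\.(\w)"
                replacement="$1. $2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that a char filter like this will also split domain names such as "example.com" into two tokens, so only add it if that is acceptable for your data.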