
I am trying to implement an Elasticsearch mapping to optimize phrase search in a large body of text. As per the suggestions in this article, I am using a shingle filter so that multi-word phrases are indexed as single tokens.
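
For concreteness, this is the kind of shingle output I mean. A minimal sketch using the _analyze API (the inline filter settings here are just an illustration, not my full mapping):

```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "shingle",
      "min_shingle_size": 2,
      "max_shingle_size": 2,
      "output_unigrams": true
    }
  ],
  "text": "The quick brown fox"
}
```

This emits the single terms plus the two-word shingles ("the quick", "quick brown", "brown fox"), so a phrase query can match one pre-built token instead of intersecting term positions.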

Two questions:

  1. In the article mentioned, stopwords are filtered out before shingling, and the shingle filter fills the positions left by the removed words with "_" filler tokens. These fillers should then be eliminated from the shingles before they are indexed; the point of this elimination is to be able to respond to phrase queries that contain all sorts of "useless" words. The standard solution (as mentioned in the article) is no longer possible, because Lucene has deprecated the feature (enable_position_increments) that this behaviour relied on. How do I solve this? (A sketch of the workaround I am experimenting with is the first one below, after this list.)

  2. Given the elimination of punctuation during tokenization, I routinely see shingles resulting from this process that span two adjacent sentences. From the point of view of search, any match that combines words from two separate sentences is not correct. How do I avoid (or at least mitigate) this kind of issue? (A sketch of the boundary-marker idea I am considering is the second one below.)
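
For question 1, the workaround I am experimenting with is to let the shingle filter emit its "_" fillers and then throw away every shingle that contains one. This is only a sketch: the index and filter names are mine, and I am assuming the stock pattern_replace and length token filters behave as documented. Note the pattern would also blank a legitimate token containing an underscore.

```json
PUT /my_phrase_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "phrase_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false,
          "filler_token": "_"
        },
        "blank_filler_shingles": {
          "type": "pattern_replace",
          "pattern": ".*_.*",
          "replacement": ""
        },
        "drop_empty": {
          "type": "length",
          "min": 1
        }
      },
      "analyzer": {
        "stopword_shingles": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "english_stop",
            "phrase_shingles",
            "blank_filler_shingles",
            "drop_empty"
          ]
        }
      }
    }
  }
}
```

The pattern_replace filter turns any shingle containing a "_" filler into an empty token, and the length filter (min 1) then discards it, so only shingles built from consecutive non-stopwords reach the index. This is not equivalent to the old enable_position_increments behaviour, but it does keep the fillers out of the index.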
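For question 2, the only mitigation I have come up with so far (again just a sketch with made-up names, and the crude [.!?] pattern will misfire on abbreviations like "Dr.") is to replace sentence-ending punctuation with an artificial boundary word before tokenizing, and then drop every shingle that contains it, in the same way as the fillers above:

```json
PUT /my_sentence_aware_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "mark_sentence_ends": {
          "type": "pattern_replace",
          "pattern": "[.!?]",
          "replacement": " zzsentboundaryzz "
        }
      },
      "filter": {
        "phrase_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false
        },
        "blank_boundary_shingles": {
          "type": "pattern_replace",
          "pattern": ".*zzsentboundaryzz.*",
          "replacement": ""
        },
        "drop_empty": {
          "type": "length",
          "min": 1
        }
      },
      "analyzer": {
        "sentence_shingles": {
          "type": "custom",
          "char_filter": ["mark_sentence_ends"],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "phrase_shingles",
            "blank_boundary_shingles",
            "drop_empty"
          ]
        }
      }
    }
  }
}
```

Because the marker is a letter-only word, the standard tokenizer keeps it as a token, so any shingle that crosses a sentence end contains it and gets blanked and dropped (e.g. "fox zzsentboundaryzz the"), while shingles inside a single sentence are untouched.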

Did you find any solutions for your problem? I'm currently having the same problem with shingles and looking for a solution. – paweloque