2
votes

I'm trying to select a search tool for a large project, and I'd like to know whether this use case is supported by Solr or ElasticSearch.

My customers are interested in conducting relatively sophisticated boolean searching. One search that is a must is the ability to conduct proximity searches on phrases with root expanders.

For example, imagine a user searching for a document with this phrase: "The cute dog was attacked by evil raccoons"

I'd like the user to be able to search for "evil rac*" within 5 words of "dog" and return a document with the above sentence. Ideally, a query would look something like:

("evil rac*" dog)~5
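To pin down the semantics being asked for, here is a minimal brute-force sketch in plain Java. It is not how Solr, ElasticSearch, or dtSearch evaluate such a query; the class and method names are made up for illustration, and "within N words" is assumed to mean "at most N words between the phrase and the term, in either order".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Brute-force illustration of "phrase with a trailing wildcard within N words of a term". */
class ProximityDemo {

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** A trailing '*' in the pattern means "starts with"; otherwise the match is exact. */
    static boolean termMatches(String pattern, String token) {
        return pattern.endsWith("*")
                ? token.startsWith(pattern.substring(0, pattern.length() - 1))
                : token.equals(pattern);
    }

    /** True if the phrase's words occur consecutively, beginning at position start. */
    static boolean phraseAt(List<String> tokens, int start, String[] phrase) {
        if (start + phrase.length > tokens.size()) {
            return false;
        }
        for (int i = 0; i < phrase.length; i++) {
            if (!termMatches(phrase[i], tokens.get(start + i))) {
                return false;
            }
        }
        return true;
    }

    /** True if at most maxGap words lie between the phrase and the term (assumed semantics). */
    static boolean within(String text, String phrase, String term, int maxGap) {
        List<String> tokens = tokenize(text);
        String[] words = phrase.toLowerCase(Locale.ROOT).trim().split("\\s+");
        String needle = term.toLowerCase(Locale.ROOT);
        for (int p = 0; p < tokens.size(); p++) {
            if (!phraseAt(tokens, p, words)) {
                continue;
            }
            int phraseEnd = p + words.length - 1;
            for (int t = 0; t < tokens.size(); t++) {
                boolean insidePhrase = t >= p && t <= phraseEnd;
                if (insidePhrase || !termMatches(needle, tokens.get(t))) {
                    continue;
                }
                // Number of words strictly between the term and the phrase.
                int gap = t < p ? p - t - 1 : t - phraseEnd - 1;
                if (gap <= maxGap) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "The cute dog was attacked by evil raccoons";
        System.out.println(within(doc, "evil rac*", "dog", 5)); // true: 3 words in between
        System.out.println(within(doc, "evil rac*", "dog", 2)); // false
    }
}
```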

So far, the only search tool I've found that can do what I'm looking for is dtSearch. The query for dtSearch would be "evil rac*" w/5 dog, which is great. I'd rather use an open source tool like Solr or ElasticSearch (and especially a hosted solution such as websolr or bonsai.io). Any advice would be very much appreciated.

3
Hey Jake, Nick with websolr/bonsai here. Do you have some more examples of the queries you've tried and their resulting behavior? I suspect quoting drops the asterisk or otherwise treats it literally. - Nick Zadrozny

3 Answers

2
votes

It's certainly technically possible to do this with a custom query parser, but the stock parsers in Solr (the default parser, dismax, etc.) don't appear to support it. There's an old and unresolved issue about this: https://issues.apache.org/jira/browse/SOLR-1604.

ElasticSearch would only support this with the JSON query builder, but it appears that the phrase-like query support is only for "span_term"s, which are just simple words.
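For reference, a rough sketch of what such a JSON query body could look like, built as a plain string in Java so that no client library is needed. The field name "body" and the class name are placeholders. The prefix clause assumes an ElasticSearch version that offers span_multi for wrapping multi-term queries; on a version without it, only the span_term clauses apply and the wildcard part cannot be expressed this way.

```java
/** Builds ElasticSearch span-query bodies as JSON strings (illustrative sketch only). */
class SpanQueryJson {

    // Note: values are inserted without JSON escaping, so they are assumed
    // to contain no quotes or backslashes.

    /** span_term matches one exact indexed term; the value is not analyzed. */
    static String spanTerm(String field, String value) {
        return "{\"span_term\":{\"" + field + "\":\"" + value + "\"}}";
    }

    /** span_multi wraps a prefix query so it can act as a span clause (version-dependent). */
    static String spanPrefix(String field, String prefix) {
        return "{\"span_multi\":{\"match\":{\"prefix\":{\"" + field
                + "\":{\"value\":\"" + prefix + "\"}}}}}";
    }

    /** span_near requires all clauses to occur within slop positions of each other. */
    static String spanNear(int slop, boolean inOrder, String... clauses) {
        return "{\"span_near\":{\"clauses\":[" + String.join(",", clauses)
                + "],\"slop\":" + slop + ",\"in_order\":" + inOrder + "}}";
    }

    /** The question's example: "evil rac*" within 5 words of "dog". */
    static String evilRacNearDog(String field) {
        // Inner clause: "evil" immediately followed by a term starting with "rac".
        String phrase = spanNear(0, true, spanTerm(field, "evil"), spanPrefix(field, "rac"));
        // Outer clause: that phrase and "dog" within 5 positions, in either order.
        return "{\"query\":" + spanNear(5, false, phrase, spanTerm(field, "dog")) + "}";
    }

    public static void main(String[] args) {
        System.out.println(evilRacNearDog("body"));
    }
}
```

Since span_term values are not analyzed, the terms have to be given in the form they were indexed in (typically lowercased).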

There's some talk of the default query parsers being more clever in the near future.

1
votes

Definitely technically possible, but as of yet unsupported in Lucene. There are a few open issues for supporting "complex phrase" behavior in Lucene, which seem to be targeted at Lucene 4.3:

LUCENE-1486 — An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.

I don't see your specific query structure in their examples there, but this is definitely a lot closer than what's available today.

To recap: theoretically feasible, but not supported by the query syntax as of April 2013 (Lucene 4.2.1).

(Hat tip to my business partner, Kyle, for help researching this.)

0
votes

It is possible but...

1) First, look at the surround query parser: http://wiki.apache.org/solr/SurroundQueryParser (see also http://searchhub.org/2009/02/22/exploring-query-parsers/). It does almost exactly what you want. However, although some sources claim that it supports phrase queries, it does not (yet).

2) So you have to implement phrase proximity yourself. A (nasty) hack is to patch DistanceQuery::getSpanNearQuery (line 78 of lucene/queryparser/.../DistanceQuery.java in Solr 4.2.1):

while (sqi.hasNext()) {
  SpanNearClauseFactory sncf = new SpanNearClauseFactory(reader, fieldName, qf);

  // HACK starts here
  DistanceSubQuery dsq = (DistanceSubQuery) sqi.next();
  // A term whose text contains a space is treated as an exact phrase.
  // The instanceof check replaces the original empty catch block, which
  // silently swallowed the ClassCastException for other sub-query types.
  if (dsq instanceof SrndTermQuery
      && ((SrndTermQuery) dsq).getTermText().contains(" ")) {
    String[] tokens = ((SrndTermQuery) dsq).getTermText().split("\\s+");
    SpanQuery[] spanQueries = new SpanQuery[tokens.length];
    for (int i = 0; i < tokens.length; ++i) {
      spanQueries[i] = new SpanTermQuery(new Term(fieldName, tokens[i]));
    }
    // slop 0, inOrder true: the tokens must be adjacent and in the given order
    spanClauses[qi] = new SpanNearQuery(spanQueries, 0, true);
    qi++;
    continue;
  }
  // HACK ends here

  dsq.addSpanQueries(sncf);

3) Finally, be aware that the query terms are not preprocessed (analyzed), so if your index uses stemming you have to search for the exact stemmed forms. For example, select?q={!surround df=text}"we defin" 11w "descend" will match "we define a set of words sorted in descending".
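The request in step 3 contains characters (braces, quotes, spaces) that need URL encoding when sent from code rather than pasted into a browser. A small sketch of building it with the JDK only; the base URL (host, port, and core name) is a placeholder, and the class name is made up:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds a URL-encoded Solr select request for the surround query parser. */
class SurroundRequest {

    /** baseUrl is the core's URL without a trailing slash (hypothetical in the demo). */
    static String selectUrl(String baseUrl, String defaultField, String surroundQuery) {
        // Local params select the surround parser and the default field.
        String q = "{!surround df=" + defaultField + "}" + surroundQuery;
        try {
            return baseUrl + "/select?q=" + URLEncoder.encode(q, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The query terms are already stemmed by hand ("defin", "descend").
        System.out.println(selectUrl("http://localhost:8983/solr/collection1",
                "text", "\"we defin\" 11w \"descend\""));
    }
}
```

The quoted phrase in this query is only treated as a phrase if the hack from step 2 is in place.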