1
votes

I'm using Lucene 6.3, but I can't figure out what is wrong with the following very basic search query. It simply adds two documents, each with a single date range, and then searches on a wider range that should find both documents. What is wrong?

There are inline comments which should make the example pretty self-explanatory. Let me know if anything is unclear.

Please note that my main requirement is being able to perform date range queries alongside other field queries, such as

text:interesting date:[2014 TO NOW]

This is after watching the Lucene spatial deep dive video introduction which introduces the framework on which DateRangePrefixTree and strategies are based.

Rant: Given how simplistic my example is, it feels like any mistake I am making here should produce a validation error, either on the query or on the write.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.spatial.prefix.NumberRangePrefixTreeStrategy;
import org.apache.lucene.spatial.prefix.PrefixTreeStrategy;
import org.apache.lucene.spatial.prefix.tree.DateRangePrefixTree;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Before;
import org.junit.Test;

import java.io.IOException;
import java.util.Calendar;
import java.util.Date;


public class TestLuceneDatePrefix {

  /*
  All these names should be lower case as field names are case sensitive in Lucene.
   */
  private static final String NAME = "name";
  public static final String TIME = "time";


  private Directory directory;
  private StandardAnalyzer analyzer;
  private ScoreDoc lastDocOnPage;
  private IndexWriterConfig indexWriterConfig;

  @Before
  public void setup() {
    analyzer = new StandardAnalyzer();
    directory = new RAMDirectory();
    indexWriterConfig = new IndexWriterConfig(analyzer);
  }


  @Test
  public void testAddDocumentAndSearchByDate() throws IOException {

    IndexWriter w = new IndexWriter(directory, new IndexWriterConfig(analyzer));

    // Responsible for creating the prefix string / geohash / token to identify the date.
    // aka Create post codes
    DateRangePrefixTree prefixTree = new DateRangePrefixTree(DateRangePrefixTree.JAVA_UTIL_TIME_COMPAT_CAL);

    // Strategy indexing the token.
    // aka transform post codes into tokens that make them efficient to search.
    PrefixTreeStrategy strategy = new NumberRangePrefixTreeStrategy(prefixTree, TIME);


    createDocument(w, "Bill", new Date(2017,1,1), prefixTree, strategy);
    createDocument(w, "Ted", new Date(2018,1,1), prefixTree, strategy);

    w.close();

    // The documents are written; now try to query them.

    DirectoryReader reader;
    try {
      QueryParser queryParser = new QueryParser(NAME, analyzer);
      System.out.println(queryParser.getLocale());

      // Surely searching only on year for the easiest case should work?
      Query q = queryParser.parse("time:[1972 TO 4018]");

      // The following query returns 1 result, so Lucene is set up.
      // Query q = queryParser.parse("name:Ted");
      reader = DirectoryReader.open(directory);
      IndexSearcher searcher = new IndexSearcher(reader);

      TotalHitCountCollector totalHitCountCollector = new TotalHitCountCollector();

      int hitsPerPage = 10;
      searcher.search(q, hitsPerPage);

      TopDocs docs = searcher.search(q, hitsPerPage);
      ScoreDoc[] hits = docs.scoreDocs;

      // Hit count is zero and no document printed!!

      // Putting a dependency on mockito would make this code harder to paste and run.
      System.out.println("Hit count : "+hits.length);
      for (int i = 0; i < hits.length; ++i) {
        System.out.println(searcher.doc(hits[i].doc));
      }
      reader.close();
    }
    catch (ParseException e) {
      e.printStackTrace();
    }
  }


  private void createDocument(IndexWriter w, String name, Date fromDate, DateRangePrefixTree prefixTree, PrefixTreeStrategy strategy) throws IOException {
    Document doc = new Document();

    // Store a text/stored field for the name. This helps indicate that Lucene is working.
    doc.add(new TextField(NAME, name, Field.Store.YES));

    //offset toDate
    Calendar cal = Calendar.getInstance();
    cal.setTime( fromDate );
    cal.add( Calendar.DATE, 1 );
    Date toDate = cal.getTime();

    // This lets the prefix tree create whatever tokens it needs
    // perhaps index year, date, second etc separately, hence multiple potential tokens.
    for (IndexableField field : strategy.createIndexableFields(prefixTree.toRangeShape(
        prefixTree.toUnitShape(fromDate), prefixTree.toUnitShape(toDate)))) {
      // Debugging the tokens produced is difficult as I can't intuitively look at them and know if they are valid.
      doc.add(field);
    }
    w.addDocument(doc);
  }
}

Update:

  • I thought maybe the answer was to use SimpleAnalyzer instead of StandardAnalyzer, but this doesn't appear to work either.

  • My requirement of being able to parse user date ranges does seem to be catered for by Solr, so I would expect this to be based on Lucene functionality.

2
I thought maybe the answer was to use SimpleAnalyzer instead of StandardAnalyzer, but this doesn't appear to work either. – Daniel Gerson

2 Answers

1
votes

Firstly, QueryParser can parse dates and produces a TermRangeQuery by default. See the following method of the default parser:

org.apache.lucene.queryparser.classic.QueryParserBase#getRangeQuery(java.lang.String, java.lang.String, java.lang.String, boolean, boolean)

This assumes that you'll be storing dates as strings in the Lucene index, which is a little inefficient but works out of the box, provided a SimpleAnalyzer or equivalent is used.
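To illustrate why string dates can work with a term range at all, here is a minimal sketch using only the JDK. It assumes the indexed strings use a fixed-width, most-significant-first layout such as yyyyMMdd (the kind of layout Lucene's DateTools produces), so that plain string ordering matches chronological ordering. The class and method names (SortableDateStrings, toTerm, inTermRange) are made up for this example and are not Lucene API.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class SortableDateStrings {

  // Fixed-width pattern, most significant field first (assumed layout for this sketch).
  static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

  /** Formats a date as a sortable string, e.g. 2017-01-01 becomes "20170101". */
  static String toTerm(LocalDate date) {
    return date.format(DAY);
  }

  /** True when lower <= term <= upper under plain string ordering. */
  static boolean inTermRange(String term, String lower, String upper) {
    return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
  }

  public static void main(String[] args) {
    String lower = toTerm(LocalDate.of(2014, 1, 1));
    String upper = toTerm(LocalDate.of(2018, 12, 31));

    // A date inside the range also sorts inside the range as a string.
    System.out.println(inTermRange(toTerm(LocalDate.of(2017, 1, 1)), lower, upper));

    // A locale-style layout does not sort chronologically:
    // "1/2/2017" sorts before "12/1/2016" even though it is the later date.
    System.out.println("1/2/2017".compareTo("12/1/2016") < 0);
  }
}
```

If the strings were indexed in a layout that is not fixed-width, a range over them would compare text rather than time, which is one way a date range search can silently return nothing.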

Alternatively, you can store the dates as LongPoint, which is the most efficient option for the scenario in my question, where a date is a single point in time and one date is stored per field.

Calendar fromDate = ...
doc.add(new LongPoint(FIELDNAME, fromDate.getTimeInMillis()));

But here, as suggested for DateRangePrefixTree, this requires writing hard-coded queries.

Query pointRangeQueryHardCoded = LongPoint.newRangeQuery(FIELDNAME, fromDate.getTimeInMillis(), toDate.getTimeInMillis());

It is possible to reuse QueryParser even here, if the following method is overridden with a version that produces a LongPoint range query.

org.apache.lucene.queryparser.classic.QueryParserBase#newRangeQuery(java.lang.String, java.lang.String, java.lang.String, boolean, boolean)
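As a rough sketch of what such an override would have to compute, the JDK-only helper below turns the two range parts the parser hands over into inclusive epoch-millisecond bounds. It assumes, purely for illustration, that users type whole years (as in time:[2014 TO 2018]), that times are UTC, and that a null part means an open end. The class and method names (YearRangeBounds, lowerBound, upperBound) are hypothetical; only the LongPoint.newRangeQuery call mentioned in the comment is Lucene API.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

final class YearRangeBounds {

  /** Epoch milliseconds (UTC) of the first instant of the given year. */
  static long startOfYearMillis(int year) {
    return LocalDate.of(year, 1, 1).atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
  }

  /** Inclusive lower bound in epoch millis; a null part is treated as open-ended. */
  static long lowerBound(String part, boolean inclusive) {
    if (part == null) {
      return Long.MIN_VALUE;
    }
    int year = Integer.parseInt(part.trim());
    // An exclusive lower year means the range starts with the following year.
    return inclusive ? startOfYearMillis(year) : startOfYearMillis(year + 1);
  }

  /** Inclusive upper bound in epoch millis; a null part is treated as open-ended. */
  static long upperBound(String part, boolean inclusive) {
    if (part == null) {
      return Long.MAX_VALUE;
    }
    int year = Integer.parseInt(part.trim());
    // An inclusive upper year runs to the last millisecond of that year.
    return inclusive ? startOfYearMillis(year + 1) - 1 : startOfYearMillis(year) - 1;
  }

  public static void main(String[] args) {
    // Hypothetical use inside an overridden newRangeQuery for the date field:
    //   return LongPoint.newRangeQuery(field,
    //       lowerBound(part1, startInclusive), upperBound(part2, endInclusive));
    System.out.println(lowerBound("2014", true) + " .. " + upperBound("2018", true));
  }
}
```

Any other field would simply fall through to the parent implementation, so text queries keep working next to the date range.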

This can also be done for the DateRangePrefixTree version, but that scheme is only worthwhile if:

  • You want to search by some unusual token (I believe it could accommodate Mondays).
  • You have multiple dates per document field.
  • You are storing date ranges which need to be queried over.

I imagine that adapting the query parser to have a convenient syntax that captures all relevant scenarios would be a fair amount of work for this last case.

Additionally, be careful not to mix Date(YEAR, MONTH, DAY) with GregorianCalendar(YEAR, MONTH, DAY): the arguments are not equivalent (Date expects the year minus 1900), and mixing them will cause problems.

See java.util.Date#Date(int, int, int) for how different the arguments are and why this constructor is deprecated. This caught me out as per the code in the question.
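A small JDK-only sketch of the difference: the deprecated Date constructor adds 1900 to its year argument, and both constructors take a zero-based month. Under those rules the question's new Date(2017,1,1) is 1 February 3917, not 1 January 2017. The helper names here (DateArgsDemo, yearOf, monthOf, deprecatedDate) are made up for the example.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

final class DateArgsDemo {

  /** Gregorian year of the given date in the default time zone. */
  static int yearOf(Date date) {
    Calendar cal = new GregorianCalendar();
    cal.setTime(date);
    return cal.get(Calendar.YEAR);
  }

  /** Zero-based month (Calendar.JANUARY == 0) of the given date in the default time zone. */
  static int monthOf(Date date) {
    Calendar cal = new GregorianCalendar();
    cal.setTime(date);
    return cal.get(Calendar.MONTH);
  }

  /** Wraps the deprecated constructor so the warning is suppressed in one place. */
  @SuppressWarnings("deprecation")
  static Date deprecatedDate(int year, int month, int day) {
    return new Date(year, month, day);
  }

  public static void main(String[] args) {
    Date viaDate = deprecatedDate(2017, 1, 1);
    Date viaCalendar = new GregorianCalendar(2017, Calendar.JANUARY, 1).getTime();

    // Same-looking arguments, very different instants.
    System.out.println("Date(2017, 1, 1) year: " + yearOf(viaDate)
        + ", zero-based month: " + monthOf(viaDate));
    System.out.println("GregorianCalendar(2017, JANUARY, 1) year: " + yearOf(viaCalendar)
        + ", zero-based month: " + monthOf(viaCalendar));
  }
}
```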

Thanks again to femtoRgon for pointing out the mechanics of the spatial search, but in the end this wasn't the way for me to go.

0
votes

The QueryParser is not going to be useful for searching on spatial fields, and the analyzer isn't going to make any difference. Analyzers are designed to tokenize and transform text. As such, they aren't used by spatial fields. Similarly, the QueryParser is primarily geared around text searching, and has no support for spatial queries.

You'll need to query using a spatial query. In particular, the subclasses of AbstractPrefixTreeQuery will be useful.

For instance, if I want to query for documents whose time field is a range that contains the years 2003 - 2005, I could create a query like:

Shape queryShape = prefixTree.toRangeShape(
    // Calendar months are zero-based, so use the named constants.
    prefixTree.toUnitShape(new GregorianCalendar(2003, Calendar.JANUARY, 1)),
    prefixTree.toUnitShape(new GregorianCalendar(2005, Calendar.DECEMBER, 31)));

Query q = new ContainsPrefixTreeQuery(
          queryShape,
          "time",
          prefixTree,
          10,
          false
  );

So this would match a document that had been indexed, for instance, with the range 2000-01-01 to 2006-01-01.

Or to go the other way and match all documents whose ranges fall entirely within the query range:

Shape queryShape = prefixTree.toRangeShape(
    // Calendar months are zero-based, so use the named constants.
    prefixTree.toUnitShape(new GregorianCalendar(1990, Calendar.JANUARY, 1)),
    prefixTree.toUnitShape(new GregorianCalendar(2020, Calendar.DECEMBER, 31)));

Query q = new WithinPrefixTreeQuery(
          queryShape,
          "time",
          prefixTree,
          10,
          -1,
          -1
  );
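To make the two relations above concrete, here is a simplified JDK-only model of "contains" versus "within" on closed date intervals. It is only a sketch of the semantics and says nothing about how Lucene evaluates them against the prefix tree; the DateInterval class is invented for this illustration.

```java
import java.time.LocalDate;

final class DateInterval {

  final LocalDate start; // inclusive
  final LocalDate end;   // inclusive

  DateInterval(LocalDate start, LocalDate end) {
    if (end.isBefore(start)) {
      throw new IllegalArgumentException("end is before start");
    }
    this.start = start;
    this.end = end;
  }

  /** True when this interval fully covers the other one. */
  boolean contains(DateInterval other) {
    return !start.isAfter(other.start) && !end.isBefore(other.end);
  }

  /** True when this interval lies entirely inside the other one. */
  boolean within(DateInterval other) {
    return other.contains(this);
  }

  public static void main(String[] args) {
    DateInterval indexed =
        new DateInterval(LocalDate.of(2000, 1, 1), LocalDate.of(2006, 1, 1));
    DateInterval narrowQuery =
        new DateInterval(LocalDate.of(2003, 1, 1), LocalDate.of(2005, 12, 31));
    DateInterval wideQuery =
        new DateInterval(LocalDate.of(1990, 1, 1), LocalDate.of(2020, 12, 31));

    // "Contains" style match: the indexed range covers the query range.
    System.out.println(indexed.contains(narrowQuery));
    // "Within" style match: the indexed range falls inside the query range.
    System.out.println(indexed.within(wideQuery));
  }
}
```

For the question's case (find every document whose range falls inside a wider search range), the second relation is the one that applies.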

Note on arguments: I don't really understand some of the parameters to these queries, particularly detailLevel and prefixGridScanLevel, and I haven't found any documentation on exactly how they work. These values seem to work in my basic tests, but I don't know what the best choices would be.