2
votes

I have a complex query that requires a full-text search on some fields and basic restrictions on other fields. Hibernate Search documentation strongly advises against adding database query restrictions to a full text search query and instead recommends putting all of the necessary fields into the full-text index. The problem I have with that is that the other fields are volatile; values can change every minute or so and those updates to the database may occur outside of the JVM doing the search, so there is a high likelihood that the local Lucene index would be out of date with respect to those fields.

Looking for strategy recommendations here. The best I've come up with so far is to join the results manually by first executing the database query (fetching only object IDs) and then execute the full text search. and somehow efficiently filter the Lucene results by the set of object IDs from the database. Of course, I don't know how many results I'll get from each separate query, so I'm worried about performance and memory. It could be tens of thousands of rows apiece in the worst case.

3

3 Answers

0
votes

I am quite interested in other ideas for this as we have a very similar scenario.

We only needed to show 50 results rows as a maximum with a couple of lookups per row. We run the query against the lucene index with the db pk ids in the index and the pull the lookups out of the database per row. It's still performant for us.

As you seem to want to process more than a few rows and lookups I did consider an alternative. Timestamp any db row updates. This would allow us to query the DB for stale indexes and then iteratively call a reindex of related documents.

0
votes

I have the same problem and do a separate Lucene and criteria query. If I first do the criteria query I will use the resulting ids to apply a custom IdFilter for Lucene search which checks whether the result is in the given Id collection from the first query. However this approach does not scale well because also in my case the number of results after the first query might be huge and the filter is limited to 1024 ids. I did not find a good solution but I change the order of my two queries depending on the number of the to be expected results. The first query should be the one which filters out most of the results.

0
votes

You can do a scheduler index update base on the last modified date.