
I am developing an advanced search engine in .NET where users can build their query from several fields:

  1. Title
  2. Content of the Document
  3. Date From, Date To
  4. From Modified Date, To modified Date
  5. Owner
  6. Location
  7. Other Metadata

I am using Lucene to index the document content and the corresponding IDs. The other metadata resides in an MS SQL database, to avoid enlarging the index and having to update it whenever the metadata changes.

How can I perform the search?

When a user searches for a term:

  • Narrow down the search results according to the criteria selected by the user, by looking them up in the SQL DB.
  • Return the matching IDs to the Lucene searcher web service, which searches for the entered keyword within the document IDs returned from the advanced-search web service.
  • Then get the relevant metadata for the document IDs (returned from Lucene) by looking in the DB again.

As you can see, there is one lookup in the DB, then Lucene, and finally the DB again to get the values to be displayed in the grid.

Questions:

How can I overcome this? I thought of searching Lucene first, but that has a drawback once the indexed documents reach 2 million. (I think narrowing down the results using the DB first has a large effect on performance.)

Another issue is passing the IDs to the Lucene search service: how efficient is passing hundreds of thousands of IDs, and what is the alternative?

I welcome any ideas, so please share your thoughts.


1 Answer


Your current solution incurs the following overhead at query-time:

1) Narrowing search space via MS-SQL

  • Generating query in your app
  • Sending it over the wire to MS-SQL
  • Parsing/Optimizing/Execution of SQL query
  • [!!] I/O overhead of returning 100,000s of IDs

2) Executing bounded full-text search via Lucene.NET

  • [!!] Lucene memory overhead of generating/executing a large BooleanQuery containing 100,000s of ID clauses in your app (you'll need to first override the default limit of 1024 clauses to even measure this effect)
  • Standard Lucene full text search execution
  • Returning matching IDs

3) Materializing result details via MS-SQL

  • Fast, indexed, ID-based lookup of search result documents (only needed for the first page of displayed results, usually ~10-25 records)

There are two assumptions you may be making that would be worth reconsidering:

A) Indexing all metadata (dates, author, location, etc...) will unacceptably increase the size of the index.

  • Try it out first: This is the best practice, and you'll massively reduce your query execution overhead by letting Lucene do all of the filtering for you in addition to text search.

  • Also, the size of your index mostly depends on the cardinality of each field. For example, if you have only 500 unique owner names, then only those 500 strings are stored, and each Lucene document internally references its owner through a symbol-table lookup (4-byte integer × 2MM docs + 500 strings = less than 8MB additional).
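As a sanity check, that arithmetic can be sketched in a few lines (illustrative only; the 24-byte average name length is an assumption, and real Lucene storage is typically even more compact thanks to compression):

```python
# Back-of-envelope estimate of the extra index size for one low-cardinality
# metadata field (e.g. "owner") across 2MM documents.

NUM_DOCS = 2_000_000       # documents in the index
UNIQUE_OWNERS = 500        # distinct owner names (field cardinality)
AVG_NAME_BYTES = 24        # assumed average bytes per owner string

per_doc_refs = 4 * NUM_DOCS                     # one 4-byte ordinal per doc
string_table = UNIQUE_OWNERS * AVG_NAME_BYTES   # each unique string stored once

total_bytes = per_doc_refs + string_table
print(f"extra index size: {total_bytes / 1024 / 1024:.1f} MB")  # ≈ 7.6 MB
```

Note that the per-document ordinals dominate; the actual strings are a rounding error, which is why field cardinality, not document count times string length, is what matters.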

B) MS-SQL queries will be the quickest way to filter on non-text metadata.

  • Reconsider this: With your metadata properly indexed using the appropriate Lucene types, you won't incur any additional overhead querying Lucene vs. querying MS-SQL. (In some cases, Lucene may even be faster.)
  • Your mileage may vary, but in my experience, this type of filtered-full-text-search when executed on a Lucene collection of 2MM documents will typically run in well under 100ms.
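To see why indexed metadata filters are essentially free, here is a toy model of what a filtered full-text query does (a minimal Python sketch; Lucene's real posting lists and doc values are far more compact and faster, and all field/doc values here are made up):

```python
# Toy model of filtered full-text search: each field keeps a posting list
# of matching doc IDs, and the query is just an intersection plus a range
# check -- no round trip to a separate metadata store.

text_index = {"lucene": {1, 4, 7, 9}, "search": {2, 4, 9}}   # term  -> doc IDs
owner_index = {"alice": {1, 2, 4}, "bob": {7, 9}}            # owner -> doc IDs
modified = {1: 20230105, 2: 20230320, 4: 20230711,           # doc ID -> date
            7: 20221230, 9: 20230702}

def filtered_search(term, owner, date_from, date_to):
    hits = text_index.get(term, set()) & owner_index.get(owner, set())
    return sorted(d for d in hits if date_from <= modified[d] <= date_to)

# "lucene" owned by alice, modified in 2023: docs 1 and 4 match
print(filtered_search("lucene", "alice", 20230101, 20231231))  # [1, 4]
```

The set intersections are the toy analogue of Lucene combining a text query with filter clauses in one pass over the index.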

So to summarize the best practice:

  • Index all of the data that you want to query or filter by. (No need to store source data since MS-SQL is your system-of-record).

  • Run filtered queries against Lucene (e.g. text AND date ranges, owner, location, etc...)

  • Return IDs

  • Materialize documents from MS-SQL using returned IDs.
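The materialization step in that flow might look like the following sketch, using Python's sqlite3 as a stand-in for MS-SQL (table and column names are made up for illustration):

```python
# Sketch of the final materialization step: Lucene returns only one page of
# IDs, and a single parameterized IN (...) query fetches the display columns
# for just those rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, owner TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?, ?)",
                 [(i, f"Doc {i}", "alice") for i in range(1, 101)])

def materialize(ids):
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, title, owner FROM docs WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    return [by_id[i] for i in ids]   # preserve Lucene's relevance order

page_ids = [42, 7, 13]               # e.g. first page of IDs from Lucene
print(materialize(page_ids))
```

Because only ~10-25 IDs cross the wire per page, the I/O overhead that dominated the original design disappears; note the explicit reordering, since SQL does not preserve the order of an IN list.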


I'd also recommend exploring a move to a standalone search server (Solr or Elasticsearch) for a number of reasons:

  1. You won't have to worry about search-index memory requirements cannibalizing application memory requirements.
  2. You'll take advantage of sophisticated filter caching performance boosts and OS-based I/O optimizations.
  3. You'll be able to iterate upon your search solution easily from a mostly configuration-based environment that is widely used/supported.
  4. You'll have tools in place to scale/tune/backup/restore search without impacting your application.
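For example, in Elasticsearch the entire filtered search becomes a single request, with the metadata clauses placed in the cacheable filter context (the field names below are assumptions based on the fields in the question):

```json
{
  "query": {
    "bool": {
      "must": { "match": { "content": "budget report" } },
      "filter": [
        { "term": { "owner": "alice" } },
        { "range": { "modified": { "gte": "2023-01-01", "lte": "2023-12-31" } } }
      ]
    }
  }
}
```

The `filter` clauses don't affect relevance scoring and can be cached as bitsets across queries, which is the filter-caching boost mentioned in point 2.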