0
votes

I want to set up a search in Lucene (actually Lucene.NET, but I can convert from Java as necessary) using the following logic:

  1. Search string is: A B C
  2. Search one field in the index for anything that matches A, B, or C. (Query: (field1:A field1:B field1:C))
  3. For each term that didn't match in step 2, search a second field for it while keeping the results from the first search (Query: (+(field1:A) +(field2:B field2:C)))
  4. For each term that didn't match in step 3, search a third field...
  5. Continue until running out of fields, or there's a search which has used every term.

Currently, my code can test whether a given search produces NO results, and ANDs together all the ones that do produce results. But I have no way to stop it before it tests against every field (which unnecessarily limits the results) - it's currently ending up with a query like: (+(field1:A field1:B field1:C) +(field3:A field3:B field3:C)) when I want it to be (+(field1:A field1:C) +(field3:B)). I can't just look at the results from the first search and remove words from the search string because the Analyzer mangles the words when it parses it for search, and I have no way to un-mangle them to figure out which of the original search terms it corresponds to.

Any suggestions?


Edit: Ok, generally I prefer describing my problems in the abstract, but I think some part of it is getting lost in the process, so I'll be more concrete.

I'm building a search engine for an site which needs to have several layers of search logic. A few example searches which I'll trace out are:

  1. Headphones
  2. Monster Headphones
  3. White Monster Headphones
  4. White Foobar Headphones

The index contains documents with seven fields - the relevant ones to this example are:

  • "datattype": A string representing what type of item this document represents (product, category, brand), so we know how to display it
  • "brand": The brand(s) that are relevant (categories have multiple brands, products and brands have one each)
  • "path": The path to a given category (i.e. "Audio Headphones In-Ear" for "Audio > Headphones > In-Ear")
  • "keywords": Various things that describe the product that don't go anywhere else.

In general, the logic for each step of the search is as follows:

  1. Check to see if we have a match.
  2. If so, filter the results based on that match, and continue parsing the rest of the search terms in the next step.
  3. If not, parse the search terms in the next step.

Each step is something like:

  • Search for a category
  • Search for a brand
  • Search for keywords

So here's how those three example searches should play out:

  1. Headphones
    • Search for a category: +path:headphones +datatype:Category
    • There are matches (the Headphone category), and no words from the original query are left, so we return it.
  2. Monster Headphones
    • Search for a category: `+(path:monster path:headphones) +datatype:Category
    • Matches were found for path:headphones and datatype:Category, leaving "Monster" unmatched
    • Search for a brand: +path:headphones +brand:monster
    • Matches were found for path:headphones and brand:monster, and no words from the original query are left, so we return all the headphones by Monster.
  3. White Monster Headphones
    • Search for a category: +(path:monster path:headphones path:white) +datatype:Category
    • Matches were found for path:headphones, and datatype:Category, leaving "White" and "Monster" unmatched
    • Search for a brand: +path:headphones +(brand:monster +brand:white)
    • Matches were found for path:headphones and brand:monster, leaving "White" unmatched
    • Search keywords: +path:headphones +brand:monster +keywords:white
    • There are matches, and no words from the original query are left, so we return them.
  4. White Foobar Headphones
    • Search for a category: +(path:foobar path:headphones path:white) +datatype:Category
    • Matches were found for path:headphones, and datatype:Category, leaving "White" and "Foobar" unmatched
    • Search for a brand: +path:headphones +(brand:foobar +brand:white)
    • Nothing was found, so we continue.
    • Search keywords: +path:headphones +(keywords:white keywords:foobar)
    • Matches were found for path:headphones and keywords:white, leaving "Foobar" unmatched
    • ... (continue searching other fields, including product description) ...
    • There are search terms still unmatched ("Foobar"), return "No results found"

The problem I have is twofold:

  1. I don't want the matches to continue once everything's matched (only products have descriptions, so once it reaches that step we'll never return something that's not a product). I could manage this by using denis's GetHitTerms from here, except that I then end up searching for the first matched term in all subsequent fields until everything matches (i.e. in example #2, I'd have +path:headphones +(brand:headphones brand:monster)).
  2. Despite my example above, my actual search query on the path field looks like +path:headphon +datatype:Taxonomy because I'm mangling it for searching. So I can't take the matched term and just remove that from the original query (because "headphon" != "headphones").

Hopefully that makes it clearer what I'm looking for.

2
Have you looked at this? cwiki.apache.org/LUCENENET/simple-faceted-search.html There may be ideas you can use.L.B
@denis: At first glance, that does look like what I need. I'll need to delve into it more to be sure, but it does look promising.Bobson
@denis: Good news and bad news. It does seem to be what I need, but my data set is far, far too large for it to handle. Even if I bump up the MAX_FACETS const. I have 455 brand terms, 251 path terms, and 10094 keyword terms, just from the fields above. If it faceted everything except descriptions, it weighs in at almost 4 trillion facets.Bobson
I bet you haven't read References to adding faceted to Lucene.Net section in that page. There are other tricks utilizing Collector class to make a faceted search. See mail-archives.apache.org/mod_mbox/lucene-lucene-net-dev/…L.B
@denis: I've read it now, but I don't understand what to do with the result, or how to tell it I want multiple fields to be part of my search.Bobson

2 Answers

1
votes

I don't understand your use case, but you sound like you're asking about the BooleanQuery API. You can get the clauses of your query by calling getClauses.

A simple example:

BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("field1","a")), BooleanClause.Occur.SHOULD)
bq.add(new TermQuery(new Term("field1","b")), BooleanClause.Occur.SHOULD)

BooleanClause[] clauses = bq.getClauses();

EDIT: maybe you're just asking for a search algorithm. In pseudocode:

generate_query (qs_that_matched, qs_that_didnt_match, level):
   new_query = qs_that_matched AND level:qs_that_didnt_match
   qs_still_unmatched = ...
   qs_which_just_matched = ...
   if qs_still_unmatched != null:
      return generate_query(qs_that_matched AND qs_which_just_matched, qs_still_unmatched, level+1)
   else:
      return qs_that_matched AND qs_which_just_matched
0
votes

In the end, I built a QueryTree class and stored the queries in a tree structure. It stores a reference to a function that takes a query, a list of terms to pump into that query, whether it should AND or OR those terms, and a list of children (which represent unique combinations of matching terms).

To perform the next level of searching, I just call Evaluate(Func<string, QueryParser.Operator, Query> newQuery) on the deepest nodes in my tree, with a reference to a function which takes terms and an operator and returns the correct Query for that set of logic. The Evaluate function then tests that new query against the list of unmatched terms that have been passed down to it and the result sets of all ancestral Querys (by ANDing with the parent, which ANDs with it's parent and so on). It then creates children for each set of matching terms, using GetHitTerms, and gives the unmatched terms to the child. Repeat for each level of search.


I suspect that there's a better way to do this - I didn't even look into Bobo that Xodarap mentioned, and I never really got faceted searching (as per denis) working. However, it's working, which means it's time to move on to other aspects of the site.