I want to set up a search in Lucene (actually Lucene.NET, but I can convert from Java as necessary) using the following logic:
1. Search string is: A B C
2. Search one field in the index for anything that matches A, B, or C. (Query: `(field1:A field1:B field1:C)`)
3. For each term that didn't match in step 2, search a second field for it while keeping the results from the first search (Query: `(+(field1:A) +(field2:B field2:C))`)
4. For each term that didn't match in step 3, search a third field...
5. Continue until running out of fields, or there's a search which has used every term.
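To make the intended flow concrete, here is a minimal sketch in plain Java with no Lucene calls. The class name, `build`, and the `matches` predicate are all hypothetical: `matches` stands in for "running this single term against this field returns at least one hit", which in real code would be an index search. It also assumes the terms can be compared as-is, which is exactly the part the Analyzer breaks for me.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;

class CascadingQuerySketch {

    /**
     * Tries each field in order, carrying only the still-unmatched terms
     * forward, and stops as soon as every term has been placed in a field.
     * "matches" is a hypothetical stand-in for a per-term index lookup.
     */
    static String build(List<String> fields, List<String> terms,
                        BiPredicate<String, String> matches) {
        Set<String> remaining = new LinkedHashSet<>(terms);
        List<String> clauses = new ArrayList<>();

        for (String field : fields) {
            if (remaining.isEmpty()) {
                break; // every term is used, so later fields are never tested
            }
            List<String> matched = new ArrayList<>();
            for (Iterator<String> it = remaining.iterator(); it.hasNext();) {
                String term = it.next();
                if (matches.test(field, term)) {
                    matched.add(field + ":" + term);
                    it.remove(); // this term is done; do not try it elsewhere
                }
            }
            if (!matched.isEmpty()) {
                clauses.add("+(" + String.join(" ", matched) + ")");
            }
        }
        // Terms that never matched are simply left out of the query here.
        return clauses.isEmpty() ? "" : "(" + String.join(" ", clauses) + ")";
    }

    public static void main(String[] args) {
        // Made-up data: A and C hit in field1, B hits only in field3.
        Map<String, Set<String>> hits =
                Map.of("field1", Set.of("A", "C"), "field3", Set.of("B"));
        String query = build(
                List.of("field1", "field2", "field3"),
                List.of("A", "B", "C"),
                (field, term) -> hits.getOrDefault(field, Set.of()).contains(term));
        System.out.println(query);
    }
}
```

With the made-up data in `main`, field2 contributes no clause and the loop never offers A or C to field3.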
Currently, my code can test whether a given search produces NO results, and ANDs together all the ones that do produce results. But I have no way to stop it before it tests against every field (which unnecessarily limits the results). It currently ends up with a query like `(+(field1:A field1:B field1:C) +(field3:A field3:B field3:C))` when I want it to be `(+(field1:A field1:C) +(field3:B))`. I can't just look at the results from the first search and remove words from the search string, because the Analyzer mangles the words when it parses them for search, and I have no way to un-mangle them to figure out which of the original search terms each one corresponds to.
Any suggestions?
Edit: Ok, generally I prefer describing my problems in the abstract, but I think some part of it is getting lost in the process, so I'll be more concrete.
I'm building a search engine for a site which needs to have several layers of search logic. A few example searches which I'll trace out are:
- Headphones
- Monster Headphones
- White Monster Headphones
- White Foobar Headphones
The index contains documents with seven fields - the relevant ones to this example are:
- "datatype": A string representing what type of item this document represents (product, category, brand), so we know how to display it
- "brand": The brand(s) that are relevant (categories have multiple brands, products and brands have one each)
- "path": The path to a given category (e.g. "Audio Headphones In-Ear" for "Audio > Headphones > In-Ear")
- "keywords": Various things that describe the product that don't go anywhere else.
In general, the logic for each step of the search is as follows:
- Check to see if we have a match.
- If so, filter the results based on that match, and continue parsing the rest of the search terms in the next step.
- If not, parse the search terms in the next step.
Each step is something like:
- Search for a category
- Search for a brand
- Search for keywords
So here's how those four example searches should play out:
- Headphones
  - Search for a category: `+path:headphones +datatype:Category`
  - There are matches (the Headphone category), and no words from the original query are left, so we return it.
- Monster Headphones
  - Search for a category: `+(path:monster path:headphones) +datatype:Category`
  - Matches were found for `path:headphones` and `datatype:Category`, leaving "Monster" unmatched
  - Search for a brand: `+path:headphones +brand:monster`
  - Matches were found for `path:headphones` and `brand:monster`, and no words from the original query are left, so we return all the headphones by Monster.
- White Monster Headphones
  - Search for a category: `+(path:monster path:headphones path:white) +datatype:Category`
  - Matches were found for `path:headphones` and `datatype:Category`, leaving "White" and "Monster" unmatched
  - Search for a brand: `+path:headphones +(brand:monster +brand:white)`
  - Matches were found for `path:headphones` and `brand:monster`, leaving "White" unmatched
  - Search keywords: `+path:headphones +brand:monster +keywords:white`
  - There are matches, and no words from the original query are left, so we return them.
- White Foobar Headphones
  - Search for a category: `+(path:foobar path:headphones path:white) +datatype:Category`
  - Matches were found for `path:headphones` and `datatype:Category`, leaving "White" and "Foobar" unmatched
  - Search for a brand: `+path:headphones +(brand:foobar +brand:white)`
  - Nothing was found, so we continue.
  - Search keywords: `+path:headphones +(keywords:white keywords:foobar)`
  - Matches were found for `path:headphones` and `keywords:white`, leaving "Foobar" unmatched
  - ... (continue searching other fields, including product description) ...
  - There are search terms still unmatched ("Foobar"), return "No results found"
The problem I have is twofold:
- I don't want the matches to continue once everything's matched (only products have descriptions, so once it reaches that step we'll never return something that's not a product). I could manage this by using denis's GetHitTerms from here, except that I then end up searching for the first matched term in all subsequent fields until everything matches (i.e. in example #2, I'd have `+path:headphones +(brand:headphones brand:monster)`).
- Despite my example above, my actual search query on the path field looks like `+path:headphon +datatype:Taxonomy` because I'm mangling it for searching. So I can't take the matched term and just remove that from the original query (because "headphon" != "headphones").
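One direction that might get around the second problem, untested against Lucene.NET: analyze each original word on its own instead of the whole string, so the link between the original word and its mangled form is never lost. The sketch below is plain Java. `toyAnalyze` is a made-up stemmer standing in for the real Analyzer, and `matches` is a hypothetical stand-in for an index lookup. It also assumes one token out per word in, which a real Analyzer does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.UnaryOperator;

class TermTrackingSketch {

    // Toy stand-in for the Analyzer (an assumption, not Lucene behavior):
    // lowercases the word and strips a trailing "es" or "s".
    static String toyAnalyze(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("es")) {
            return w.substring(0, w.length() - 2);
        }
        if (w.endsWith("s")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    /**
     * Returns the ORIGINAL words whose analyzed form did not hit in the field,
     * so they can be carried to the next field without any un-mangling.
     */
    static List<String> unmatchedOriginals(String field, List<String> originals,
                                           UnaryOperator<String> analyze,
                                           BiPredicate<String, String> matches) {
        List<String> unmatched = new ArrayList<>();
        for (String original : originals) {
            String analyzed = analyze.apply(original); // one word at a time
            if (!matches.test(field, analyzed)) {
                unmatched.add(original);
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        // Pretend the path field only contains the stemmed term "headphon".
        List<String> left = unmatchedOriginals(
                "path",
                List.of("Monster", "Headphones"),
                TermTrackingSketch::toyAnalyze,
                (field, term) -> field.equals("path") && term.equals("headphon"));
        System.out.println(left);
    }
}
```

Here "Headphones" is matched through its stemmed form "headphon", and "Monster" comes back in its original spelling, ready for the brand step.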
Hopefully that makes it clearer what I'm looking for.
References to adding faceted to Lucene.Net section in that page. There are other tricks utilizing Collector class to make a faceted search. See mail-archives.apache.org/mod_mbox/lucene-lucene-net-dev/… – L.B