14
votes

I am looking for a document search engine (like Xapian, Whoosh, Lucene, Solr, Sphinx or others) which is capable of searching partial terms.

For example, when searching for the term "brit", the search engine should return documents containing either "britney" or "britain", or in general any document containing a word that matches the wildcard pattern *brit*.

Tangentially, I noticed most engines use TF-IDF (Term frequency-Inverse document frequency) or its derivatives which are based on full terms and not partial terms. Are there any other techniques that have been successfully implemented besides TF-IDF for document retrieval?

1
I recommend that you add a search-engine tag to your question: lucene, Xapian, or at least search-engine. Search is a general tag, and people who are into search engines may get tired of reading all sorts of weird requests unrelated to search engines. Good luck! - shellter
Thanks for the suggestion, shellter. Added more tags. - GeneralBecos
Any reason you have not read the documentation of the various engines? Lucene (and therefore Solr) supports wildcard searches: wiki.apache.org/lucene-java/… - ewh

1 Answer

19
votes

With Lucene, you could implement this in several ways:

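1.) You can use a wildcard query such as *brit*. Because the pattern starts with a wildcard, you have to configure your query parser to allow leading wildcards (in the classic QueryParser, setAllowLeadingWildcard(true)).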

2.) You can create an additional field containing the N-grams of all the terms. This results in a larger index, but in many cases searching is faster.

3.) You can use fuzzy search to handle typing mistakes in the query, e.g. someone typed "britnei" but wanted to find "britney".
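To make option 2 more concrete, here is a minimal sketch of the N-gram idea in plain Python. It is not Lucene code: the function names (`ngrams`, `build_ngram_index`, `search_partial`), the whitespace tokenization, the choice of n = 3 and the sample documents are all assumptions made for illustration. The point is only that a substring query becomes a set intersection over N-gram postings, followed by a verification pass.

```python
from collections import defaultdict


def ngrams(term, n=3):
    """Return the set of character n-grams of a term (the whole term if it is shorter than n)."""
    term = term.lower()
    if len(term) <= n:
        return {term}
    return {term[i:i + n] for i in range(len(term) - n + 1)}


def build_ngram_index(docs, n=3):
    """Map every n-gram to the ids of the documents that have a term containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():  # naive whitespace tokenization (assumption)
            for gram in ngrams(term, n):
                index[gram].add(doc_id)
    return index


def search_partial(index, docs, query, n=3):
    """Return sorted ids of documents with at least one term containing `query`."""
    query = query.lower()

    def contains(doc_id):
        return any(query in term.lower() for term in docs[doc_id].split())

    if len(query) < n:
        # Queries shorter than n have no usable n-grams, so fall back to a full scan.
        return sorted(d for d in docs if contains(d))
    # Candidates must contain every n-gram of the query ...
    candidates = set.intersection(*(index.get(g, set()) for g in ngrams(query, n)))
    # ... but the grams may come from different terms, so verify each candidate.
    return sorted(d for d in candidates if contains(d))


# Hypothetical sample data
docs = {
    1: "Britney Spears released an album",
    2: "Great Britain is an island",
    3: "Paris is the capital of France",
    4: "A celebrity arrived late",
}
index = build_ngram_index(docs)
print(search_partial(index, docs, "brit"))  # [1, 2, 4]
```

Note that document 4 matches because "celebrity" contains "brit", which is exactly what a *brit* pattern asks for. The index holds one posting entry per distinct N-gram per document, which is where the extra index size comes from.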

For wildcard queries and fuzzy search, have a look at the query syntax docs.
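As a rough illustration of what wildcard and fuzzy matching mean at the term level, the sketch below applies both to a small vocabulary in plain Python. It does not use Lucene; `wildcard_match`, `fuzzy_match`, `edit_distance`, the `max_edits` parameter and the sample vocabulary are hypothetical names chosen for this example, and the fuzzy part uses plain Levenshtein distance, which may differ in detail from what a given engine implements.

```python
from fnmatch import fnmatchcase


def wildcard_match(terms, pattern):
    """Return the terms matching a shell-style wildcard pattern such as '*brit*'."""
    pattern = pattern.lower()
    return [t for t in terms if fnmatchcase(t.lower(), pattern)]


def edit_distance(a, b):
    """Levenshtein distance, computed with a two-row dynamic programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(terms, query, max_edits=1):
    """Return the terms within `max_edits` single-character edits of the query."""
    query = query.lower()
    return [t for t in terms if edit_distance(t.lower(), query) <= max_edits]


# Hypothetical vocabulary of indexed terms
vocabulary = ["britney", "britain", "celebrity", "paris"]
print(wildcard_match(vocabulary, "*brit*"))  # ['britney', 'britain', 'celebrity']
print(fuzzy_match(vocabulary, "britnei"))    # ['britney']
```

Both functions scan the whole vocabulary, which is fine for a demonstration but is the reason leading wildcards tend to be slow on large indexes, and why the N-gram field can be the faster option.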