12
votes

I would like to implement full-text-search in my off-line (android) application to search the user generated list of notes.

I would like it to behave just like Google (since most people are already used to querying to Google)

My initial requirements are:

  • Fast: like Google or as fast as possible, having 100000 documents with 200 hundred words each.
  • Searching for two words should only return documents that contain both words (not just one word) (unless the OR operator is used)
  • Case insensitive (aka: normalization): If I have the word 'Hello' and I search for 'hello' it should match.
  • Diacritical mark insensitive: If I have the word 'así' a search for 'asi' should match. In Spanish, many people, incorrectly, either do not put diacritical marks or fail in correctly putting them.
  • Stop word elimination: To not have a huge index meaningless words like 'and', 'the' or 'for' should not be indexed at all.
  • Dictionary substitution (aka: stem words): Similar words should be indexed as one. For example, instances of 'hungrily' and 'hungry' should be replaced with 'hunger'.
  • Phrase search: If I have the text 'Hello world!' a search of '"world hello"' should not match it but a search of '"hello world"' should match.
  • Search all fields (in multifield documents) if no field specified (not just a default field)
  • Auto-completion in search results while typing to give popular searches. (just like Google Suggest)

How may I configure a full-text-search engine to behave as much as possible as Google?

(I am mostly interested in Open Source, Java and in particular Lucene)

5
If you're pointed towards Lucene, which handles many of these out of the box and is extensible, can you be specific about what problem you're having?Michael Brewer-Davis
You do realize that Google employs en.wikipedia.org/wiki/Query_expansion and utilizes search history by other people. If everyone who searches for "computer mice" then end up clicking on yahoo.com (making it up), then over time yahoo.com will become the first search result for "computer mice".Hamish Grubijan
How many words do you have in your application? Brute force may be quite enough.Thorbjørn Ravn Andersen

5 Answers

15
votes

I think Lucene can address your requirements. You should also consider using Solr, which has similar functionality and is much easier to set up.

I will discuss each requirement separately, using Lucene. I believe Solr has similar mechanisms.

  • Fast: like Google or as fast as possible, having 100000 documents with 200 hundred words each.

This is a reasonable index size both for Lucene and Solr, enabling retrieval at several tens of milliseconds per query.

  • Searching for two words should only return documents that contain both words (not just one word) (unless the OR operator is used)

You can do that using a BooleanQuery with MUST as default in Lucene.

The next four requirements can be handled by customizing a Lucene Analyzer:

  • Case insensitive (aka: normalization): If I have the word 'Hello' and I search for 'hello' it should match.

A LowerCaseFilter can be used for this.

  • Diacritical mark insensitive: If I have the word 'así' a search for 'asi' should match. In Spanish, many people, incorrectly, either do not put diacritical marks or fail in correctly putting them.

This requires Unicode normalization followed by diacritic removal. You can build a custom Analyzer for this.

  • Stop word elimination: To not have a huge index meaningless words like 'and', 'the' or 'for' should not be indexed at all.

A StopFilter removes stop words in Lucene.

  • Dictionary substitution (aka: stem words): Similar words should be indexed as one. For example, instances of 'hungrily' and 'hungry' should be replaced with 'hunger'.

Lucene has many Snowball Stemmers. One of them may be appropriate.

  • Phrase search: If I have the text 'Hello world!' a search of '"world hello"' should not match it but a search of '"hello world"' should match.

This is covered by the Lucene PhraseQuery specialized query.

As you can see, Lucene covers all of the required functionality. To get a more general picture, I suggest the book Lucene in Action, The Apache Lucene Wiki or The Lucid Imagination Site.

3
votes

A lot of these behaviors are default for Lucene. The first (including all terms) is not, but you can force this behavior by setting the default operator:

MultiFieldQueryParser parser = new MultiFieldQueryParser(fields, new StandardAnalyzer());
parser.setDefaultOperator(QueryParser.AND_OPERATOR);

I know that items 2, 4, and 6 are possible, and IIRC, they happen by default. I'm not sure about items 3 and 5, but Lucene offers a ton of customization options, so I'd suggest implementing a proof-of-concept with your data to see if it meets these requirements as well.

1
votes

Buy a Google Search Appliance. Or, as the comments say, use Lucene like you already mentioned.

0
votes

HyperSQL is a pure-java SQL implementation that can be ran quite easily, as can SQLite. You could use their full-text capabilities and querying to re-create the wheel, but as the other commenters have pointed out an existing implementation is probably best.

-1
votes

Unless you buy a search engine, you have Lucene, Nutch, Apache Solr and few others.