6
votes

I have a Java-based application and a set of keywords in a MySQL database (about 3M keywords in total; each may consist of more than one word, e.g. “memory”, “old house”, “European Union law”).

The user interacts with the application by uploading a document containing arbitrary text (usually several pages). I want to find out whether, and where, any of the 3 million keywords appear in the document.

I have tried looping over the keywords and searching the document for each one, but this is not efficient at all. Is there a library that performs this search in a more time-efficient manner?

I would greatly appreciate any help.

3
What about storing a hash of each keyword in a column next to the keyword and, while reading the document, checking each word with, for example, select keyword from keywords where keyword_hash = calculateHash(wordToCheck)? - rzysia
What you need to consider is which path is shorter: doing 3 million searches, or building N phrases from the uploaded document. One solution could be to construct a single search from all 3M keywords and run it against the document. Use Lucene's keyword highlighter and match all highlighted words against the 3M keywords ;) - Andreas Lundberg
Is there a way to get multiple keyword results within the same extracted portion of text in the highlighter? Or, even better, is there a structure that can return the list of matched keywords found within the file? - Nikolaos Papadakis
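
The hash-lookup idea from the comments can also be done entirely in memory, which avoids one SQL round trip per word. The following is only a minimal sketch under stated assumptions: the keywords have been loaded from MySQL into a HashSet, matching is case-insensitive, and words are split on anything that is not a letter or digit. The class KeywordScanner and its methods are hypothetical names, not part of any library.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper: class and method names are illustrative only.
final class KeywordScanner {
    private final Set<String> keywords = new HashSet<>();
    private int maxWords = 1; // word count of the longest keyword

    KeywordScanner(Iterable<String> rawKeywords) {
        for (String raw : rawKeywords) {
            String[] words = tokenize(raw);
            if (words.length == 0) continue;
            keywords.add(String.join(" ", words)); // normalized form
            maxWords = Math.max(maxWords, words.length);
        }
    }

    // Assumption: lowercase, then split on anything that is not a letter or digit.
    static String[] tokenize(String text) {
        String s = text.toLowerCase(Locale.ROOT).replaceAll("^[^\\p{L}\\p{N}]+", "");
        return s.isEmpty() ? new String[0] : s.split("[^\\p{L}\\p{N}]+");
    }

    // Maps each keyword found to the word indexes at which it starts.
    Map<String, List<Integer>> scan(String document) {
        String[] words = tokenize(document);
        Map<String, List<Integer>> hits = new TreeMap<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder phrase = new StringBuilder();
            // Try every phrase of 1..maxWords words beginning at this position.
            for (int len = 1; len <= maxWords && start + len <= words.length; len++) {
                if (len > 1) phrase.append(' ');
                phrase.append(words[start + len - 1]);
                String candidate = phrase.toString();
                if (keywords.contains(candidate)) {
                    hits.computeIfAbsent(candidate, k -> new ArrayList<>()).add(start);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        KeywordScanner scanner =
                new KeywordScanner(List.of("memory", "old house", "European Union law"));
        System.out.println(
                scanner.scan("The old house is covered by European Union law; memory fades."));
    }
}
```

The work is proportional to the number of words in the document times the longest keyword length in words, rather than to the 3M keywords. Whether 3M strings fit comfortably in the heap depends on the keyword lengths and the JVM settings.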

3 Answers

5
votes

The Apache Lucene project may be helpful.

Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

You can find some useful tutorials here.

1
votes

You could try using a Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter). Build the filter from the keywords, then check each word or phrase from the document against it to find candidate matches. Keep in mind that a Bloom filter can return false positives (but never false negatives). Therefore, once the filter has produced its positives, confirm them with a SQL query such as 'select keyword from keywordtable where keyword in (positives from bloom filter)' to identify exactly which keywords are present in the uploaded document.

A Java implementation of a Bloom filter is available in the Guava library: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/hash/BloomFilter.html
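
The two-step check described above (filter first, then confirm) might look roughly like the sketch below. To keep it runnable without dependencies, it uses a small hand-rolled filter built on java.util.BitSet instead of Guava's BloomFilter (which offers put and mightContain for the same purpose). The class TinyBloomFilter, the filter size, the hash functions, and the in-memory Set standing in for the MySQL keyword table are all assumptions made for illustration.

```java
import java.util.BitSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a library Bloom filter; sizes are not tuned.
final class TinyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    TinyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    void add(String s) {
        int h1 = s.hashCode(), h2 = fnv1a(s);
        // Double hashing: derive hashCount bit positions from two base hashes.
        for (int i = 0; i < hashCount; i++) bits.set(Math.floorMod(h1 + i * h2, size));
    }

    // false means "definitely absent"; true means "possibly present".
    boolean mightContain(String s) {
        int h1 = s.hashCode(), h2 = fnv1a(s);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, size))) return false;
        }
        return true;
    }

    private static int fnv1a(String s) {
        int h = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x01000193;
        }
        return h;
    }

    static String[] tokenize(String text) {
        String s = text.toLowerCase(Locale.ROOT).replaceAll("^[^\\p{L}\\p{N}]+", "");
        return s.isEmpty() ? new String[0] : s.split("[^\\p{L}\\p{N}]+");
    }

    // keywordTable stands in for the MySQL table; its entries are assumed lowercase.
    static Set<String> findKeywords(Set<String> keywordTable, String document, int maxWords) {
        TinyBloomFilter filter = new TinyBloomFilter(1 << 16, 3);
        for (String keyword : keywordTable) filter.add(keyword);

        String[] words = tokenize(document);
        Set<String> found = new TreeSet<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder phrase = new StringBuilder();
            for (int len = 1; len <= maxWords && start + len <= words.length; len++) {
                if (len > 1) phrase.append(' ');
                phrase.append(words[start + len - 1]);
                String candidate = phrase.toString();
                // Cheap in-memory test first; the exact lookup (the SQL query
                // in the answer) runs only for the filter's positives.
                if (filter.mightContain(candidate) && keywordTable.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> table = Set.of("memory", "old house", "european union law");
        System.out.println(findKeywords(table, "The old house is covered by European Union law.", 3));
    }
}
```

In a real setup the filter would be built once from the 3M keywords and kept in memory, and only the candidates it lets through would be sent to the database in a single IN (...) query.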

1
votes

You can also use the Lemur Project, which is available on SourceForge:

The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software, including the Indri search engine and ClueWeb09 dataset.

As Taher recommended, Apache Lucene is also a nice tool. I've used both of them and they're great.