2 votes

I have the names of all the employees of my company (5,000+). I want to write an engine that can find those names on the fly in online articles (blogs/wikis/help documents) and tag them with a "mailto" link containing the user's email address.

As of now I am planning to remove all the stop words from the article and then search for each remaining word in a Lucene index. But even then I see a lot of queries hitting the index: for example, an article with 2,000 words and only two references to people's names would still generate roughly 1,000 Lucene queries.

Is there a way to reduce these queries? Or a completely different way of achieving the same thing? Thanks in advance.

I am not sure I am following. Isn't the list of employees pre-defined? Aren't these names your queries? – amit
@amit The list of employees is 5,000 long. Are you asking if I should search for each name in the article? That would be 5,000 queries against a 2,000-word document. I was wondering about the other way around. – Sap
Do you have only one document? If you do, Lucene won't help you much. – amit
@amit No, I have lots of documents; I am using one doc as an example. But I want to do this on the fly: while a user is typing his wiki page in the preview area, it should mark each name with an email address as he types. – Sap
If I understand correctly, what you'd like to do is search your list of names for terms that people type, so that you can offer them suggestions of email addresses, etc. when the text they typed is the name of a person in your collection. Is that correct? – Gene Golovchinsky

2 Answers

5 votes

If you have only 5,000 names, I would just stick them into a hash table in memory instead of bothering with Lucene. You can key each person under several variants (nicknames, first-last, last-first, etc.) and still have a relatively small memory footprint and really efficient lookup performance.
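A minimal sketch of that idea in Java, using hypothetical names and addresses (in practice you would load these from your employee directory). Each employee is registered under a couple of name orderings, and the article is scanned by checking adjacent token pairs against the map, so it costs one O(1) lookup per pair rather than one Lucene query per word:

```java
import java.util.HashMap;
import java.util.Map;

public class NameTagger {

    // Registers common orderings of a name under the same email address.
    private static void register(Map<String, String> map,
                                 String first, String last, String email) {
        map.put((first + " " + last).toLowerCase(), email);
        map.put((last + " " + first).toLowerCase(), email);
    }

    public static void main(String[] args) {
        Map<String, String> emailByName = new HashMap<>();
        // Hypothetical entries; 5,000 of these is still a tiny map.
        register(emailByName, "Jane", "Doe", "jane.doe@example.com");
        register(emailByName, "John", "Smith", "john.smith@example.com");

        String article = "Yesterday Jane Doe presented the roadmap to John Smith.";
        String[] tokens = article.split("\\W+");

        // One hash lookup per adjacent token pair, instead of one
        // Lucene query per word.
        for (int i = 0; i + 1 < tokens.length; i++) {
            String candidate = (tokens[i] + " " + tokens[i + 1]).toLowerCase();
            String email = emailByName.get(candidate);
            if (email != null) {
                System.out.println(tokens[i] + " " + tokens[i + 1]
                        + " -> mailto:" + email);
            }
        }
    }
}
```

You could extend the same table with single-token keys for nicknames, at the cost of more false positives on common first names.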

1 vote

The Aho–Corasick string matching algorithm might be of use to you: http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm

The way this works is that you first compile the entire list of names into one big finite state machine (which may take a while), but once that state machine is built you can run it over as many documents as you want and detect names efficiently.

It looks at every character of each document only once, so it should be much more efficient than tokenizing the document and comparing each word to a list of known names.

There are a bunch of implementations available for different languages on the web. Check it out.
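For illustration, here is a minimal, self-contained Java sketch of the algorithm (not any particular library's API, and the names are hypothetical): a trie is built over the name list, failure links are added with a breadth-first pass, and then a single left-to-right scan of the text reports every name that occurs:

```java
import java.util.*;

// Minimal Aho–Corasick sketch: a trie over the name list plus
// failure links, then a single pass over the text.
public class AhoCorasick {

    private static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;
        List<String> outputs = new ArrayList<>(); // patterns ending here
    }

    private final Node root = new Node();

    // Adds one name to the trie.
    public void addPattern(String pattern) {
        Node node = root;
        for (char c : pattern.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.outputs.add(pattern);
    }

    // Computes failure links breadth-first; call once after all patterns are added.
    public void build() {
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != null && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                child.fail = (f == null) ? root : f.next.get(e.getKey());
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Scans the text once, printing every pattern occurrence found.
    public void search(String text) {
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            for (String match : node.outputs) {
                System.out.println(match + " at index " + (i - match.length() + 1));
            }
        }
    }

    public static void main(String[] args) {
        AhoCorasick ac = new AhoCorasick();
        // Hypothetical names; you would add all 5,000 employees here.
        ac.addPattern("Jane Doe");
        ac.addPattern("John Smith");
        ac.build();
        ac.search("Yesterday Jane Doe met John Smith in the lobby.");
    }
}
```

In practice you would also want case normalization and word-boundary checks so that a short name like "Al" doesn't match inside "Algorithm".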