
My application uses Lucene.NET to index various text files. Since each text file is different in structure, the entire content of each file is stored in a single "content" field.

Some of the text files contain URLs, e.g.:

http://domain1.co.uk/blah
http://domain2.co.ru/blahblah

etc.

The code I use to index each file is:

Lucene.Net.Documents.Field fldContent = new Lucene.Net.Documents.Field(
    "content",
    contents,
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.TOKENIZED,
    Lucene.Net.Documents.Field.TermVector.YES);

Where "contents" is the file contents.

When querying the index, Lucene returns results only when I search for the exact domain name (e.g. domain1.co.uk); nothing is returned for a partial domain name (e.g. domain1.co). The code used to build the query is:

Lucene.Net.Index.Term searchTerm = new Lucene.Net.Index.Term("content", "domain1.co");
Lucene.Net.Search.Query query = new Lucene.Net.Search.TermQuery(searchTerm);

Do you have any idea why I must search using the exact domain name?
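As a rough illustration of what is likely happening (plain Python standing in as an approximation, since the exact behavior depends on which Analyzer is in use): Lucene's classic StandardTokenizer tends to emit a host name such as domain1.co.uk as a single token (token type <HOST>), and a TermQuery matches whole tokens only, never substrings:

```python
import re

def index_tokens(text):
    # Assumed approximation of Lucene's classic StandardTokenizer:
    # a dotted host name is kept together as ONE token, not split on ".".
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)+|[a-z0-9]+", text.lower())

tokens = index_tokens("http://domain1.co.uk/blah")
print(tokens)                      # ['http', 'domain1.co.uk', 'blah']

# TermQuery does whole-token matching only:
print("domain1.co.uk" in tokens)   # True  -- exact domain matches
print("domain1.co" in tokens)      # False -- partial domain finds nothing
```

Under that assumption, "domain1.co" simply never exists as a token in the index, which matches the behavior described above.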


2 Answers


Which Analyzer are you specifying for your IndexWriter? Telling Lucene to tokenize a field won't do you any good if it's being tokenized the wrong way. For what you want, it sounds like you need to make sure your tokenizer splits on "." (and possibly also generates n-grams, though that may not be necessary). Look into the various analyzers available and see which one's tokenization behavior gets you closest to what you want; otherwise, you can always write a custom Analyzer.

Make sure you use the same Analyzer for indexing as for searching. If you index "domain1.co.uk", which turns into the tokens "domain1 co uk", and you search for "domain1.co", which turns into "domain1 co", you will have a match there, whereas the untokenized query "domain1.co" would not match.
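A minimal sketch of that index-time/search-time symmetry, in plain Python rather than Lucene.NET (the dot-splitting tokenizer here is an assumed stand-in that breaks on any non-alphanumeric character, and phrase_match approximates a phrase-style query over adjacent tokens):

```python
import re

def dot_splitting_analyze(text):
    # Assumed stand-in tokenizer: split on every non-alphanumeric character,
    # so "domain1.co.uk" becomes ["domain1", "co", "uk"].
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def phrase_match(indexed_tokens, query_tokens):
    # True if the query tokens occur as a consecutive run in the indexed tokens,
    # approximating a phrase query.
    n = len(query_tokens)
    return any(indexed_tokens[i:i + n] == query_tokens
               for i in range(len(indexed_tokens) - n + 1))

# Run the SAME analysis at index time and at query time:
indexed = dot_splitting_analyze("http://domain1.co.uk/blah")
query = dot_splitting_analyze("domain1.co")

print(indexed)                        # ['http', 'domain1', 'co', 'uk', 'blah']
print(query)                          # ['domain1', 'co']
print(phrase_match(indexed, query))   # True -- the partial domain now matches
```

The key point is that both sides go through the same analyzer; if only the index side split on ".", the query term "domain1.co" would still match nothing.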


The StandardAnalyzer/Tokenizer is the culprit here: it does its best to make URLs searchable, but in this case it will not match a partial hostname. The standard approach is to create a custom analyzer/tokenizer; for this I can point you to another SO question with a similar problem and solution.