
Some of the documents I store in Lucene have fields that contain file paths or URIs. I'd like users to be able to retrieve these documents if their query terms contain a path or URI segment.

For example, if the path is

C:\home\user\research\whitepapers\analysis\detail.txt

I'd like the user to be able to find it by querying for path:whitepapers.

Likewise, if the URI is

http://www.stackoverflow.com/questions/ask

a query containing uri:questions should retrieve it.

Do I need to use a special analyzer for these fields, or will StandardAnalyzer do the job? Will I need to pre-process these fields (for example, by replacing the forward slashes or backslashes with spaces)?

Suggestions welcome!


1 Answer


You can use StandardAnalyzer. I tested this by adding the following method to Lucene's TestStandardAnalyzer.java:

public void testBackslashes() throws Exception {
  // Windows path: each backslash ends a token; "detail.txt" stays whole.
  assertAnalyzesTo(a, "C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt",
      new String[]{"c", "home", "user", "research", "whitepapers", "analysis", "detail.txt"});
  // URI: each forward slash ends a token; the host name stays whole.
  assertAnalyzesTo(a, "http://www.stackoverflow.com/questions/ask",
      new String[]{"http", "www.stackoverflow.com", "questions", "ask"});
}

This unit test passed with Lucene 2.9.1, so you may want to rerun it against your own Lucene version. It appears to do what you want: each path or URI segment becomes its own lowercased token, while domain names and file names stay unbroken. Did I mention that I like unit tests?
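
If you would still rather pre-process the field yourself, as the question suggests, here is a minimal sketch that uses only the Java standard library. The class PathSegments and its methods are hypothetical names made up for illustration; they are not part of Lucene. It only replaces slashes and backslashes with spaces and splits on whitespace, so prefixes such as "C:" and "http:" keep their colon, and a path that already contains spaces would be split on those too.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: splits a path or URI into segments.
final class PathSegments {

    // Replace every backslash and forward slash with a single space.
    static String slashesToSpaces(String pathOrUri) {
        return pathOrUri.replace('\\', ' ').replace('/', ' ');
    }

    // Split the space-separated form into non-empty segments.
    static List<String> segments(String pathOrUri) {
        List<String> out = new ArrayList<>();
        for (String s : slashesToSpaces(pathOrUri).trim().split("\\s+")) {
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(segments("C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt"));
        System.out.println(segments("http://www.stackoverflow.com/questions/ask"));
    }
}
```

The output of slashesToSpaces could then be stored in the path or uri field in place of the raw string. Given the test above, this step looks optional with StandardAnalyzer, but it may help if you switch to an analyzer that does not split on slashes.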