5
votes

I'm having problems getting a simple URL to tokenize properly so that it can be searched as expected.

I'm indexing "http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm" with the StandardAnalyzer and it is tokenizing the string as the following (debug output):

(http,0,4,type=<ALPHANUM>)
(news.bbc.co.uk,7,21,type=<HOST>)
(sport1/hi,22,31,type=<NUM>)
(football,32,40,type=<ALPHANUM>)
(internationals/8196322.stm,41,67,type=<NUM>)

In general it looks good: http itself, then the hostname, but the issue seems to come with the forward slashes. Surely it should treat them as separate words?

What do I need to do to correct this?
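(For reference, output like the above can be reproduced with a small token dump along these lines; this is a minimal sketch assuming the Lucene 2.x Java API, where TokenStream.next() returns a Token, and the Lucene.NET calls are analogous.)

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class DumpTokens {
    public static void main(String[] args) throws Exception {
        String url = "http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm";
        TokenStream ts = new StandardAnalyzer().tokenStream("url", new StringReader(url));
        // Token.toString() prints (term,startOffset,endOffset,type=...)
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t);
        }
    }
}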

Thanks

P.S. I'm using Lucene.NET but I really don't think it makes much of a difference with regards to the answers.


2 Answers

5
votes

The StandardAnalyzer, which uses the StandardTokenizer, doesn't tokenize URLs (although it recognizes emails and treats them as a single token). What you are seeing is its default behaviour: splitting on various punctuation characters. The simplest solution might be to write a custom Analyzer and supply a UrlTokenizer, which extends/modifies the code in StandardTokenizer, to tokenize URLs. Something like:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class MyAnalyzer extends Analyzer {

    public MyAnalyzer() {
        super();
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // MyUrlTokenizer is the custom tokenizer described below; it does not ship with Lucene.
        TokenStream result = new MyUrlTokenizer(reader);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, StandardAnalyzer.STOP_WORDS);
        result = new SynonymFilter(result); // optional, user-supplied synonym filter

        return result;
    }
}

Where the UrlTokenizer splits on /, -, _ and whatever else you want. Nutch may also have some relevant code, but I don't know if there's a .NET version.
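For illustration, here is a minimal sketch of what such a tokenizer might look like, assuming the older CharTokenizer API from Lucene 2.x/3.0 (isTokenChar(char) decides where tokens break); the class name and the exact set of split characters are just examples:

import java.io.Reader;

import org.apache.lucene.analysis.CharTokenizer;

public class MyUrlTokenizer extends CharTokenizer {

    public MyUrlTokenizer(Reader reader) {
        super(reader);
    }

    @Override
    protected boolean isTokenChar(char c) {
        // Break tokens on common URL punctuation; everything else is kept in the token.
        return !(c == '/' || c == '.' || c == ':' || c == '-' || c == '_');
    }
}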

Note that if you have a distinct fieldName for URLs, then you can modify the above code to use the UrlTokenizer for that field and the StandardTokenizer by default.

e.g.

public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = null;
    if (fieldName.equals("url")) {
        result = new MyUrlTokenizer(reader);
    } else {
        result = new StandardTokenizer(reader);
    }
    return result;
}
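For completeness, a hedged sketch of plugging the custom analyzer into indexing, using the Lucene 2.x-era Java API (RAMDirectory and the "url" field name are only illustrative; the Lucene.NET equivalents are nearly identical):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class IndexUrlExample {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new MyAnalyzer();
        IndexWriter writer = new IndexWriter(new RAMDirectory(), analyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        // The url field is analyzed with MyAnalyzer, so the URL is split into searchable terms.
        doc.add(new Field("url",
                "http://news.bbc.co.uk/sport1/hi/football/internationals/8196322.stm",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();
    }
}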
1
vote

You should parse the URL yourself (I imagine there's at least one .NET class that can parse a URL string and tease out the different elements), then add those elements (such as the host, or whatever else you're interested in filtering on) as Keywords; don't Analyze them at all.
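For illustration, a minimal sketch of that approach, written in Java for consistency with the answer above (in .NET you would use System.Uri and the Lucene.Net field classes); the field names and the NOT_ANALYZED choice are assumptions, not from the original answer:

import java.net.URI;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class UrlFieldsExample {
    public static Document toDocument(String url) throws Exception {
        URI uri = new URI(url);
        Document doc = new Document();
        // Store the raw URL and index the parts you want to filter on as keywords,
        // bypassing the analyzer entirely.
        doc.add(new Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("host", uri.getHost(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("path", uri.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }
}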