31
votes

I'm using Lucene.net, but I am tagging this question for both .NET and Java versions because the API is the same and I'm hoping there are solutions on both platforms.

I'm sure other people have addressed this issue, but I haven't been able to find any good discussions or examples.

By default, Lucene is very picky about query syntax. For example, I just got the following error:

[ParseException: Cannot parse 'hi there!': Encountered "<EOF>" at line 1, column 9.
Was expecting one of:
    "(" ...
    "*" ...
    <QUOTED> ...
    <TERM> ...
    <PREFIXTERM> ...
    <WILDTERM> ...
    "[" ...
    "{" ...
    <NUMBER> ...
    ]
   Lucene.Net.QueryParsers.QueryParser.Parse(String query) +239

What is the best way to prevent ParseExceptions when processing queries from users? It seems to me that the most usable search interface is one that always executes a query, even if it might be the wrong query.

It seems that there are a few possible, and complementary, strategies:

  • "Clean" the query prior to sending it to the QueryParser
  • Handle exceptions gracefully
    • Show an intelligent error message to the user
    • Perhaps execute a simpler query, leaving off the erroneous bit

I don't really have any great ideas about how to implement any of those strategies. Has anyone else addressed this issue? Are there any "simple" or "graceful" parsers that I don't know about?

6 Answers

44
votes

You can make Lucene ignore the special characters by sanitizing the query with something like:

query = QueryParser.Escape(query)

If you do not want your users to ever use advanced syntax in their queries, you can do this always.

If you want your users to be able to use advanced syntax but you also want to be more forgiving of their mistakes, you should only sanitize after a ParseException has occurred.
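To make it concrete what escaping does to the query from the question, here is a minimal stand-alone sketch in Java. It is a hand-rolled stand-in, not Lucene's own code: the class name QueryEscaper and the list of special characters are assumptions for illustration, and in a real application you would call the library's QueryParser.escape (Java) or QueryParser.Escape (.NET) instead.

```java
/** Hand-rolled stand-in for the library's escape method; not Lucene's own code. */
final class QueryEscaper {
    // Characters assumed to be special in the classic query syntax.
    // Check the syntax documentation for your Lucene version.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Returns the input with a backslash in front of every special character. */
    static String escape(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 8);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\'); // neutralize the operator character
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

With this sketch, the query hi there! becomes hi there\! so the trailing ! is treated as literal text rather than as a NOT operator with nothing after it.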

8
votes

Well, the easiest thing to do would be to give the raw form of the query a shot, and if that fails, fall back to cleaning it up.

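// Parses the raw query; if that fails, strips punctuation and tries once more.
// ParseException lives in org.apache.lucene.queryParser (Lucene 3.x and earlier)
// or org.apache.lucene.queryparser.classic (Lucene 4 and later).
Query safe_query_parser(QueryParser qp, String raw_query)
  throws ParseException
{
  try {
    return qp.parse(raw_query);
  } catch (ParseException e) {
    // Fall through to the cleaned-up query below.
  }
  // Keep only word characters and whitespace. Note the doubled backslashes:
  // a Java string literal needs "\\w" to produce the regex \w.
  // Consider changing this "" to " " so that "foo:bar" becomes "foo bar".
  String cooked = raw_query.replaceAll("[^\\w\\s]", "");
  return qp.parse(cooked);
}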

This gives the raw form of the user's query a chance to run, but if parsing fails, we strip everything except letters, numbers, spaces and underscores; then we try again. We still risk throwing ParseException, but we've drastically reduced the odds.
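The difference between replacing with "" and replacing with " " is easiest to see on a small input. The following stand-alone sketch uses only the Java standard library; the class and method names (QueryCleaner, strip, stripToSpaces) are made up for illustration.

```java
/** Illustrates the two cleaning variants; the names here are hypothetical. */
final class QueryCleaner {
    /** Deletes every character that is neither a word character nor whitespace. */
    static String strip(String raw) {
        return raw.replaceAll("[^\\w\\s]", "");
    }

    /** Replaces each such character with a space, then tidies up the whitespace. */
    static String stripToSpaces(String raw) {
        return raw.replaceAll("[^\\w\\s]", " ").trim().replaceAll("\\s+", " ");
    }
}
```

For the input title:foo-bar, strip yields titlefoobar while stripToSpaces yields title foo bar, which is usually closer to what the user meant. Two caveats: in Java, \w matches only ASCII letters, digits and underscore unless Pattern.UNICODE_CHARACTER_CLASS is set, so accented letters are stripped too; and the words AND, OR and NOT survive cleaning, so a query that ends in a dangling operator can still fail to parse.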

You could also consider tokenizing the user's query yourself, turning each token into a term query, and glomming them together with a BooleanQuery. If you're not really expecting your users to take advantage of the features of the QueryParser, that would be the best bet. You'd be completely(?) robust, and users could search for whatever funny characters make it through your analyzer.

3
votes

FYI... Here is the code I am using for .NET

// Requires: using System.Text.RegularExpressions;
private Query GetSafeQuery(QueryParser qp, String query)
{
    try
    {
        return qp.Parse(query);
    }
    catch (Lucene.Net.QueryParsers.ParseException)
    {
        // Fall through and retry with a cleaned-up query.
    }

    // Replace everything except word characters, '.', '@' and '-' with a space.
    string cooked = Regex.Replace(query, @"[^\w\.@-]", " ");
    return qp.Parse(cooked);
}

1
votes

I'm in the same situation as you.

Here's what I do. I do catch the exception, but only so that I can make the error look prettier. I don't change the text.

I also provide a link to an explanation of the Lucene syntax which I have simplified a little bit:
http://ifdefined.com/btnet/lucene_syntax.html

1
votes

I do not know much about Lucene.net. For general Lucene, I highly recommend the book Lucene in Action. For the question at hand, it depends on your users. There are strong reasons, such as ease of use, security and performance, to limit your users' queries. The book shows ways to parse the queries using a custom parser instead of QueryParser. I second Jay's idea about the BooleanQuery, although you can build stronger queries using a custom parser.

1
votes

If you don't need all of Lucene's features, you might be better off writing your own query parser. It's not as complicated as it might seem at first.
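As a rough idea of how small such a parser can be, here is a sketch in plain Java that never throws on user input. It is only an illustration under assumptions: the class SimpleQueryParser, its Clause type, and the tiny syntax it accepts (bare words, "quoted phrases", and a leading - for exclusion) are all invented here and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical forgiving parser: words, "quoted phrases", and -excluded clauses. */
final class SimpleQueryParser {
    /** One parsed clause of the user's query. */
    static final class Clause {
        final String text;
        final boolean phrase;   // true if the clause was written in double quotes
        final boolean excluded; // true if the clause was prefixed with '-'

        Clause(String text, boolean phrase, boolean excluded) {
            this.text = text;
            this.phrase = phrase;
            this.excluded = excluded;
        }
    }

    /** Splits the input into clauses; malformed input degrades instead of failing. */
    static List<Clause> parse(String input) {
        List<Clause> clauses = new ArrayList<>();
        int i = 0;
        int n = input.length();
        while (i < n) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            boolean excluded = false;
            if (c == '-' && i + 1 < n && !Character.isWhitespace(input.charAt(i + 1))) {
                excluded = true;
                i++;
                c = input.charAt(i);
            }
            boolean phrase = (c == '"');
            int start;
            int end;
            if (phrase) {
                start = i + 1;
                end = input.indexOf('"', start);
                if (end < 0) {
                    end = n; // unterminated quote: treat the rest as the phrase
                }
                i = Math.min(end + 1, n);
            } else {
                start = i;
                end = start;
                while (end < n && !Character.isWhitespace(input.charAt(end))) {
                    end++;
                }
                i = end;
            }
            String text = input.substring(start, end).trim();
            if (!text.isEmpty()) {
                clauses.add(new Clause(text, phrase, excluded));
            }
        }
        return clauses;
    }
}
```

Each Clause could then be run through your analyzer and turned into a TermQuery or PhraseQuery, with the results combined in a BooleanQuery using MUST or MUST_NOT, along the lines of the BooleanQuery suggestion in the earlier answer. Because nothing in the input is interpreted as Lucene syntax, a query such as hi there! simply produces the two clauses hi and there!.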