3 votes

The string is:

<GET:notes/count><GET:notes/search_note><GET:util/codemaps/([^/]+?)><GET:users/pending_requests><GET:users/pending_activation><GET:users/firstnames><GET:users/profile><GET:tasks/tasks/count><GET:school/schools/count><GET:school/classrooms/count><GET:quiz/count><GET:quiz/quizset/count><GET:notes/([^/]+?)><GET:locations/counties/count><GET:lesson/books/count><GET:general/codemaps/([^/]+?)><GET:discussions/topics/count><GET:admin/sessions><GET:admin/sessions/count><GET:admin/sessions/([^/]+?)><PUT:content/actions><POST:content/html/totext><GET:content/multimedia/images/([^/]+?)/([^/]+?)>

My query is:

<pre>log_message:"*emaps/\(\[\^/\]\+\?\)\>*"</pre>

Here log_message is the field, and its type is text_std_token_lower_case, defined as:
<fieldType name="text_std_token_lower_case" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
What are you trying to match in that string? Usually you use Solr to avoid the need for regular expressions: the idea is to create a tokenizer that produces the tokens (words) you want to find. – cheffe
Ah, that <pre> ... </pre> around your query was an attempt to format your question, right? It is not really part of your query. – cheffe

1 Answer

1 vote

The tokenizer you have chosen (StandardTokenizerFactory) discards punctuation characters. You can see this on the Analysis page in the Solr admin UI. This affects the tokenization of both your query and your field, so you will need a tokenizer that does not drop the punctuation.
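You can check this quickly by pasting one of the values into the Analysis screen. With StandardTokenizerFactory plus LowerCaseFilterFactory, a value like <GET:util/codemaps/([^/]+?)> ends up as a few lowercase word tokens (roughly get, util, codemaps), and every <, >, :, /, [, ], (, ), ^, + and ? is thrown away. That is why a wildcard query built out of those punctuation characters can never match, while a plain term query does, for example:

log_message:codemaps

No amount of escaping in the query can bring the punctuation back, because it was never indexed.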

One possible option is the regular expression tokenizer documented on the Solr wiki (https://cwiki.apache.org/confluence/display/solr/Tokenizers). Perhaps you are looking for something like this:

<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="(>?&lt;(PUT|GET|POST):)|>\s"/>
</analyzer>
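Wrapped into a complete field type, that could look like the sketch below. The name text_url_paths is only illustrative, and note that the < inside the pattern attribute has to be written as &lt; so that schema.xml stays well-formed XML:

<fieldType name="text_url_paths" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer>
    <!-- Split on the <GET:, <POST:, <PUT: prefixes and on "> " boundaries.
         group is left at its default of -1, so the pattern acts purely as a
         delimiter and each path, regex part included, survives roughly as one token. -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="(>?&lt;(PUT|GET|POST):)|>\s"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Because the punctuation now survives analysis, a wildcard or regex query against a fragment such as codemaps/([^/]+?) has tokens it can actually hit; check the Analysis screen to see exactly what the tokens look like before fine-tuning the query escaping.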

That may require some tweaking if the URLs can contain > characters that are not %-encoded, if HEAD requests are possible, and so on. I am also not confident this will perform well, since regular expressions can become expensive; if it bogs things down, you might need to write your own tokenizer.
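As one example of that kind of tweak, the verb alternation can simply be widened. The extra verbs below are an assumption about your data, not something taken from the string you posted:

<analyzer>
  <!-- Same splitting idea, but also treating HEAD, DELETE and PATCH prefixes as delimiters. -->
  <tokenizer class="solr.PatternTokenizerFactory"
             pattern="(>?&lt;(GET|POST|PUT|DELETE|HEAD|PATCH):)|>\s"/>
</analyzer>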