I am using a Solr string field to provide exact matching on a text field (I've looked at the text field type and don't believe it will work for me: I want to exactly reproduce SQL's LIKE functionality, including spaces and wildcards).

My problem is that when I index large text fields, Solr returns no results when I search on them. The limit appears to be Int16 max (32,767).

As a test, I created an index with an id field and a string field. If the id field is "1" and the string field contains 40,000 characters:

  • id:1 returns both fields, including all 40,000 characters, showing that the document did get indexed.
  • string:* returns no results

If the string field only contains 30,000 characters everything seems to work fine.
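
A possible explanation, offered as an assumption rather than something confirmed in this thread: Lucene caps a single indexed term at 32,766 bytes of UTF-8 (IndexWriter.MAX_TERM_LENGTH), and a string field indexes its whole value as one term. The sketch below only measures a value against that assumed cap. The class and method names (TermLengthCheck, fitsInSingleTerm, repeat) are made up for illustration and are not part of Solr or Lucene.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TermLengthCheck {
    // Assumption: Lucene's per-term cap, counted in UTF-8 bytes, not characters.
    public static final int ASSUMED_MAX_TERM_BYTES = 32766;

    // Number of bytes the value occupies once encoded as UTF-8.
    public static int utf8Length(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when the whole value would fit into one term under the assumed cap.
    public static boolean fitsInSingleTerm(String value) {
        return utf8Length(value) <= ASSUMED_MAX_TERM_BYTES;
    }

    // Hypothetical helper: builds a test string of `count` copies of `c`.
    public static String repeat(char c, int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(fitsInSingleTerm(repeat('a', 30000)));      // 30,000 ASCII characters
        System.out.println(fitsInSingleTerm(repeat('a', 40000)));      // 40,000 ASCII characters
        System.out.println(fitsInSingleTerm(repeat('\u00e9', 20000))); // 20,000 two-byte characters
    }
}
```

If the assumption holds, non-ASCII text would hit the limit at fewer than 32,766 characters, because the cap is measured in bytes.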

I cannot find any documentation that states this is a limit, nor can I find any way around it, as the maxFieldLength was removed in 4.0 and the string field does not support analyzers.

Has anyone else run into this problem or found a workaround?

I am not sure about this, but it could be a workaround: try the "Text" fieldType with the "KeywordTokenizer", which should behave like your "string" field. - Mavellin

1 Answer

I spent a lot of time on this, so I am posting the solution I came up with in case anyone else runs into this problem. Using the text field type with the KeywordTokenizer did work like the string field, right down to the length limit that I ran into with the string field type.

Ultimately, I created a custom tokenizer after reading this thread with a couple of changes:

  1. He wanted the standard behavior, so his is based on the StandardTokenizer, whereas I wanted it to act like a string field. I first tried using the KeywordTokenizer, but still ran into limits, so ultimately I based mine off of the WhitespaceTokenizer (more below).

  2. The code there is out of date and doesn't work with Solr 4.0.

The code for the WhitespaceTokenizer is very short, and it contains a method called isTokenChar that returns !Character.isWhitespace(c). I simply changed this to always return true. After that, I created a TokenizerFactory to return it and referenced it in schema.xml the same way the linked thread did.

MyCustomTokenizer.java:

package custom.analysis;

import java.io.Reader;
import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.util.AttributeSource;
// In Lucene 4.0, AttributeFactory is a nested class of AttributeSource.
import org.apache.lucene.util.AttributeSource.AttributeFactory;
import org.apache.lucene.util.Version;

public final class MyCustomTokenizer extends CharTokenizer {

  public MyCustomTokenizer(Version matchVersion, Reader in) {
    super(matchVersion, in);
  }

  public MyCustomTokenizer(Version matchVersion, AttributeSource source, Reader in) {
    super(matchVersion, source, in);
  }

  public MyCustomTokenizer(Version matchVersion, AttributeFactory factory, Reader in) {
    super(matchVersion, factory, in);
  }

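  // Treat every character, whitespace included, as part of the token.
  // WhitespaceTokenizer returns !Character.isWhitespace(c) here instead.
  @Override
  protected boolean isTokenChar(int c) {
    return true;
  }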
}

MyCustomTokenizerFactory.java:

package custom.analysis;

import java.io.Reader;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.Version;

public class MyCustomTokenizerFactory extends TokenizerFactory {
  // Solr calls this to build a tokenizer for each field value it analyzes.
  @Override
  public MyCustomTokenizer create(Reader input) {
    return new MyCustomTokenizer(Version.LUCENE_40, input);
  }
}

schema.xml:

<fieldType name="text_block" class="solr.TextField" positionIncrementGap="100"> 
   <analyzer> 
     <tokenizer class="custom.analysis.MyCustomTokenizerFactory" />        
   </analyzer> 
</fieldType> 

Using this approach I was able to index large text fields (>100k characters) with functionality like the string field. If someone finds a better way, please post it!