I spent a lot of time on this, so I am posting the solution I came up with in case anyone else runs into this problem. Using the text field type with the KeywordTokenizer did work like the string field, right down to the length limit that I ran into with the string field type.
Ultimately, I created a custom tokenizer after reading this thread, with a couple of changes:
1. He wanted the standard behavior, so his is based on the StandardTokenizer, whereas I wanted it to act like a string field. I first tried using the KeywordTokenizer, but still ran into limits, so I based mine on the WhitespaceTokenizer instead (more below).
2. The code there is out of date and doesn't work with Solr 4.0.
The code for the WhitespaceTokenizer is very short, and it contains a method called isTokenChar that returns !Character.isWhitespace(c). I simply changed this to always return true. After that, I created a TokenizerFactory to return it and referenced it in schema.xml the same way that the linked thread did.
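To make the effect of that one-line change concrete, here is a minimal plain-Java sketch of predicate-driven tokenizing. It is not Lucene's CharTokenizer (the real class also deals with buffering, offsets, and normalization, which this omits); the class name PredicateTokenizerSketch and its tokenize helper are made up for illustration. It only shows that a predicate rejecting whitespace splits the input into words, while a predicate that always returns true keeps the whole input together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Plain-Java illustration only; this is NOT Lucene's CharTokenizer.
// The class and method names here are hypothetical.
final class PredicateTokenizerSketch {

    // Splits text into maximal runs of code points accepted by isTokenChar.
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar.test(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                // A rejected character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "large block of text";
        // Whitespace-style predicate: prints [large, block, of, text]
        System.out.println(tokenize(text, c -> !Character.isWhitespace(c)));
        // Always-true predicate: prints [large block of text]
        System.out.println(tokenize(text, c -> true));
    }
}
```

With the always-true predicate nothing ever ends a token, which is the string-like behavior I was after.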
MyCustomTokenizer.java:
package custom.analysis;

import java.io.Reader;

import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.util.AttributeSource;
// In Lucene 4.0, AttributeFactory is a nested class of AttributeSource.
import org.apache.lucene.util.AttributeSource.AttributeFactory;
import org.apache.lucene.util.Version;

public final class MyCustomTokenizer extends CharTokenizer {

    public MyCustomTokenizer(Version matchVersion, Reader in) {
        super(matchVersion, in);
    }

    public MyCustomTokenizer(Version matchVersion, AttributeSource source, Reader in) {
        super(matchVersion, source, in);
    }

    public MyCustomTokenizer(Version matchVersion, AttributeFactory factory, Reader in) {
        super(matchVersion, factory, in);
    }

    // Treat every character, including whitespace, as part of the token.
    @Override
    protected boolean isTokenChar(int c) {
        return true; // WhitespaceTokenizer returns !Character.isWhitespace(c)
    }
}
MyCustomTokenizerFactory.java:
package custom.analysis;

import java.io.Reader;

import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.Version;

public class MyCustomTokenizerFactory extends TokenizerFactory {

    // Called by Solr to build the tokenizer for each field value.
    @Override
    public MyCustomTokenizer create(Reader input) {
        return new MyCustomTokenizer(Version.LUCENE_40, input);
    }
}
schema.xml:
<fieldType name="text_block" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="custom.analysis.MyCustomTokenizerFactory" />
  </analyzer>
</fieldType>
Using this approach, I was able to index large text fields (>100k characters) that behave like the string field. If someone finds a better way, please post it!