I have a Solr field called verbatim that contains sentences with mobile numbers embedded in them. The field uses the text_general field type.
The requirement is that the verbatim field must not be searchable by mobile number (format XXX-XXX-XXXX). I have considered the following two approaches.
Before sending the data to Solr, use pattern matching to find each phone number, replace it with "", and then index as usual. However, this modifies the content, and because there are millions of records, doing this in Java for every record could add noticeable processing time.
Send the data to Solr unchanged, and use pattern filters in the field type definition in schema.xml (text_general_vision) to identify phone numbers, as shown below. However, I can still find documents by searching for XXX or XXX-XXX-XXXX. Any help identifying the issue is appreciated. Thanks in advance.
<fieldType name="text_general_vision" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="\\d{3}-\\d{3}-\\d{4}" replacement="" replace="all" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\\d{3}-\\d{3}-\\d{4}" replacement="" replace="all" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
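
For comparison, the first approach (stripping the numbers before the text is sent to Solr) could look roughly like the sketch below. The class name PhoneStripper and the method stripPhones are hypothetical names used only for illustration, and the pattern assumes numbers always appear exactly as XXX-XXX-XXXX and are not part of a longer run of digits. The pattern is compiled once and reused, which should keep the per-record cost small, although I have not measured it on millions of records.

```java
import java.util.regex.Pattern;

public class PhoneStripper {
    // Assumption: phone numbers look exactly like XXX-XXX-XXXX.
    // The lookarounds keep the pattern from matching inside a longer digit run.
    private static final Pattern PHONE =
            Pattern.compile("(?<!\\d)\\d{3}-\\d{3}-\\d{4}(?!\\d)");

    // Returns a copy of the text with every phone number removed.
    // The input string itself is not modified.
    public static String stripPhones(String verbatim) {
        if (verbatim == null) {
            return null;
        }
        return PHONE.matcher(verbatim).replaceAll("");
    }

    public static void main(String[] args) {
        String original = "Call me at 555-123-4567 tomorrow";
        // Only the stripped copy would be sent to the searchable field.
        System.out.println(stripPhones(original));
    }
}
```

Since stripPhones returns a new string, the original text could still be kept in a separate stored, non-indexed field if the unmodified content has to be preserved.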