I have a Solr field called verbatim that contains sentences with mobile numbers embedded in them. The field uses the text_general field type.

The requirement is that the verbatim field should not be searchable by mobile number (format XXX-XXX-XXXX). These are the approaches I was considering:

  1. Before sending to Solr, use pattern matching to find the phone number, replace it with "", and then index normally. But this means we are modifying the content. Also, since there are millions of records, doing this in Java for every record could add significant processing time.

  2. Send the data to Solr unchanged, and use pattern filters in the schema.xml field type definition (text_general_vision) to identify the phone number, as shown below. But I can still search with XXX or XXX-XXX-XXXX. Any help identifying the issue is appreciated. Thanks in advance.

    <fieldType name="text_general_vision" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.PatternReplaceFilterFactory" pattern="\\d{3}-\\d{3}-\\d{4}" replacement="" replace="all" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="\\d{3}-\\d{3}-\\d{4}" replacement="" replace="all" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    
When you say the field should not be searchable, do you mean when searching across all default fields? One way I do this is by excluding certain fields from the query fields (qf). Your field would still be searchable when you specify it in the query, but not when you search all fields. - Nathan Hall
The search would be on the verbatim field only. If verbatim is "Crazy mobile 123-456-7899", the document should match the terms crazy and mobile, but not the terms 123/456/7899/123-456-7899/1234567899. - Ramzy

1 Answer


The issue is that the filters you've provided run after tokenization. The pattern filter therefore never sees the complete phone number: the StandardTokenizer splits it on the - characters into separate tokens (123, 456 and 7899 for 123-456-7899), and the filter is applied to each token individually.

You can apply a PatternReplaceCharFilterFactory instead. Char filters run before tokenization, so the regular expression is matched against the raw field text and can remove the whole phone number.
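
As a rough sketch of what that could look like, here is the field type from the question with the pattern moved into a charFilter in the index analyzer. The name text_general_vision, the stopword and synonym files, and the phone pattern are taken from the question. The single backslash in the pattern is my assumption about the intent: the attribute value in schema.xml is passed to the regex engine as written, so `\\d` would match a literal backslash followed by d rather than a digit.

```xml
<fieldType name="text_general_vision" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Char filters see the raw field text, so the full XXX-XXX-XXXX
         sequence is still intact when the pattern is applied. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\d{3}-\d{3}-\d{4}" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Optional on the query side: strips a full phone number typed
         into the search box before it is tokenized. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\d{3}-\d{3}-\d{4}" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Documents indexed before the change keep their old tokens, so a reindex is needed before the numbers stop matching. The Analysis screen in the Solr admin UI is a convenient way to check which tokens a sample value such as "Crazy mobile 123-456-7899" produces.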

Keep in mind that the replacement still runs for every record. That cost is unavoidable, since it has to happen either for every record or for every query (records are usually fewer than queries, but YMMV). The difference is that the logic lives on the Solr side, so you don't have to keep every indexing client updated.

Remember that the phone number will still be returned in results if the field is stored, since the char filter only affects the indexed terms and not the stored value. That didn't seem to be an issue in your case.