
We have a SOLR v5.5.0 server that we have loaded with documents. Each of the SOLR fields is copied into a composite field that we want to search against.

For example in our schema we have:

<field name="Key" type="int" indexed="true" stored="true" required="true"/>
<field name="_version_" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="Name" type="text_suggest_ngram" indexed="true" stored="true" required="false"/>
<field name="EmailAddress" type="text_email" indexed="true" stored="true" required="false"/>
<field name="Indexing" type="text_suggest_ngram" indexed="true" stored="true" multiValued="true"/>

There are about 20 different fields. Each field is copied into the Indexing field:

<copyField source="Key" dest="Indexing"/>
<copyField source="Name" dest="Indexing"/>
<copyField source="EmailAddress" dest="Indexing"/>

The custom field type is given the following tokenisers:

<fieldType name="text_email" class="solr.TextField"/>

<fieldType name="text_suggest_ngram" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.ASCIIFoldingFilterFactory"/>
            <filter class="solr.EnglishPossessiveFilterFactory"/>
            <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="2"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.ASCIIFoldingFilterFactory"/>
            <filter class="solr.EnglishPossessiveFilterFactory"/>
        </analyzer>
</fieldType>

Hence the Indexing field becomes a multi-value field. We run our searches against this field because our general search functionality must be able to search across all fields.

When we import data into SOLR and then do a search, some records work as expected. For example, if we search for an email address (e.g. select?q=Indexing%3Asomeone%40example.com), SOLR provides the correct document back.

However, for other documents, SOLR returns 0 results (especially when searching on email addresses). For example, a search for secondexample@example.com finds no documents, but changing the query to secondexample finds the document. Changing the query to secondexample@e again finds no documents. If we do a field search against the EmailAddress field (select?q=EmailAddress%3Asecondexample%40example.com), the search succeeds as expected.

We don't want to encode the search for specific named fields as the field names are subject to change and changing our search service each time is undesirable.

Is there any way to find out why SOLR does not search multi-value fields correctly?

Update: Sample JSON document (content fuzzed for security):

{
    "Phone": "555",
    "IndexText": [
        "555",
        "7854",
        "",
        "Main App",
        "16",
        "Life MTG L",
        "New MTG LL",
        "Application",
        "574",
        "574",
        "[email protected]",
        "",
        "",
        "M M S N",
        "Open",
        "P",
        "3876 K E 4 O N W 2619 S B",
        "",
        "A",
        "6055 C P E 32 L S C P B G 1501 S B",
        "S I N",
        "1597456 1254735"
    ],
    "Id": "7854",
    "Name": "Open",
    "WP": "",
    "OK": "16",
    "HP": "574",
    "LK": 1048808,
    "FN": "",
    "PN": "",
    "TN": "",
    "FN2": "MS",
    "LN2": "M M S N",
    "CL": "2",
    "Type": "P",
    "Laddr": "3876 K E 4 O N W 2619 S B",
    "EmailAddress": "[email protected]",
    "LES": "A",
    "PA": "6055 C P E 32 L S C P B G 1501 S B",
    "LIT": "S I N",
    "S": "N",
    "Acc": "1597456 1254735",
    "_version_": "1557490405902123010",
    "score": 11.771251
}

The fields and content have been edited from real data, but it gives the idea. The real field names and content are longer words. This is taken from the SOLR admin search interface.

looks weird, could you show sample document? - Mysterion
Note that "secondexample@example.com" is longer than your maxGramSize. - femtoRgon
@Mysterion - Added a sample document. The documents contain personal information so can't post the real data. - user626201
@femtoRgon - does the maxGramSize affect the full text search in that it only matches the prefix? - user626201

1 Answer


Ok - so there appear to be two errors in our configuration.

  1. Gram size on EdgeNGramFilterFactory too small

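As @femtoRgon points out, the maximum gram size is too small. The index analyzer only emits edge n-grams of up to maxGramSize="20" characters, while the query analyzer has no n-gram filter and leaves a full email address as a single token. An address longer than 20 characters therefore never matches any indexed gram. After increasing maxGramSize, a search for the full email address correctly finds the document.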
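As a rough sketch of that change, the index-time analyzer might look like the fragment below. The value maxGramSize="50" is an assumption chosen for illustration (the value actually used is not stated here); it only needs to be at least as long as the longest value you expect to search for in full.

```xml
<!-- Index-time analyzer of text_suggest_ngram; the query analyzer stays as it was.
     maxGramSize="50" is an illustrative assumption, not the value from the original fix. -->
<analyzer type="index">
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="50" minGramSize="2"/>
</analyzer>
```

Because this changes index-time analysis, existing documents have to be reindexed before the longer grams become searchable.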

  2. Incorrect Email token on partial email addresses

The solr.UAX29URLEmailTokenizerFactory does not keep partial email addresses together as a single token on Solr 5.5.0. Running the query secondexample@e through the analysis screen in the Solr Admin UI gives:

UAXURLET
text                        secondexample                               e
raw_bytes                   [73 65 63 6f 6e 64 65 78 61 6d 70 6c 65]    [65]
start                       0                                           14
end                         13                                          15
positionLength              1                                           1
type                        <ALPHANUM>                                  <ALPHANUM>
position                    1                                           1

Even though this is the start of an email address, the tokenizer splits it at the @ and emits two <ALPHANUM> tokens rather than a single <EMAIL> token.

Since our requirement is prefix searching, changing the tokenizer to KeywordTokenizerFactory means that each value is kept whole as a single token, which we can then do prefix matches on.
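A minimal sketch of the resulting field type could look like this. It assumes the tokenizer is swapped in both the index and the query analyzer and that the existing filters are kept unchanged, neither of which is spelled out above. The maxGramSize="50" value is likewise an illustrative assumption.

```xml
<!-- Sketch only: KeywordTokenizerFactory emits each field value as one token,
     so the edge n-grams are prefixes of the whole value (e.g. a full email address). -->
<fieldType name="text_suggest_ngram" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <!-- maxGramSize is an assumed value; size it to the longest expected query -->
        <filter class="solr.EdgeNGramFilterFactory" maxGramSize="50" minGramSize="2"/>
    </analyzer>
    <analyzer type="query">
        <!-- No n-gram filter at query time: the query text is matched against the indexed prefixes -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
    </analyzer>
</fieldType>
```

The trade-off with this setup is that only prefixes of a whole value match. A word in the middle of a value (for example App in "Main App") would no longer be found on its own.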

On a side note, the analysis screen in the Solr Admin UI is a powerful tool for diagnosing these things (learnt something new).