I'm defining an index schema. One of the fields is "InvoiceNumber", which can hold values such as "459", "00459", or "P00459".

At indexing time, I want the text "00459" to be tokenized into two tokens: "459" and the original "00459".

Likewise, the text "P00459" should be tokenized into three tokens: "459", "00459", and the original "P00459".

Is there a way to define a custom analyzer for this?

1 Answer

A pattern_capture token filter configured with an appropriate regex can emit multiple tokens from the same text while preserving the original token.
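As a rough sketch of what such a configuration might look like, the Python snippet below builds the relevant part of an index definition as a dictionary and prints it as JSON. The analyzer name `invoice_number_analyzer`, the filter name `invoice_number_capture`, and both regex patterns are hypothetical and chosen only for illustration. The property names (`patterns`, `preserveOriginal`, `keyword_v2`, the `@odata.type` values) follow my reading of the Azure custom analyzer documentation and should be checked against it before use.

```python
import json

# Sketch of an index definition fragment. All names and patterns below are
# assumptions for illustration, not taken from the question.
index_definition = {
    "fields": [
        {
            "name": "InvoiceNumber",
            "type": "Edm.String",
            "searchable": True,
            "analyzer": "invoice_number_analyzer",  # hypothetical name
        }
    ],
    "analyzers": [
        {
            "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
            "name": "invoice_number_analyzer",
            # Keep the whole field value as a single token, so the
            # filter sees "P00459" rather than pieces of it.
            "tokenizer": "keyword_v2",
            "tokenFilters": ["invoice_number_capture"],
        }
    ],
    "tokenFilters": [
        {
            "@odata.type": "#Microsoft.Azure.Search.PatternCaptureTokenFilter",
            "name": "invoice_number_capture",  # hypothetical name
            "patterns": [
                "([0-9]+)",         # digit run with leading zeros: "00459"
                "0*([1-9][0-9]*)",  # digit run without leading zeros: "459"
            ],
            # Also emit the unmodified token, e.g. "P00459".
            "preserveOriginal": True,
        }
    ],
}

print(json.dumps(index_definition, indent=2))
```

The keyword tokenizer is used here so that the capture patterns run against the full invoice number; a different tokenizer may split the value before the filter ever sees it.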

https://docs.microsoft.com/en-us/azure/search/index-add-custom-analyzers
https://lucene.apache.org/core/4_10_3/analyzers-common/org/apache/lucene/analysis/pattern/PatternCaptureGroupTokenFilter.html

The latter link gives this example: the pattern "(https?://([a-zA-Z-_0-9.]+))", matched against the string "http://www.foo.com/index", returns the tokens "http://www.foo.com" and "www.foo.com", one per capture group. When preserve_original is enabled, the original token is emitted as well.
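To see how the same idea could apply to the invoice numbers in the question, here is a small Python simulation of capture-group tokenization. It is a simplified approximation of the filter's behaviour, not the Lucene implementation, and the two patterns are assumptions that would need adjusting to the real invoice formats.

```python
import re

# Hypothetical patterns for the invoice number formats in the question.
PATTERNS = [
    r"([0-9]+)",         # digit run with leading zeros kept: "00459"
    r"0*([1-9][0-9]*)",  # digit run without leading zeros: "459"
]


def capture_tokens(text, patterns=PATTERNS, preserve_original=True):
    """Approximate a pattern-capture filter applied to a single token.

    Emits the original token (if requested) followed by every non-empty
    capture group of every pattern match, skipping duplicates.
    """
    tokens = [text] if preserve_original else []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            for group in match.groups():
                if group and group not in tokens:
                    tokens.append(group)
    return tokens


print(capture_tokens("459"))     # ['459']
print(capture_tokens("00459"))   # ['00459', '459']
print(capture_tokens("P00459"))  # ['P00459', '00459', '459']
```

Trying the patterns this way before putting them into the index definition makes it easy to check edge cases such as an all-zero value, for which the second pattern produces no token.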