1
votes

When I query my data against "_all", Elasticsearch returns two documents (each containing only one field). But when I run the same query against the name of one of the fields in the returned documents instead of "_all", Elasticsearch returns nothing. This happens with a query_string query as well as the match query shown here. Any ideas why this is occurring and how to fix it?

This is the mapping:

{
  "analyzertestpatternsemi": {
    "mappings": {
      "content": {
        "properties": {
          "field": {
            "type": "string",
            "store": true,
            "term_vector": "with_positions_offsets",
            "index_analyzer": "analyzer_name"
          },
          "field2": {
            "type": "string",
            "store": true,
            "index_analyzer": "analyzer_name"
          }
        }
      }
    }
  }
}

These are the settings:

{
  "analyzertestpatternsemi": {
    "settings": {
      "index": {
        "uuid": "_W55phRKQ1GylWU5JleArg",
        "analysis": {
          "analyzer": {
            "whitespace": {
              "type": "custom",
              "fields": [
                "lowercase"
              ],
              "tokenizer": "whitespace"
            },
            "analyzer_name": {
              "preserve_original": true,
              "type": "pattern",
              "pattern": ";"
            }
          }
        },
        "number_of_replicas": "1",
        "number_of_shards": "5",
        "version": {
          "created": "1030299"
        }
      }
    }
  }
}

The Docs

{
  "_index": "analyzertestpatternsemi",
  "_type": "content",
  "_id": "3",
  "_version": 1,
  "found": true,
  "_source": {
    "field2": "Hello, I am Paul; George"
  }
}

and

{
  "_index": "analyzertestpatternsemi",
  "_type": "content",
  "_id": "2",
  "_version": 1,
  "found": true,
  "_source": {
    "field": "Hello, I am Paul; George"
  }
}

Getting the term vectors for the _id 2 document (the only one whose field has term vectors enabled) gives

george and hello, i am paul
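Those terms can be reproduced outside Elasticsearch. Here is a rough Python sketch (not Elasticsearch code) of what a pattern analyzer splitting on ";" does to the stored string; note that the second term keeps its leading space:

```python
import re

def pattern_analyze(text, pattern=";"):
    """Rough approximation of Elasticsearch's "pattern" analyzer:
    split on the pattern and lowercase each token (no trimming)."""
    return [t.lower() for t in re.split(pattern, text)]

print(pattern_analyze("Hello, I am Paul; George"))
# ['hello, i am paul', ' george']  <- note the leading space on ' george'
```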

The "_all" query

curl -XGET 'http://localhost:9200/analyzertestpatternsemi/_search' -d '
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "_all": {
              "query": "george",
              "type": "phrase"
            }
          }
        }
      ]
    }
  }
}'

The "_all" query results

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.4375,
    "hits": [
      {
        "_index": "analyzertestpatternsemi",
        "_type": "content",
        "_id": "2",
        "_score": 0.4375,
        "_source": {
          "field": "Hello, I am Paul; George"
        }
      },
      {
        "_index": "analyzertestpatternsemi",
        "_type": "content",
        "_id": "3",
        "_score": 0.13424811,
        "_source": {
          "field2": "Hello, I am Paul; George"
        }
      }
    ]
  }
}

Same query but searching in field: "field"

curl -XGET 'http://localhost:9200/analyzertestpatternsemi/_search' -d '
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "field": {
              "query": "george",
              "type": "phrase"
            }
          }
        }
      ]
    }
  }
}'

"field" query results

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

Same query but searching in field: "field2"

curl -XGET 'http://localhost:9200/analyzertestpatternsemi/_search' -d '
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "field2": {
              "query": "george",
              "type": "phrase"
            }
          }
        }
      ]
    }
  }
}'

"field2" query results

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

3 Answers

0
votes

The issue is that your "pattern" tokenizer splits the text into "hello, i am paul" and " george" (notice the whitespace before "george"). To be able to match "george" you need to get rid of that whitespace.

Here's one approach - define your own custom analyzer with a pattern tokenizer and a custom list of filters, where "trim" is the needed addition for trimming the whitespace before and after each token:

{
  "mappings": {
    "content": {
      "properties": {
        "field": {
          "type": "string",
          "store": true,
          "term_vector": "with_positions_offsets",
          "index_analyzer": "analyzer_name"
        },
        "field2": {
          "type": "string",
          "store": true,
          "index_analyzer": "analyzer_name"
        }
      }
    }
  },
  "settings": {
    "index": {
      "uuid": "_W55phRKQ1GylWU5JleArg",
      "analysis": {
        "analyzer": {
          "whitespace": {
            "type": "custom",
            "filter": [
              "lowercase"
            ],
            "tokenizer": "whitespace"
          },
          "analyzer_name": {
            "type": "custom",
            "tokenizer": "my_pattern_tokenizer",
            "filter": ["lowercase","trim"]
          }
        },
        "tokenizer": {
          "my_pattern_tokenizer": {
            "type": "pattern",
            "pattern": ";"
          }
        }
      },
      "number_of_replicas": 1,
      "number_of_shards": 5,
      "version": {
        "created": "1030299"
      }
    }
  }
}
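The effect of that analyzer chain can be sketched in Python (an approximation, not Elasticsearch itself): the pattern tokenizer splits on ";", then the "lowercase" and "trim" filters run over each token.

```python
import re

def analyze(text, pattern=";"):
    tokens = re.split(pattern, text)      # pattern tokenizer
    tokens = [t.lower() for t in tokens]  # "lowercase" filter
    tokens = [t.strip() for t in tokens]  # "trim" filter
    return tokens

print(analyze("Hello, I am Paul; George"))
# ['hello, i am paul', 'george']  <- 'george' now matches the query
```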

0
votes

I used the multi_field type to analyze and store the field in multiple ways. The documentation for it can be found here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html One analyzer can give you the tokens you want for a certain type of query or for aggregation, and the other can serve a different type of query on the same data.

I am not sure why the error mentioned in the original question occurred, but what I was trying to achieve was an analyzer that creates tokens with ";" as the break between tokens. I wanted this so that I could do Top Hits aggregations based on the tokens (the groupings of terms separated by ";"). But I wanted to be able to search/query the data with individual words (like the standard analyzer) rather than having to query an entire token (grouping of terms). To achieve this I defined the "type" for "field" and "field2" as "multi_field" and then defined two sub-fields. One sub-field used the "standard" analyzer and the other used "analyzer_name" (the custom pattern analyzer). The sub-field with the standard analyzer is the one queries run against, and the other (with "analyzer_name") is used for aggregations.
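A sketch of such a mapping, following the multi_field syntax from the linked 0.90 docs (the sub-field name "grouped" is illustrative, not from the original setup):

```
"field": {
  "type": "multi_field",
  "fields": {
    "field":   { "type": "string", "analyzer": "standard" },
    "grouped": { "type": "string", "index_analyzer": "analyzer_name" }
  }
}
```

Queries then target "field" (standard analysis) while aggregations target "field.grouped".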

0
votes

The issue is actually with the query. The two tokens stored are "hello, i am paul" and "george".

Adding the "trim" filter to the "analyzer_name" analyzer solved the issue of the query "george" not returning anything, because without the "trim" filter the stored term was actually " george".

The issue (noted in the Nov 6 comment by James on the Nov 5 answer by Adrei Stefan) with match queries not returning the document when the query was "hello", "paul", "hello i am paul", "Hello I am Paul", or "Hello, I am Paul" is explained below.

The issue here is with the query: a match query analyzes its input with the "standard" analyzer (the default) unless told otherwise. This means the query "hello" searches for the token "hello", but the stored token is actually "hello, i am paul"; likewise the query "hello i am paul" searches for the tokens "hello", "i", "am", and "paul", none of which match the tokens stored in the fields.
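The mismatch can be sketched in Python (a stand-in for the analyzers, not Elasticsearch itself), comparing the index-side tokens with what a standard-style analyzer makes of the query string:

```python
import re

def standard_like(text):
    # Very rough stand-in for the "standard" analyzer:
    # lowercase, then split on non-word characters.
    return [t for t in re.split(r"\W+", text.lower()) if t]

stored = ["hello, i am paul", "george"]    # index-side tokens (after trim)
query_tokens = standard_like("hello i am paul")
print(query_tokens)                        # ['hello', 'i', 'am', 'paul']
print(any(t in stored for t in query_tokens))  # False: nothing matches
```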

In this situation Elasticsearch will only return the document if the term it searches for is "george" or "hello, i am paul". The document is returned if you do a term query with either of these two tokens, or use them in a match query with the analyzer set to "keyword". You could also search "hello, i am paul", "george", "hello, i am paul; george", or any of those three with capital letters, if you set the analyzer to "analyzer_name".
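For example, a match query that skips query-side tokenization could look like this (a sketch following the same curl convention as above; "analyzer" is a standard match-query option):

```
curl -XGET 'http://localhost:9200/analyzertestpatternsemi/_search' -d '
{
  "query": {
    "match": {
      "field": {
        "query": "hello, i am paul",
        "analyzer": "keyword"
      }
    }
  }
}'
```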