How to search for an indexed phrase within a given query

Question

Given a freeform query from a user, I am trying to determine whether it contains a location phrase.

Example: Given the freeform query "new york style pizza in san francisco ca", and given an index of documents containing location phrases such as "denver co", "miami fl", "new york city ny", "san francisco ca", "paris france", etc., the match would be to the document containing the location phrase "san francisco ca".

The index containing the location phrases also contains allowable permutations, in separate documents. In the above example, I may have "san francisco ca", "san francisco california", and possibly others such as "sf ca", "bay area ca", and so forth, all as separate documents within the index. Casing and punctuation would be discarded up front, so the query "New York style PIZZA, in san francisco, ca" would become "new york style pizza in san francisco ca".
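To make the up-front cleanup concrete, this is roughly the normalization I have in mind. It is only a sketch in plain Python; the function name `normalize` is made up for illustration and is not part of my ES setup.

```python
import re


def normalize(query: str) -> str:
    """Lowercase the query and discard punctuation."""
    lowered = query.lower()
    # Treat anything that is neither a word character nor whitespace as punctuation.
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    # Collapse the extra whitespace left behind by the removed punctuation.
    return " ".join(no_punct.split())


print(normalize("New York style PIZZA, in san francisco, ca"))
```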

I should also mention, if there is a better or required way to index the locations to make this work for a given type of query, such as having the "city" and "state" and "country" in different fields, I can do that too, and I'm very open to suggestions.

What I've tried so far:

Plain old match query. Appears to work best, but ignores ordering... "san francisco ca" is a match, whereas "ca francisco san" should not match. Also ignores position.
Phrase matching. Does not work at all, because I get no matches due to the extra terms ("new york style pizza in") in the input query.
Multi-field match, cross_fields option. Same problem as match query; ignores ordering and position. This was attempted with a version of the index where "city" and "state" and so forth were different fields.
Percolating. Could not get to work at all. The call GET .../_percolate retrieves ALL documents in the index. Also, building the .percolator type was painfully slow and eventually crashed my instance (JVM memory 99%), while doing so with the bulk api. I have about 1M locations in my database and I think that's too many for percolator, which crashed consistently at around 120K locations. From what I've read, I don't think this is an appropriate use case for percolator, but not sure.

What I haven't tried, and why:

Shingles. The number of terms in a given location is variable (e.g. "dallas texas" vs "san francisco california" vs "new york city new york"), and shingles appear to work on a specific number of terms.
query_string. I don't want to require users to place phrases within double-quotes. I also don't want the query language (OR, AND, etc.). Also, ignores position.

I've spent 3-4 days banging away at this problem and would really appreciate some gentle guidance. Sample query/index/mappings would be great, but even just letting me know what type of query (and indexing and mapping) I should use would be tremendously helpful, so I can at least "bark up the right tree"!

I'm open to using other tools in combination with ES, as long as they're open-source, freely available, and reasonably well supported & used. The location database contains ~1M records.

BONUS: I'm making the assumption that the location phrase, if any, will be toward the end of the query. Some way to sense that or boost results accordingly would be great. Note I don't want to make this an absolute requirement; if a user submits the query "i want san francisco ca pizza places having new york style pizza" the only valid location phrase given the previously described index is "san francisco ca" and that should be the match.
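To illustrate the tie-break I mean (not a proposed solution for 1M locations), here is a small Python sketch over an in-memory list. The helper names are hypothetical. It looks for location phrases that occur as contiguous, ordered tokens in the query and prefers the one ending furthest to the right.

```python
def find_phrase(tokens, phrase_tokens):
    """Return every start index where phrase_tokens occurs contiguously in tokens."""
    n = len(phrase_tokens)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase_tokens]


def pick_location(query, locations):
    """Pick the matching phrase that ends closest to the end of the query."""
    tokens = query.split()
    best = None  # (end index, phrase length, phrase)
    for phrase in locations:
        phrase_tokens = phrase.split()
        for start in find_phrase(tokens, phrase_tokens):
            candidate = (start + len(phrase_tokens), len(phrase_tokens), phrase)
            # Later end position wins; on equal end position the longer phrase wins.
            if best is None or candidate[:2] > best[:2]:
                best = candidate
    return best[2] if best else None


locations = ["denver co", "miami fl", "new york city ny", "san francisco ca", "paris france"]
print(pick_location("new york style pizza in san francisco ca", locations))
print(pick_location("i want san francisco ca pizza places having new york style pizza", locations))
```

In the second query the only indexed phrase present is "san francisco ca", so it is returned even though it is not at the end, which is the behavior I am after.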

BONUS 2X: I have the population information for each location. Some way to boost result slightly for higher population would be great too (I've tried function_score with field_value_factor function and ln1p modifier, and it appears to work well, but not sure how that would work if I end up using percolator).
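For reference, this is the arithmetic I understand field_value_factor to perform with the ln1p modifier, sketched in plain Python. It assumes a factor of 1 and that the function result is multiplied into the text score; the population figures are made up for illustration.

```python
import math


def boosted_score(text_score, population, factor=1.0):
    """Multiply the text score by ln(1 + factor * population)."""
    return text_score * math.log1p(factor * population)


# Hypothetical populations: a large city versus a small town with the same text score.
big = boosted_score(1.0, 1_000_000)
small = boosted_score(1.0, 10_000)
print(big > small)
print(boosted_score(1.0, 0))
```

The logarithm keeps the boost gentle: a hundredfold difference in population changes the multiplier by well under a factor of two. Under these assumptions a population of 0 multiplies the score down to 0, so missing values would need a default.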

BONUS 3X!: Accommodating slight typos, for example "san francsco" would be great.

I'm using ElasticSearch 1.3.2.

THANK YOU!!

EDIT: Just to be crystal clear, I am looking for a phrase search, when the indexed phrase is shorter than the query, as nicely described here, but unfortunately not fully solved:

Solr: Phrase search when indexed phrase is shorter than the query

What's the thing with the percolator? You have documents in your index that you want to match when given a certain query. It doesn't have anything to do with the percolator! With percolator you index queries and send documents to ES; ES will give you back the queries that matched. — Andrei Stefan
Your first "BONUS" requirement is a bit exaggerated, imo. You are, basically, asking ES to "understand" the query: usually locations are at the end of the query, buuut if two locations are both at the end and at the start of the query then reject the one at the end and choose the one at the start of the query :-). This sounds like an AI (artificial intelligence) like search engine. — Andrei Stefan
Even if, assuming, ES is "smart" enough to "sense" these subtleties in the English language written query, it will still not work because you basically want to match documents to your query, not vice-versa. Also, my question above about percolator still stands. — Andrei Stefan
Hi Andrei, thanks for the comments. The reasoning behind considering percolator is that I would index queries such as "san francisco ca", "denver co", etc., and the input freeform query would be the document to submit to percolator such as "new york pizza in san francisco ca". In any case I think it's moot because percolator doesn't appear to scale to more than 100K or so queries, unless I'm doing something wrong. — yahermann
As to ES understanding the query, and the "bonus" requirement: I don't expect AI to understand English. In the event of two or more location phrases matching the input query, I am making the assumption that the correct location query will be the furthest toward the end, hence the ask for the bonus. This assumption eliminates the need to parse, understand, "sense-make" the English language, and therefore simplifies the problem. — yahermann

Andrei Stefan · Accepted Answer · 2014-11-07T10:03:38

Here are some suggestions, though I am not certain I have understood your requirements correctly.

The basic idea is to manipulate what you put in your index (locations), since you want to match something larger than what you actually store in your documents. Also, I want to emphasize that I don't think this will be a black-and-white situation where you either get one (CORRECT) answer or no answer at all. There will always be a "score" for matches.

Another point is that you need to decide how to manipulate your locations based on the queries you predict people will use, so that the manipulations help in most cases (not all cases). Put differently, the combination of the indexed locations and the manipulations you apply to them should raise your chances of matching most queries.

Here are some concrete ideas:

Use shingles. I believe this is the only option for you to have the notion of ordered terms. You said that you have a free form query. This means you want to send the query as-is, without splitting it into terms, removing stop words, or anything like that. This means you cannot use span_near, which could give you ordering.

With shingles, input such as "ca francisco san" no longer matches on the two-word shingles ("san francisco", "francisco ca"); at most it matches on single words when unigrams are emitted.
First location manipulation idea: besides the shingles above, also store the full location name. This gives a slightly higher score to queries that match your location documents entirely. And since, judging from your examples, you index multiple variants of each location, there is a good chance that the "quality" of your "location" index will give you good matching results.

{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true // this is true for situations where you have "paris france" in locations but user searches for "paris"
        }
      },
      "analyzer": {
        "my_shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_shingle_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "locations": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "my_shingle_analyzer",
          "fields": {
            "full": {
              "type": "string",
              "analyzer": "keyword"
            }
          }
        }
      }
    }
  }
}
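To see why two-word shingles give you ordering even though locations have a variable number of terms, here is a rough Python approximation of what the analyzer above emits. It is a sketch only: it splits on whitespace instead of using the standard tokenizer, returns a set rather than a positioned token stream, and the function name is made up.

```python
def shingles(text, size=2, output_unigrams=True):
    """Approximate a shingle filter: fixed-size word n-grams, optionally plus single words."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]
    return set(tokens + grams) if output_unigrams else set(grams)


location = shingles("san francisco ca")
good = shingles("new york style pizza in san francisco ca")
scrambled = shingles("pizza in ca francisco san")

# Tokens shared between the indexed location and each query.
print(sorted(location & good))
print(sorted(location & scrambled))
```

The well-ordered query shares the single words and both two-word shingles with the location, while the scrambled one shares only the single words, so it should score lower.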

Use mapping transformation to improve the quality of your locations index. That is, the manipulations I mentioned above add extra fields to your index (just like name.full above), based on predictions about the queried terms.

The first example is derived from one of your query samples: "new york style pizza in san francisco ca". For each location in your index, add another field that carries the "in" prefix: "in san francisco", "in new york", etc.

"transform": [
        {
        "script": "full_plus_in = 'in ' + ctx._source['name']; ctx._source['name.full_plus_in'] = full_plus_in",
        "lang": "groovy"
        }
...

The second example adds a "places" suffix in a new field in your mapping, assuming that queries like "san francisco places for new york style pizza" are frequent in your predictions:

{"script": "full_plus_places = ctx._source['name'] + ' places'; ctx._source['name.full_plus_places'] = full_plus_places",
        "lang": "groovy"}

Putting it all together here is a preliminary mapping:

{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": true
        }
      },
      "analyzer": {
        "my_shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_shingle_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "locations": {
      "transform": [
        {
        "script": "full_plus_in = 'in ' + ctx._source['name']; ctx._source['name.full_plus_in'] = full_plus_in",
        "lang": "groovy"
        },
        {"script": "full_plus_places = ctx._source['name'] + ' places'; ctx._source['name.full_plus_places'] = full_plus_places",
        "lang": "groovy"}
        ],
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "my_shingle_analyzer",
          "fields": {
            "full": {
              "type": "string",
              "analyzer": "keyword"
            },
            "full_plus_in": {
              "type": "string",
              "analyzer": "keyword"
            },
            "full_plus_places": {
              "type": "string",
              "analyzer": "keyword"
            }
          }
        }
      }
    }
  }
}
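As a plain-Python illustration of what the two transform scripts in the mapping above compute for each document (the function name is hypothetical, and this runs outside ES):

```python
def derive_fields(name):
    """Mirror the two transform scripts: an 'in ' prefix and a ' places' suffix."""
    return {
        "name": name,
        "name.full_plus_in": "in " + name,
        "name.full_plus_places": name + " places",
    }


for field, value in derive_fields("san francisco ca").items():
    print(field, "=>", value)
```

Because these fields use the keyword analyzer, each derived value is indexed as a single token, so they reward input that equals the derived string as a whole.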

Test data:

{"index":{}}
{"name":"denver co"}
{"index":{}}
{"name":"miami fl"}
{"index":{}}
{"name":"new york city ny"}
{"index":{}}
{"name":"san francisco ca"}
{"index":{}}
{"name":"paris france"}
{"index":{}}
{"name":"bay area ca"}
{"index":{}}
{"name":"dallas texas"}
{"index":{}}
{"name":"san francisco california"}
{"index":{}}
{"name":"new york city new york"}
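The lines above are the body of a bulk request: one action line followed by one source line per document. Since the action lines are empty, the index and type are assumed to come from the request URL. A small sketch for generating such a body from a list of names (the helper name is made up):

```python
import json


def bulk_body(names):
    """Build a newline-delimited bulk body: an action line, then a source line, per name."""
    lines = []
    for name in names:
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps({"name": name}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


print(bulk_body(["denver co", "miami fl"]), end="")
```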

Sample query:

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "i want san francisco ca places having new york style pizza"
          }
        }
      ],
      "should": [
        {"match": {
          "name.full": "i want san francisco ca places having new york style pizza"
        }},
        {"match": {
          "name.full_plus_in": "i want san francisco ca places having new york style pizza"
        }},
        {"match": {
          "name.full_plus_places": "i want san francisco ca places having new york style pizza"
        }}
      ]
    }
  }
}
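Since the same user input goes into every clause, it may be easier to build the request body in code than to paste the text four times. A minimal sketch in Python, with a made-up helper name, that only constructs the body and does not send it:

```python
import json


def build_query(user_query):
    """Build the bool query: required match on shingles, optional matches on the full-name fields."""
    boost_fields = ["name.full", "name.full_plus_in", "name.full_plus_places"]
    return {
        "query": {
            "bool": {
                "must": [{"match": {"name": user_query}}],
                "should": [{"match": {field: user_query}} for field in boost_fields],
            }
        }
    }


body = build_query("i want san francisco ca places having new york style pizza")
print(json.dumps(body, indent=2))
```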

And the first matching location should be the best (considering the score it got).
