Given a freeform query from a user, I am trying to determine whether it contains a location phrase.
Example: Given the freeform query "new york style pizza in san francisco ca", and given an index of documents containing location phrases such as "denver co", "miami fl", "new york city ny", "san francisco ca", "paris france", etc., the match should be the document containing the location phrase "san francisco ca".
The index containing the location phrases also contains allowable permutations, in separate documents. In the above example, I may have "san francisco ca", "san francisco california", and possibly others such as "sf ca", "bay area ca", and so forth, all as separate documents within the index. Casing and punctuation would be discarded up front, so the query "New York style PIZZA, in san francisco, ca" would become "new york style pizza in san francisco ca".
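To make the up-front normalization concrete, here is a minimal Python sketch of the step described above. The function name `normalize` and the exact rules (lowercase, replace punctuation with spaces, collapse whitespace) are illustrative assumptions, not a fixed part of my pipeline.

```python
import re

def normalize(query):
    """Lowercase the query, drop punctuation, and collapse whitespace."""
    lowered = query.lower()
    # Replace anything that is not a word character or whitespace with a space.
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    # split() with no arguments also collapses runs of whitespace.
    return " ".join(no_punct.split())

print(normalize("New York style PIZZA, in san francisco, ca"))
```

The same normalization would be applied to the location phrases before indexing, so that both sides are compared in the same form.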
I should also mention that if there is a better (or required) way to index the locations to make this work for a given type of query, such as having the "city", "state", and "country" in different fields, I can do that too. I'm very open to suggestions.
What I've tried so far:
- Plain old match query. Appears to work best, but ignores ordering: "san francisco ca" is a match, whereas "ca francisco san" should not match but does. Also ignores the position of the terms within the query.
- Phrase matching. Does not work at all, because I get no matches due to the extra terms ("new york style pizza in") in the input query.
- Multi-field match, cross_fields option. Same problem as match query; ignores ordering and position. This was attempted with a version of the index where "city" and "state" and so forth were different fields.
- Percolating. Could not get it to work at all. The call GET .../_percolate retrieves ALL documents in the index. Also, building the .percolator type with the bulk API was painfully slow and eventually crashed my instance (JVM memory at 99%). I have about 1M locations in my database, and I think that's too many for the percolator, which crashed consistently at around 120K locations. From what I've read, I don't think this is an appropriate use case for the percolator, but I'm not sure.
What I haven't tried, and why:
- Shingles. The number of terms in a given location is variable (e.g. "dallas texas" vs. "san francisco california" vs. "new york city new york"), and shingles appear to work on a specific number of terms.
- query_string. I don't want to require users to place phrases within double-quotes. I also don't want the query language (OR, AND, etc.). Also, ignores position.
I've spent 3-4 days banging away at this problem and would really appreciate some gentle guidance. Sample query/index/mappings would be great, but even just letting me know what type of query (and indexing and mapping) I should use would be tremendously helpful, so I can at least "bark up the right tree"!
I'm open to using other tools in combination with ES, as long as they're open-source, freely available, and reasonably well supported & used. The location database contains ~1M records.
BONUS: I'm making the assumption that the location phrase, if any, will be toward the end of the query. Some way to sense that or boost results accordingly would be great. Note I don't want to make this an absolute requirement; if a user submits the query "i want san francisco ca pizza places having new york style pizza" the only valid location phrase given the previously described index is "san francisco ca" and that should be the match.
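To illustrate the behavior I'm after, here is a small Python sketch that works outside of ES on an in-memory set. It is not a proposed solution for 1M locations. The names `LOCATIONS` and `find_location`, the `max_words` limit, and the tie-breaking rule (longest phrase wins, a later position breaks ties) are all assumptions made for the example.

```python
# Tiny stand-in for the location index (already normalized).
LOCATIONS = {
    "denver co",
    "miami fl",
    "new york city ny",
    "san francisco ca",
    "paris france",
}

def find_location(query, phrases=LOCATIONS, max_words=5):
    """Return the indexed phrase that appears as a contiguous, in-order
    run of words in the query, or None if there is no such phrase."""
    words = query.split()
    best_key, best_phrase = None, None
    for start in range(len(words)):
        last_end = min(len(words), start + max_words)
        for end in range(start + 1, last_end + 1):
            candidate = " ".join(words[start:end])
            if candidate in phrases:
                # Prefer longer phrases; among equal lengths, prefer the
                # one that ends later in the query.
                key = (end - start, end)
                if best_key is None or key > best_key:
                    best_key, best_phrase = key, candidate
    return best_phrase

print(find_location("new york style pizza in san francisco ca"))
print(find_location("i want san francisco ca pizza places having new york style pizza"))
print(find_location("pizza in ca francisco san"))
```

Both of the first two queries should resolve to "san francisco ca", and the reordered one should not match at all. That is the ordering and position behavior I can't get from a plain match query.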
BONUS 2X: I have the population information for each location. Some way to boost results slightly for higher population would be great too (I've tried function_score with the field_value_factor function and the ln1p modifier, and it appears to work well, but I'm not sure how that would work if I end up using the percolator).
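For reference, this is my understanding of what the population boost computes, written as a plain Python sketch rather than an ES query. It assumes the text score is multiplied by ln(1 + factor * population); the function name `boosted_score` and the `factor` default are illustrative.

```python
import math

def boosted_score(text_score, population, factor=1.0):
    """Scale a text-match score by ln(1 + factor * population)."""
    return text_score * math.log1p(factor * population)

# A larger population gives a larger, but only logarithmically larger, score.
print(boosted_score(1.0, 8000))
print(boosted_score(1.0, 800000))
```

Under this assumption a population of 0 multiplies the score down to 0, which is something to watch for if some locations have no population data.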
BONUS 3X!: Accommodating slight typos, for example "san francsco" would be great.
I'm using Elasticsearch 1.3.2.
THANK YOU!!
EDIT: Just to be crystal clear, I am looking for a phrase search where the indexed phrase is shorter than the query, as nicely described (but unfortunately not fully solved) here:
Solr: Phrase search when indexed phrase is shorter than the query