4
votes

I'm writing a GAE-based application that should allow users to filter items by several of their properties. Items are stored as NDB entities. Some of the props can be matched by standard query filters, but some require a "full" (substring) text search for the whole thing to make sense. Additionally, some sensible ordering is required. It's perhaps best illustrated with the following contrived example:

from google.appengine.ext import ndb

class Product(ndb.Model):
  manufacturer = ndb.StringProperty()
  model = ndb.StringProperty()
  rating = ndb.IntegerProperty(choices=[1, 2, 3, 4])
  features = ndb.StringProperty(repeated=True, choices=['feature_1', 'feature_2'])
  is_very_expensive = ndb.BooleanProperty()
  categories = ndb.KeyProperty(kind=Category, repeated=True)

Product entities all have the same ancestor as their "container". A product can belong to one or more categories and the latter form a tree.

Now, users should be able to:

  • Narrow down products by selecting a category (single one will suffice)
  • Filter them by specifying a minimal rating and desired features
  • View exclusively products that are very expensive or those that are not (or view all)
  • Search for products by a piece of text from model and/or manufacturer fields
  • Have the final list ordered, e.g. by model name (the ability to pick an ordering would be ideal, though).

All of this at the same time, i.e. filters and ordering should be seamlessly applied when search terms are provided.
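To make the combined semantics concrete, here is a plain-Python sketch of the predicate a product must satisfy; in production these conditions would map onto NDB query filters and Search API terms rather than in-memory checks, and the dict field names are assumptions mirroring the model above:

```python
def matches(product, category=None, min_rating=None, features=(),
            very_expensive=None, text=None):
    """Return True if a product dict passes every active filter.

    All parameters are optional; None/empty means "don't filter on this".
    """
    if category is not None and category not in product["categories"]:
        return False
    if min_rating is not None and product["rating"] < min_rating:
        return False
    # Repeated property semantics: the product must have ALL wanted features.
    if any(f not in product["features"] for f in features):
        return False
    if very_expensive is not None and product["is_very_expensive"] != very_expensive:
        return False
    if text is not None:
        # Substring match over manufacturer + model; this is the part
        # plain NDB filters cannot express, hence the question.
        hay = (product["manufacturer"] + " " + product["model"]).lower()
        if text.lower() not in hay:
            return False
    return True
```

Ordering would then be applied over the surviving items, e.g. `sorted(results, key=lambda p: p["model"])`.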

The question is: how can such functionality be achieved on GAE in a performant way?

There are going to be hundreds of thousands, or perhaps millions, of products in the database. The problem with the Search API, when used together with NDB queries, is filtering the search results and then ordering them.

Two solutions I've been thinking of:

  1. Add a repeated StringProperty to the Product model that would contain all searchable substrings (or at least prefixes) of words from the manufacturer and model fields. It's easy and it works, but I'm seriously concerned about performance. In my experiments I got on average 40-50 searchable word prefixes for each Product.

  2. Use the Search API exclusively for the task, utilizing advanced search queries. E.g. I can store a product's categories (as IDs or paths) in a separate document field and use this field to obtain products belonging to a given category. It probably can be done, but what concerns me here is the limit of 10,000 search results and various usage limitations/quotas. I'm also not sure about the ordering of results.
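For reference, the prefix generation from solution 1 can be sketched like this (a minimal, hypothetical helper; the minimum prefix length is an assumed tuning knob):

```python
def word_prefixes(text, min_len=2):
    """Return all lowercased prefixes (length >= min_len) of each word in text.

    The result would be stored in a repeated StringProperty so that
    standard equality filters can emulate prefix search.
    """
    prefixes = set()
    for word in text.lower().split():
        for i in range(min_len, len(word) + 1):
            prefixes.add(word[:i])
    return prefixes

# word_prefixes("Acme Phone") yields
# {'ac', 'acm', 'acme', 'ph', 'pho', 'phon', 'phone'}
```

For a typical manufacturer-plus-model string this easily produces the 40-50 values per entity mentioned above, each of which becomes a separate index record.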

Are there any other ways?

1
Search for a single piece of text, or combinations? Why do you have a performance concern about solution 1? - voscausa
This solution means checking for an occurrence of each search word in a repeated StringProperty for each item. I'm simply not sure how good Google is at indexing such properties, especially when they're somewhat lengthy. After all, this kind of search is said to be best done with the Search API, not NDB. - hauru
From the docs. A separate index record is created for each value. - voscausa

1 Answer

3
votes

I'd strongly suggest not using GAE for this. I know it's probably not what you want to hear, but it does not match your use case very well, nor does it offer the flexibility I think you will want from a product search. It sounds like what you really want is something closer to a faceted search.

Here are the reasons why GAE is not a good match:

  1. You will very quickly hit cases of index explosion, or otherwise risk heavy performance degradation due to zig-zag merge queries, if you use NDB. This article tries to make a case for what you're doing, but in practice we've found at my day job that it doesn't scale beyond small datasets and few fields. The more ordering you introduce alongside the other filters, the more problems this will give you, not to mention if you start needing multiple inequalities with sorts.

  2. GAE full-text search is slow and limited compared to other offerings. The query language itself is not as mature or flexible, IMO. It's also not very cost/quota friendly. You mentioned you are concerned about quotas, and search eats through them easily.

  3. The substring approach will inflate the size of each record you save. Django-nonrel has an indexer package that does exactly this, and it's not pretty. Not sure if you're using Django, but in any case you could adapt the code, as it's open source. Inflating the record size is bad because, unless you're using projection queries or keys-only queries, you will be sending back a lot of unnecessary data over the wire.
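The index explosion in point 1 is easy to quantify: composite index entries for an entity grow multiplicatively with the value counts of the repeated properties involved. A toy illustration (the specific counts are assumptions based on the numbers in the question):

```python
def composite_index_entries(*value_counts):
    """Number of composite index rows written for ONE entity when a
    composite index spans repeated properties with the given value counts."""
    n = 1
    for count in value_counts:
        n *= count
    return n

# E.g. 50 search prefixes x 5 categories x 2 features in one composite index:
composite_index_entries(50, 5, 2)  # 500 index rows for a single entity
```

Every additional repeated property in the same index multiplies that figure again, which is why the datastore caps exploded indexes and why writes get expensive.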

Instead, I recommend you push your data to a data store that's more optimized for these kinds of queries. Here's an example outline of an architecture:

  • A search server on Google Compute Engine, to reduce latency to App Engine. Not sure if there's a way to get them into the same geographic location, but I would suspect you're better off hosting it there than on Amazon in terms of latency. Obviously you might lose some speed here, but it may still be faster than the built-in GAE full-text search.

  • If you need a cluster to scale, you could use ElasticSearch. Be aware that you will need to set up multicast properly on Google Compute Engine if you do this; ElasticSearch provides a plug-in for it.

  • Create a background process that uses a push or pull queue depending on your data volume to update your search indexes. The frequency will depend on how "fresh" you need your data. The decision for push vs. pull will depend highly on your volume, but I'd recommend using a pull queue here with dedicated servers pushing to your search provider. You would have to do this anyway using the built-in full-text search.

  • Create a map reduce job that will push all the data to the search index. This is useful for both seeding the initial queue and for a periodic "refresh."
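The indexing step the queue workers perform could look something like this sketch, which builds an ElasticSearch `_bulk` request body from product dicts; `es_bulk_body`, the index name, and the field names are illustrative, and the actual HTTP POST to the cluster is left out:

```python
import json

def es_bulk_body(index_name, products):
    """Build an ElasticSearch _bulk request body (NDJSON) from product dicts.

    Each product dict is expected to carry an "id" key used as the
    document _id; all other keys become the document source.
    """
    lines = []
    for p in products:
        # Action/metadata line, then the document source line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": p["id"]}}))
        lines.append(json.dumps({k: v for k, v in p.items() if k != "id"}))
    return "\n".join(lines) + "\n"
```

A pull-queue worker would lease a batch of changed entity keys, fetch the entities, and POST the resulting body to the cluster's `_bulk` endpoint, keeping datastore writes and search indexing decoupled.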

The downside to the above is that you will dramatically increase the number of URL fetch calls you make, and the data may not always be fresh. The latter is normal in most search situations; the former is probably still cheaper than the built-in full-text search, depending on your volume. If the data rarely changes, you can do things like dump to Google Cloud Storage and import from there instead, which is cheaper.