Comparison between using two Models and using one Model with entities with two ancestors in GAE NDB Python (design for an amazon.com-like website)

Question

I use GAE NDB Python

Approach 1:

# both models below have similar properties (same number and type) 
class X1(ndb.Model): 
    p1 = ndb.StringProperty() 
    :: 

class X2(ndb.Model): 
    p1 = ndb.StringProperty() 
    :: 

def get(self): 
    q = self.request.get("q") 
    w = self.request.get("w") 
    record_list = [] 
    if (q=="a"): 
        qry = X1.query(X1.p1==w) 
        record_list = qry.fetch() 
    elif (q=="b"): 
        qry = X2.query(X2.p1==w) 
        record_list = qry.fetch()

Approach 2:

class X1(ndb.Model): 
    p1 = ndb.StringProperty() 
    :: 

def get(self): 
    q = self.request.get("q") 
    w = self.request.get("w") 
    if (q=="a"): 
        k = ndb.Key("type_1", "k1") 
    elif (q=="b"): 
        k = ndb.Key("type_2", "k1") 
    # positional filters must come before the ancestor keyword argument
    qry = X1.query(X1.p1 == w, ancestor=k)
    record_list = qry.fetch()

My Questions:

Which approach is better in terms of query performance when I scale up the number of entities?

Would there be a significant impact on query performance if I scale up the number of ancestors (horizontally, at the same hierarchy level) to 10,000 or 100,000 in approach 2?

Is this application a correct use case for ancestors?

Context:

This project is for understanding GAE better, and the goal is to create an e-commerce website like amazon.com where I need to query on many (about 10) filter conditions (price range, brand, screen size, and so on). Each filter condition has a few ranges (for example, there could be five price bands), and multiple ranges of a filter condition can be selected simultaneously. Multiple filter conditions can also be selected together, just like in the left pane on amazon.com.

If I put all the filter conditions into the query as one expression connected by AND and OR, it would take a huge amount of time on scaled data sets, even if I use a query cursor and fetch page by page.

To overcome this, I thought I would store the data in entities whose parent is a string. The parent would be a concatenation of the different filter options that the product matches. There would be a lot of redundancy, as I would store the same data in several entities, one for each combination of filter values that the product satisfies. The disadvantage of this approach is that each product's data is stored multiple times in different entities (much more storage), but I was hoping to get much better query performance (<2 seconds), since my query would now contain only one or two AND- or OR-connected elements apart from the ancestor. The ancestor would be the concatenation of the filter conditions that the user has selected when searching for a product.

Please let me know if I am not clear. This is just an experimental approach that I am trying. Another approach would be to cache the results periodically through a cron job.

Any other suggestions for achieving good query performance for such a website would be highly appreciated.

UPDATE(NEW STRATEGY):

I have decided to go with a model that has boolean properties (flags) for each range of each category (about 14 such properties per entity in total). For one category, which has two possible values, I have three models: one holding all entities with either of the two values, and the other two holding the entities with each individual value. So there is duplication (the same data can be stored twice, in two entities). My complete product data model is a separate one, and the model above contains a key to this complete model.
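As a rough illustration of the flag idea, the sketch below turns a numeric value into one boolean per range, so that a range filter can be expressed as an equality test on a flag. The band boundaries, the flag names and the `band_flags` helper are all hypothetical and only stand in for whatever ranges the real model uses.

```python
# Hypothetical price bands: (flag name, lower bound inclusive, upper bound exclusive).
PRICE_BANDS = [
    ("price_0_100", 0, 100),
    ("price_100_200", 100, 200),
    ("price_200_up", 200, float("inf")),
]


def band_flags(value, bands):
    """Return one boolean per band, True for the band that contains value."""
    return {name: low <= value < high for name, low, high in bands}


flags = band_flags(150, PRICE_BANDS)
print(flags)
```

Each flag would then map to one boolean property on the entity, computed once at write time.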

I could not do away with the Query class and write my own filtering (I actually did that with good success initially). The reason is that I need to fetch results page by page (~15 results per page), and I need to sort them too. If I fetch all results and apply my own filtering, then with a large data set fetching everything takes a huge amount of time because of the size of the results returned.

The initial development server results look good: query execution time is <3 seconds for ~6000 matched entities (though I had hoped for ~1 second). I need to scale up the production datastore to test there.

What are you trying to do? something more expressive than X1 and X2 would help — Michael Técourt
@MichaelakoTecourt I have added the Context section in my question.. pl let me know any other needed information.. — gsinha
I'm kind of surprised by the lack of support by Google's app engine team on datastore modeling questions. This is the kind of classic design issue that has people not jumping on the GAE datastore band wagon right away. Some samples or blog posts would do. I might give it a shot — Michael Técourt

Michael Técourt · Accepted Answer · 2014-03-27T17:15:09

EDIT after context definition:

Tough subject there. There are plenty of datastore limitations that can get in your way:

Write throughput (1 write/sec per Entity Group)
Query inequality filters limit
Cross entity group transactions at write time (duplicating your product in each "query filter" specific entity group )
Max entity size (1MB) if you duplicate whole products for every "query filter" entity
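To get a feel for the duplication behind the last two points, here is a small sketch that enumerates the concatenated "filter key" strings a single product would be stored under if every non-empty combination of its matching filter options became its own parent. The key format and the `filter_key_combinations` helper are assumptions made for illustration, not something taken from the question's code.

```python
from itertools import combinations


def filter_key_combinations(matched_options):
    """Return every concatenated key a product would be stored under.

    matched_options maps a filter name to the single option the product
    falls into, e.g. {"brand": "acme", "price": "100-200"}.
    """
    items = sorted(matched_options.items())  # fixed order keeps keys stable
    keys = []
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            keys.append("|".join("%s=%s" % (name, option) for name, option in subset))
    return keys


product = {"brand": "acme", "price": "100-200", "screen": "15in"}
keys = filter_key_combinations(product)
print(len(keys))  # 7 copies for 3 filters
```

With n filters there are 2**n - 1 non-empty combinations, so 10 filters would already mean 1023 copies per product, and that is before allowing several ranges of the same filter to be selected at once.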

I don't have any "ready made" answer, just some humble advice based on common sense.

In my opinion your first solution will get overly complex as you add new filtering criteria, types of products, etc.

The problem with the datastore, and most "NoSQL" solutions, is that they tend to have few analytic/query features out of the box (they are not at the maturity level of RDBMS that have evolved for years), forcing you to compute results "by hand".

For your case, I don't see anything out of the box, and the "datastore query engine" is clearly not enough for such queries. Keep your data quite simple though, just store your products as entities with properties. If you have clearly different product categories, you may store them as different entity kinds -> I highly doubt people will run a "brand" query for both "shoes" and "food".

You will have to run a datastore query within those limitations to quickly get a rough result set, then refine it by hand (map reduce job, async task, ...) and cache the result for as long as you can.
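The "rough query, refine by hand, then cache" flow could look something like the sketch below. It is plain Python with no datastore calls: `coarse_query` is a hypothetical callable standing in for whatever limited datastore query you can afford, and the in-process dict is only a stand-in for a real cache (on App Engine, memcache would probably be the more natural home).

```python
import time

_cache = {}  # cache key -> (expiry timestamp, results)


def cache_key(selected):
    """Normalise the selected filters so equivalent selections share a key."""
    return tuple(sorted((name, tuple(sorted(values))) for name, values in selected.items()))


def refine(candidates, selected):
    """Keep products whose value for every selected filter is one of the chosen options."""
    return [p for p in candidates
            if all(p.get(name) in values for name, values in selected.items())]


def search(coarse_query, selected, ttl_seconds=300, now=time.time):
    """Return cached results if still fresh, otherwise query, refine and cache."""
    key = cache_key(selected)
    hit = _cache.get(key)
    if hit and hit[0] > now():
        return hit[1]
    results = refine(coarse_query(), selected)
    _cache[key] = (now() + ttl_seconds, results)
    return results


products = [
    {"name": "A", "brand": "acme", "price_band": "100-200"},
    {"name": "B", "brand": "acme", "price_band": "200-300"},
    {"name": "C", "brand": "other", "price_band": "100-200"},
]
selected = {"brand": ["acme"], "price_band": ["100-200", "200-300"]}
results = search(lambda: products, selected)
print([p["name"] for p in results])
```

Options within one filter are OR-ed and different filters are AND-ed, which matches the left-pane behaviour described in the question. Normalising the key means the same selection made in a different order still hits the cache.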

-> Your aggressive caching solution looks far better from a performance, cost and maintainability standpoint.

You won't be able to cache your whole product base, and some queries for rarities will take longer... like I said, I don't see any perfect answers here, just different tradeoffs for performance.

Just my 2 cents :) I'm curious to see which solution you end up adopting.

You typically use ancestors for data that is owned by an entity.

For example:

A Book is your root entity, and it "owns" Page entities. A Page without a Book is meaningless. Book is the ancestor of Page.

A User is your root entity, and it "owns" BlogPost entities. A BlogPost without its writer is quite meaningless. User is the ancestor of BlogPost.

If your two entities X1 and X2 share the same attributes, I'd say they are the same X entity, with just an additional "type" attribute to determine whether you're talking about X type 1 or X type 2.
