I use GAE NDB Python
Approach 1:
# both models below have similar properties (same number and type)
class X1(ndb.Model):
p1 = ndb.StringProperty()
::
class X2(ndb.Model):
p1 = ndb.StringProperty()
::
def get(self):
q = self.request.get("q")
w = self.request.get("w")
record_list = []
if (q=="a"):
qry = X1.query(X1.p1==w)
record_list = qry.fetch()
elif (q=="b"):
qry = X2.query(X2.p1==w)
record_list = qry.fetch()
Approach 2:
class X1(ndb.Model):
p1 = ndb.StringProperty()
::
def get(self):
q = self.request.get("q")
w = self.request.get("w")
if (q=="a"):
k = ndb.Key("type_1", "k1")
elif (q=="b"):
k = ndb.Key("type_2", "k1")
qry = X1.query(ancestor=k, X1.p1==w)
record_list = qry.fetch()
My Questions:
Which approach is better in terms of query performance when I scale up the entities
Would there be significant impact on query performance if I scale up the ancestors (in the same hierarchy level horizontally) to 10,000 or 1,00,000 in approach 2
Is this application the correct use case for ancestor
Context:
This project is for understanding GAE better and the goal is to create an ecommerce website like amazon.com where I need to query based on a lot many(10) filter conditions(like, price range, brand, screen size, and so on). Each filter condition has few ranges(like, there could be five price bands); multiple ranges of a filter condition could be selected simultaneously. Multiple filter conditions could be selected just like on amazon.com left pane.
If I put all the filter conditions in the query in the form of AND, OR connected expression, it would take huge amount of time for scaled data sets even if I use query cursor and fetch by page.
To overcome this, I thought I would store the data in entities with parent as a string. The parent would be a cancatenation of the the different filters options which the product matches. There would be a lot of redundancy as I would store the same data in several entities for all the combinations of filter values which it satisfies. The disadvantage of this approach is that each product data is being stored multiple times in different entities(much more storage); but I was hoping to get a much better query performance(<2 seconds) since now my query string would contain only one or two AND or OR connected elements apart from ancestor. The ancestor would be the concatenation of the filter conditions which the user has selected to search for a product
Please let me know if I am not clear.. This is just an experimental approach that I am trying.. Another approach would have been to cache the results through a cron job periodically..
Any other suggestion to achieve a good query performance for such a website would be highly appreciated..
UPDATE(NEW STRATEGY):
i have decided to go with a model with some boolean properties(flags) for each range of each category(total such property per entity is ~14).. for one category, which had two possible values, I have three models(one having all entities of with either of the two values, and the other two for entites with each value).. so there is duplication(same data could be store twice in two entities).. also my complete product data model is a separate one.. the above model contains a key to this complete model..
i could not do away with Query class and write my own filtering(i actually did that with good success initially).. the reason is that i need to fetch results page by page(~15 results).. and i need to sort them too.. if i fetch all results and apply my own filtering, with large data set the fetching of all results takes a huge amount of time because of the large size of the results returned..
the initial development server results look good.. query execution time is <3 seconds for ~6000 matched entities.. (though i wished it to be ~1 second).. need to scale up the production datastore to test there..