Following Jim's answer I did further digging, sharing my findings here as others may find them useful.
I congratulated myself for having decided from day one to build and use a generic wrapper class for all my datastore entities, subclassed by a specific class for each entity kind, instead of operating on model classes directly inheriting from ndb.Model.
Since I already had a db_data @property in this generic class for reading the entity from the datastore on demand, it was easy to plug in my own memcache-based entity caching scheme, plus tracking code to determine whether reads come from memcache or from ndb and whether they happen inside transactions. This is how it looks:
@property
def db_data(self):
    if not hasattr(self, self.attr_db_data):
        db_data = None
        if self.readonly or not ndb.in_transaction():
            # noinspection PyTypeChecker
            db_data = memcache.get(self.memcache_key)
            if db_data:
                if not isinstance(db_data, self.db_model):
                    logging.error('mismatched cached db_data kind, expected %s got type %s %s' %
                                  (self.kind, type(db_data), db_data))
                    db_data = None
                else:
                    if self.trace:
                        logging.debug('%s from memcache' % self.lid)
                    self.cached = True
        if not db_data:
            if hasattr(self, self.attr_db_key):
                db_data = self.db_key.get()
            elif hasattr(self, self.attr_key_id):
                db_data = self.db_model.get_by_id(id=self.key_id)
            else:
                raise RuntimeError('missing db_data, db_key and key_id')
            if db_data:
                # use ndb model names as strings in the list below. TODO: don't leave them on!
                show_trace = self.kind in ['']
                if self.trace or show_trace:
                    if show_trace:
                        self.print_stack(4)
                    logging.debug('%s from datastore' % self.lid)
                self.cached = False
                # noinspection PyTypeChecker
                memcache.set(self.memcache_key, db_data)
        setattr(self, self.attr_db_data, db_data)
    return getattr(self, self.attr_db_data)
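For context, the property above relies on a handful of helpers on the generic wrapper (attr_db_data, memcache_key, lid, print_stack, the readonly/trace flags). The following is only a minimal sketch of how such a base class could be wired up - the class name, constructor and defaults are my assumptions for illustration, not the actual implementation:

import logging
import traceback

# imports needed by the db_data / db_data_put code shown in this answer
from google.appengine.api import memcache
from google.appengine.ext import ndb


class DbDataWrapper(object):  # hypothetical name for the generic wrapper
    # names of the lazily-set instance attributes used via hasattr()/setattr()
    attr_db_data = '_db_data'
    attr_db_key = 'db_key'
    attr_key_id = 'key_id'

    db_model = None  # set to the ndb.Model subclass in each kind-specific wrapper
    kind = None      # the corresponding ndb kind name, as a string

    def __init__(self, db_key=None, key_id=None, readonly=False, trace=False):
        if db_key is not None:
            self.db_key = db_key
        if key_id is not None:
            self.key_id = key_id
        self.readonly = readonly    # allows memcache hits even inside transactions
        self.trace = trace          # per-instance debug logging switch
        self.cached = False         # set by db_data when served from memcache
        self.put_count = 0          # incremented by db_data_put
        self.update_needed = False  # set by app code whenever db_data was modified

    @property
    def memcache_key(self):
        # one memcache entry per entity, namespaced by kind
        key_id = self.key_id if hasattr(self, 'key_id') else self.db_key.id()
        return '%s:%s' % (self.kind, key_id)

    @property
    def lid(self):
        # short identifier used in log messages
        return self.memcache_key

    def print_stack(self, depth):
        # log the last few stack frames to locate the offending caller
        logging.debug(''.join(traceback.format_stack()[-depth - 1:-1]))

Each entity kind then gets a thin subclass that just sets db_model and kind (and, where needed, its own defaults).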
With this in place I uncovered:
- a lot of unnecessary reads from ndb, most notably of the sequence configuration entities, which did not change at all during the entire sequence yet were accessed hundreds of times
- some reads for the same entities repeated inside the same request
- most importantly, the details behind my many transactions failing with TransactionFailedError(The transaction could not be committed. Please try again.) or TransactionFailedError: too much contention on these datastore entities. please try again, caused by all these unnecessary reads done from many transactions executed in parallel (sketched below). I knew, or rather suspected, the root cause - see the related Contention problems in Google App Engine question - but with these details I can now actually work towards reducing it.
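To make the contention finding more concrete (this is an illustration, not code from the app): reading an entity inside an ndb transaction pulls its entity group into that transaction, so the commit fails if anything else writes to that group in the meantime, even though the transaction never modifies the config itself. Something along these lines, with made-up SequenceConfig/SequenceState models:

from google.appengine.ext import ndb


class SequenceConfig(ndb.Model):   # effectively static during a sequence
    steps = ndb.IntegerProperty(repeated=True)


class SequenceState(ndb.Model):    # the entity actually being updated
    step = ndb.IntegerProperty(default=0)


@ndb.transactional(xg=True)
def advance_contended(state_key, config_key):
    # the config read drags its entity group into the transaction, widening
    # the window for TransactionFailedError under parallel load
    config = config_key.get()
    state = state_key.get()
    state.step = min(state.step + 1, len(config.steps))
    state.put()


def advance_cached(state_key, config):
    # reading the unchanging config outside the transaction (or from a private
    # cache like the db_data property above) keeps the transaction's footprint
    # down to the entity group it actually modifies
    @ndb.transactional
    def txn():
        state = state_key.get()
        state.step = min(state.step + 1, len(config.steps))
        state.put()
    txn()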
For the write side, I added a db_data_put() method (with many checks and tracking support) to the generic class and replaced all the .db_data.put() calls scattered across the app code with it. This is how it looks:
def db_data_put(self, force=False):
    #assert ndb.in_transaction() TODO: re-enable after clearing all violations
    self.db_key = self.db_data.put()
    show_trace = False
    if not ndb.in_transaction():
        logging.error('%s: db_data_put called outside transaction, readonly=%s' % (self.lid, self.readonly))
        show_trace = True
    if self.readonly:
        logging.error('%s: db_data_put called with readonly set' % self.lid)
        show_trace = True
    if force:
        if self.trace:
            logging.warn('%s: db_data_put called with force arg set' % self.lid)
    if self.cached:
        logging.error('%s: db_data_put called with cached data' % self.lid)
        show_trace = True
    if self.put_count:
        logging.warn('%s: db_data_put already called %d time(s)' % (self.lid, self.put_count))
        show_trace = True
    self.put_count += 1
    if self.update_needed:
        self.update_needed = False
    elif not force:
        if self.trace:
            logging.warn('%s: db_data_put called without update_needed set' % self.lid)
        show_trace = True
    # noinspection PyTypeChecker
    memcache.set(self.memcache_key, self.db_data)
    # use ndb model names as strings in the list below. TODO: don't leave them on!
    show_kind = self.kind in ['']
    if show_trace or show_kind:
        self.print_stack(4)
    if self.trace or show_trace or show_kind:
        logging.debug('%s: db_data_put %s' % (self.lid, self.kind))
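For illustration, this is roughly how the scattered .db_data.put() calls were replaced; SequenceStateWrapper and advance_sequence are hypothetical names reusing the wrapper sketch above:

from google.appengine.ext import ndb


@ndb.transactional
def advance_sequence(state_id):
    # hypothetical kind-specific subclass of the generic wrapper
    wrapper = SequenceStateWrapper(key_id=state_id)
    wrapper.db_data.step += 1
    wrapper.update_needed = True

    # before: a direct put, invisible to any checks or tracking
    # wrapper.db_data.put()

    # after: one checked, tracked write that also refreshes the memcache entry
    wrapper.db_data_put()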
With it I uncovered some room for improvement and a few lurking bugs:
- entities written outside transactions (room for data corruption)
- entities written multiple times in response to the same request, thus violating the max 1 write/second/entity group datastore limit (see the deferred-write sketch after this list)
- several hot requests occasionally clustering very close to each other, thus contributing to the contention problem and, when writing the same entities, also violating the 1 write/sec limit
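The update_needed flag suggests one way to address the multiple-writes-per-request bullet above: let each piece of logic only mutate the cached db_data and flip the flag, then issue a single db_data_put() at the end of the transaction. A sketch under those assumptions (the wrapper class, the retries property and the needs_* helpers are all hypothetical):

from google.appengine.ext import ndb


@ndb.transactional
def handle_request(state_id):
    wrapper = SequenceStateWrapper(key_id=state_id)  # hypothetical wrapper as above

    # several independent pieces of logic may touch the same entity...
    if needs_step_advance(wrapper.db_data):
        wrapper.db_data.step += 1
        wrapper.update_needed = True
    if needs_retry_reset(wrapper.db_data):
        wrapper.db_data.retries = 0
        wrapper.update_needed = True

    # ...but the entity group sees at most one write per request, respecting the
    # 1 write/second/entity group limit and keeping db_data_put's put_count check quiet
    if wrapper.update_needed:
        wrapper.db_data_put()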
After setting the self.readonly flag on all sequence config objects and on a couple of other frequently but unnecessarily read ones (to enable caching even inside transactions - see the sketch at the end of this answer), serializing the hot writing requests, and fixing the most critical bugs found, I re-ran the clean measurement test:
- the entire sequence execution time dropped to ~13 minutes, probably thanks to serializing the hot transactions, which helped reduce the "noise" of the sequence and the overall number of logic state transitions and push queue tasks
- the number of data contention-related transaction failures dropped by ~60% - I attribute most of this to my private caching of entities not expected to change during a transaction (ndb has no clue whether an entity accessed in a transaction will eventually be written or not) and to serializing the hot transactions
- the Cloud Datastore Entity Writes dropped to 0.0055 Million Ops - probably due to the fewer sequence state transitions mentioned above
- the Cloud Datastore Read Operations dropped to 0.01 Million Ops - my caching helps during transactions
The ratio of writes to reads actually increased, invalidating my assumption that they should be comparable - the fact that most of the sequence activity happens inside transactions matters here. I guess my assumption would only hold in a non-transactional context.
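As mentioned above, the readonly fix amounted to little more than flipping the flag for the kinds that never change during a sequence. Roughly (again using the hypothetical class names from the sketches above, not the actual ones):

class SequenceConfigWrapper(DbDataWrapper):  # hypothetical kind-specific wrapper
    db_model = SequenceConfig
    kind = 'SequenceConfig'

    def __init__(self, **kwargs):
        # config entities never change while a sequence runs, so let db_data
        # serve them from memcache even inside transactions
        kwargs.setdefault('readonly', True)
        super(SequenceConfigWrapper, self).__init__(**kwargs)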