15 votes

This is a common exception I'm getting in my application's log daily, usually 5-6 times a day with a traffic of about 1K visits/day:

db error trying to store stats
Traceback (most recent call last):
  File "/base/data/home/apps/stackprinter/1b.347728306076327132/app/utility/worker.py", line 36, in deferred_store_print_statistics
    dbcounter.increment()
  File "/base/data/home/apps/stackprinter/1b.347728306076327132/app/db/counter.py", line 28, in increment
    db.run_in_transaction(txn)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/api/datastore.py", line 1981, in RunInTransaction
    DEFAULT_TRANSACTION_RETRIES, function, *args, **kwargs)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/api/datastore.py", line 2067, in RunInTransactionCustomRetries
    ok, result = _DoOneTry(new_connection, function, args, kwargs)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/api/datastore.py", line 2105, in _DoOneTry
    if new_connection.commit():
  File "/base/python_runtime/python_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1585, in commit
    return rpc.get_result()
  File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 530, in get_result
    return self.__get_result_hook(self)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/datastore/datastore_rpc.py", line 1613, in __commit_hook
    raise _ToDatastoreError(err)
Timeout: The datastore operation timed out, or the data was temporarily unavailable.

The function that is raising the exception above is the following one:

def store_printed_question(question_id, service, title):
    def _store_TX():
        entity = Question.get_by_key_name(
            key_names='%s_%s' % (question_id, service))
        if entity:
            # The question was printed before: just bump its counter.
            entity.counter = entity.counter + 1
            entity.put()
        else:
            # First print of this question: create the entity.
            Question(key_name='%s_%s' % (question_id, service),
                     question_id=question_id,
                     service=service,
                     title=title,
                     counter=1).put()
    db.run_in_transaction(_store_TX)

Basically, store_printed_question checks whether a given question has been printed before and, if so, increments the related counter, all within a single transaction.
This function is handed off by a WebHandler to a deferred worker (roughly as sketched below) using the predefined default queue which, as you might know, has a throughput rate of five task invocations per second.
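
For context, the hand-off looks roughly like this (a minimal sketch; the handler name and the parameter plumbing are illustrative, not my actual code):

from google.appengine.ext import deferred, webapp

class PrintHandler(webapp.RequestHandler):  # illustrative handler name
    def get(self):
        question_id = self.request.get('question_id')
        service = self.request.get('service')
        title = self.request.get('title')
        # ... render the printed question for the user ...
        # Enqueue the counter update on the default push queue instead of
        # writing to the datastore inline with the request.
        deferred.defer(store_printed_question, question_id, service, title)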

On an entity with six attributes (two of them indexed), I thought that using transactions throttled by the deferred task rate limit would let me avoid datastore timeouts, but, looking at the log, this error still shows up daily.

The counter I'm storing is not that important, so I'm not worried about these timeouts; still, I'm curious why Google App Engine can't handle this task properly even at a low rate like 5 tasks per second, and whether lowering the rate could be a possible solution.
A sharded counter on each question just to avoid the timeouts seems like overkill to me.

EDIT:
I have set the rate limit to 1 task per second on the default queue; I'm still getting the same error.
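
For reference, the throttling is set in queue.yaml, along these lines (a minimal sketch, not my full queue configuration):

queue:
- name: default
  rate: 1/s        # at most one task invocation per second
  bucket_size: 1   # optional: also keeps bursts down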

2
Deferred tasks aren't "lighter weight" than regular tasks in any sense except that they're easier to write. Under the hood, they're implemented with regular handlers. In any case, that would have no impact on the overhead of the transaction itself. – Nick Johnson

2 Answers

9 votes

A query can only live for 30 seconds. See my answer to this question for some sample code to break a query up using cursors.
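
The general pattern for breaking a long query up with cursors looks roughly like this (a minimal sketch, not the code from the linked answer; the function name and batch size are illustrative, and it assumes the Question model from the question is importable):

from google.appengine.ext import deferred

def process_questions(cursor=None, batch_size=100):
    # Resume the query from wherever the previous invocation stopped.
    query = Question.all()
    if cursor:
        query.with_cursor(cursor)
    batch = query.fetch(batch_size)
    for entity in batch:
        pass  # per-entity work goes here
    if len(batch) == batch_size:
        # More entities may remain; chain another task with the new cursor
        # so that no single invocation runs up against the deadline.
        deferred.defer(process_questions, query.cursor(), batch_size)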

7 votes

Generally speaking, a timeout like this is usually because of write contention. If you've got a transaction going and you're writing a bunch of stuff to the same entity group concurrently, you run into write contention issues (a side effect of optimistic concurrency). In most cases, if you make your entity group smaller, that will usually minimize this problem.

In your specific case, based on the code above, it's most probably because you should be using a sharded counter: every print of the same question updates the same entity, so serialized writes pile up on it and eventually time out.
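
A minimal sketch of a sharded counter along those lines (model and function names are illustrative, not from the question):

import random
from google.appengine.ext import db

NUM_SHARDS = 20  # more shards = more concurrent increments before contention

class QuestionCounterShard(db.Model):  # illustrative model name
    name = db.StringProperty(required=True)   # e.g. '<question_id>_<service>'
    count = db.IntegerProperty(default=0)

def increment(name):
    """Bump one randomly chosen shard inside its own tiny entity group."""
    def _txn():
        index = random.randint(0, NUM_SHARDS - 1)
        shard_key = '%s_%d' % (name, index)
        shard = QuestionCounterShard.get_by_key_name(shard_key)
        if shard is None:
            shard = QuestionCounterShard(key_name=shard_key, name=name)
        shard.count += 1
        shard.put()
    db.run_in_transaction(_txn)

def get_count(name):
    """Sum all shards for this counter (a non-transactional read)."""
    total = 0
    for shard in QuestionCounterShard.all().filter('name =', name):
        total += shard.count
    return total

Reads become eventually consistent across shards, but each increment only contends with writes to its own shard.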

Another far less likely possibility (mentioned here only for completeness) is that the tablet your data is on is being moved.