5
votes

I am using Google Cloud Datastore through the Python client library in a Python 3 App Engine flexible environment. My Flask application creates an object and then adds it to Datastore with:

from google.cloud import datastore

ds = datastore.Client()
ds.put(entity)

In my testing, each call to put takes 0.5-1.5 seconds to complete. This does not change if I make two calls one immediately after the other.

I am wondering if the complexity of my object is the problem. It is multi-layered, something like:

object = {
    a: 1,
    ...,
    b: [
        {
            d: 2,
            ...,
            e: {
                h: 3
            }
        }
    ],
    c: [
        {
            f: 4,
            ...,
            g: {
                i: 5
            }
        }
    ]
}

which I am creating by nesting datastore.Entity objects, each initialised with something like:

entity = datastore.Entity(key=ds.key(KIND))
entity.update(object_dictionary)

Both lists are 3-4 items long. The JSON equivalent of the object is ~2-3 KB.
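For reference, the nesting looks roughly like this (a minimal sketch; the property names are placeholders for my real fields, and ds is the client created above):

# inner embedded entity: created without a key of its own
inner = datastore.Entity()
inner.update({"h": 3})

# middle entity, embedded in a repeated (list) property
middle = datastore.Entity()
middle.update({"d": 2, "e": inner})

# top-level entity: the only one with a key, and the one passed to put()
entity = datastore.Entity(key=ds.key(KIND))
entity.update({"a": 1, "b": [middle]})

ds.put(entity)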

Is this not the recommended practice? What should I be doing instead?

More info:

I do not currently wrap the put of an Entity in a transaction. put is just a thin wrapper around put_multi, and put_multi appears to create a batch, send the entity, then commit the batch.
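As far as I can tell, the single put above is roughly equivalent to this explicit batch (a sketch based on my reading of the library, not code I actually run):

batch = ds.batch()
batch.begin()
batch.put(entity)  # entity is only queued locally at this point
batch.commit()     # one Commit RPC to Datastore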

I do not specify the object's "Name/ID" (the column title in the Datastore online console). I allow the library to decide that for me:

ds.key(KIND)

where KIND is just a string specifying my collection's name. The alternative to this would be:

ds.key(KIND, <some ID>)

which I use for updating objects, rather than here where I am initially creating the object. The keys generated by the library are increasing with time, but not monotonically (e.g. id=4669294231158784, id=4686973524508672).
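In code, the two cases look roughly like this (KIND and the ID are placeholders):

# create: partial key, Datastore allocates the numeric ID at commit time
new_entity = datastore.Entity(key=ds.key(KIND))

# update: complete key pointing at an existing entity
existing_entity = datastore.Entity(key=ds.key(KIND, 4669294231158784))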

I am not 100% sure of the terminology for what I am doing ("are the entities in the same entity group, or do you use indexed properties"), but people seem to refer to the process as using an "embedded entity" (e.g. here). In the Datastore online console, under the Entities section I only have a single "kind", not separate kinds for each of my sub-objects. Does that answer your question, or can I find this out somehow?

I only have one index on the collection, on a separate ID field which is a reference to another object in a different database, for cross-database lookup.

2
Although I don't have hands-on experience with this particular Datastore library, I don't believe this is an issue with the size of the list or of its items. Could you elaborate on whether you use transactions, whether the keys are monotonically increasing (e.g. 1, 2, 3, 4), whether the entities are in the same entity group, and whether you use indexed properties (or a composite index) whose values are very close to each other (e.g. timestamps)? - Ani
Thanks for helping on this @Ani! I have updated the question with the best answers I have to your questions. Let me know what more I can do. - Jon G
Dan's answer is best practice and I strongly recommend making such changes if you haven't already. However, with NDB calls (the Datastore library for Python Standard apps) I normally see write ops complete within two- or three-digit milliseconds, so I'm surprised by how slow your ops are and I don't believe that's to be expected. Do you run your app and the Datastore in the same location? - Ani
The reason for asking for these details in my first comment is that such dense sequences of either keys or (indexed) property values may cause latency or even contention errors. Under the hood, Cloud Datastore (and also Cloud Storage for paths/file names) uses such values to decide which "tablet" stores the data (e.g. tablet 1 for IDs starting with 4669). For best latency and scaling those shouldn't be too close. For keys it's best to let Datastore generate IDs, as you have already done. - Ani
I can see that my app engine flexible region is set to us-east1. I can't see a region for datastore in the console or gcloud datastore help. Where do I find the datastore's location? - Jon G

2 Answers

2
votes

You can increase the performance of multiple consecutive writes (and reads as well) by using batch operations:

Batch operations

Cloud Datastore supports batch versions of the operations which allow you to operate on multiple objects in a single Cloud Datastore call.

Such batch calls are faster than making separate calls for each individual entity because they incur the overhead for only one service call. If multiple entity groups are involved, the work for all the groups is performed in parallel on the server side.

client.put_multi([task1, task2])
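For example, the two consecutive put calls from the question could be collapsed into a single round trip with something like this (a sketch; KIND and the two dictionaries stand in for your own data):

from google.cloud import datastore

ds = datastore.Client()

entity1 = datastore.Entity(key=ds.key(KIND))
entity1.update(first_object_dictionary)

entity2 = datastore.Entity(key=ds.key(KIND))
entity2.update(second_object_dictionary)

# one Commit RPC instead of two
ds.put_multi([entity1, entity2])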
2
votes

Aside from the batching recommendation in the other answer, there are other practices that would decrease your put time.

When you perform a "write" to Datastore, your data is actually written multiple times to multiple tables (indexes) to improve query performance. Datastore is optimized for query-time performance at the cost of a bit of write-time efficiency and storage. So, for example, if you index three ordinary fields, every write basically updates three (sorted) tables. Fields that will not be queried should normally not be indexed; this will save you both time and money.
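With the Python client library, properties can be opted out of indexing when the entity is constructed; a minimal sketch, assuming the "b" and "c" properties from the question are never queried:

entity = datastore.Entity(
    key=ds.key(KIND),
    exclude_from_indexes=('b', 'c'),  # no index tables are updated for these properties
)
entity.update(object_dictionary)
ds.put(entity)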

The effect of over-indexing is even worse when you have repeated or nested fields, because of the "exploding index" effect. Essentially, your data is "flattened" before it is stored, so having multiple indexed repeated fields results in a multiplicative increase in write cost and time.
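As a rough illustration: a composite index over two repeated properties with 3 and 4 values respectively produces 3 x 4 = 12 index entries for a single entity write, and each additional repeated property in that index multiplies the count again.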