
I'm trying to understand how to structure keys so that data stored in the GAE NDB Datastore scales horizontally. To take a specific scenario, I'd like to ensure that in a multi-tenant application, tenant data is partitioned in such a way that there are no hot spots as the number of tenants grows.

Given a simple model like

from google.appengine.ext import ndb

class Tenant(ndb.Model):
    name      = ndb.StringProperty  (required=True, indexed=True)
    timestamp = ndb.DateTimeProperty(required=True, indexed=True)
    tags      = ndb.StringProperty  (repeated=True, indexed=True)

Do we have any control over how the data is partitioned across storage nodes, with as little overlap between tenants as possible?

If you are developing a multi-tenant application you should also consider using namespaces to partition the data from the application's point of view, i.e. so that no bug in a query can leak information between tenants. This is not about datastore scaling, though; the other answers cover that. – Tim Hoffman

Thanks for the tip - did not know that. – karthitect
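
As a rough illustration of the namespace tip above, here is a minimal sketch. The helper name run_in_tenant_namespace is hypothetical (not from the comment), but GAE's namespace_manager API is real:

from google.appengine.api import namespace_manager

def run_in_tenant_namespace(tenant_id, fn):
    # Hypothetical helper: every datastore (and memcache) operation issued
    # while the namespace is set is isolated to that tenant's namespace.
    previous = namespace_manager.get_namespace()
    try:
        namespace_manager.set_namespace(tenant_id)
        return fn()
    finally:
        namespace_manager.set_namespace(previous)

Wrapping per-request tenant work in a helper like this means a buggy query cannot return another tenant's entities.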

2 Answers

2 votes

You don't have much control over it, but here are a few scaling tips related to the High Replication Datastore and sharding:

  • Don't use monotonically increasing keys, and don't insert in alphabetical order.
  • Don't add indexes unless you actually need to query on them, and look for ways to consolidate indexes that a single broader index can cover.
  • Pick a predictable key where your app allows it, so that when you need an entity you can fetch it by key instead of issuing a much more expensive search query. The difference is dramatic (see the sketch after this list).
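
A minimal sketch of these points, reusing the Tenant model from the question. The hashed-prefix key scheme is one common way to avoid monotonic or alphabetical keys, not something prescribed by this answer:

import hashlib

from google.appengine.ext import ndb

class Tenant(ndb.Model):
    name      = ndb.StringProperty(required=True)
    timestamp = ndb.DateTimeProperty(required=True)
    tags      = ndb.StringProperty(repeated=True, indexed=False)  # never queried, so skip the index

def tenant_key(tenant_name):
    # Prefixing the key name with a short hash spreads inserts across the
    # key space instead of clustering them alphabetically or by time.
    digest = hashlib.sha1(tenant_name).hexdigest()[:8]
    return ndb.Key(Tenant, '%s-%s' % (digest, tenant_name))

# Fetch by key: a single, cheap lookup...
tenant = tenant_key('acme').get()
# ...versus a query, which costs an index scan plus a fetch:
tenant = Tenant.query(Tenant.name == 'acme').get()

Because the hash is derived from the tenant name, the key stays predictable: any code that knows the name can rebuild the key and fetch directly.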

The general idea is to avoid write contention and to avoid doing things that defeat the built-in sharding optimizations. I'd be more worried about contention on things like counters and individual records than about how the data is split up internally by shard, since you can do more about the former than the latter (the sharded-counter sketch below is the standard remedy for counter contention).
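
For the counter case, here is a minimal sketch of the classic sharded-counter pattern. The model and helper names are mine, and NUM_SHARDS is an assumed tuning knob:

import random

from google.appengine.ext import ndb

NUM_SHARDS = 20  # assumption: tune to the expected write rate

class CounterShard(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(counter_name):
    # Writing to one of N shards picked at random means concurrent
    # writers rarely contend on the same entity group.
    shard_key = ndb.Key(CounterShard,
                        '%s-%d' % (counter_name, random.randint(0, NUM_SHARDS - 1)))
    shard = shard_key.get() or CounterShard(key=shard_key)
    shard.count += 1
    shard.put()

def get_count(counter_name):
    # Reads fan out across all shards and sum the pieces.
    keys = [ndb.Key(CounterShard, '%s-%d' % (counter_name, i))
            for i in range(NUM_SHARDS)]
    return sum(s.count for s in ndb.get_multi(keys) if s)

The trade-off is that reads cost one get_multi across all shards, so pick a shard count just large enough to absorb the write rate.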

I think you'd find this and this quite interesting. They relate to keys and the "hot tablet" issue. Don't over-optimize; worst case you can migrate your data with a map/reduce job or two. Still, it's important to understand these issues so you avoid obvious mistakes from the start.

0 votes

Datastore does it automatically. You can trust Google to manage this supremely well.