3
votes

I am a beginner to Datastore and I am wondering how I should use it to achieve what I want to do.

For example, my app needs to keep track of customers and all their purchases.

Coming from a relational database background, I would achieve this by creating [Customers] and [Purchases] tables. In Datastore, I can make [Customers] and [Purchases] kinds.

Where I am struggling is the structure of the [Purchases] kind.

If I make [Purchases] a child of the [Customers] kind, would there be one entity in [Customers] and one entity in [Purchases] that share the same key? Does this mean that inside this [Purchases] entity, I would have a property that just keeps growing with each purchase they make?

Or would I have one [Purchases] entity for each purchase they make, and in each of these entities a property that points to an entity in the [Customers] kind?

How does Datastore perform in these scenarios?

3 Answers

5
votes

Sounds like you don't fully understand ancestors. Let's go with the non-ancestor version first, which is a legitimate way to go:

class Customer(ndb.Model):
    # customer data fields
    name = ndb.StringProperty()

class Purchase(ndb.Model):
    customer = ndb.KeyProperty(kind=Customer)
    # purchase data fields
    price = ndb.IntegerProperty()

This is the basic way to go. You'll have one entity in the datastore for each customer. You'll have one entity in the datastore for each purchase, with a KeyProperty that points to the customer.

If you have a purchase and need to find the associated customer, it's right there:

purchase_entity.customer.get()

If you have a Customer, you can issue a query to find all the purchases that belong to the customer:

Purchase.query(Purchase.customer == customer_entity.key).fetch()

In this case, whenever you write either a customer or a purchase entity, the GAE datastore will write that entity to any one of the datastore machines running in the cloud that's not busy. You can get really high write throughput this way. However, when you query for all the purchases of a given customer, you just read back whatever is currently in the indexes. If a new purchase was added but the indexes have not been updated yet, you may get stale data (eventual consistency). You're stuck with this behavior unless you use ancestors.

Now as for the ancestor version. The basic concept is essentially the same. You still have a customer entity, and separate entities for each purchase. The purchase is NOT part of the customer entity. However, when you create a purchase using a customer as an ancestor, it (roughly) means that the purchase is stored on the same machine in the datastore that the customer entity was stored on. In this case, your write performance is limited to the performance of that one machine, and is advertised as about one write per second per entity group (here, a customer plus its purchases). As a benefit though, you can query that machine using an ancestor query and get an up-to-date list of all the purchases of a given customer.

The syntax for using ancestors is a bit different. The customer part is the same. However, when you create purchases, you pass the customer's key as the parent:

purchase1 = Purchase(parent=customer_entity.key)
purchase2 = Purchase(parent=customer_entity.key)

This example creates two separate purchase entities. Each purchase will have a different key, and the customer has its own key as well. However, each purchase key will have the customer_entity's key embedded in it, so you can think of the purchase key as being twice as long. You don't need to keep a separate KeyProperty() for the customer anymore, since you can find it in the purchase's key.

class Purchase(ndb.Model):
    # you don't need a KeyProperty for the customer anymore
    # purchase data fields
    price = ndb.IntegerProperty()

# fetch the customer via the parent embedded in the purchase's key
purchase.key.parent().get()
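
To make the "customer key embedded in the purchase key" idea concrete, here is a small toy model in plain Python. It is not the ndb API: `make_key` and `parent_of` are hypothetical helpers, and a key is simply modelled as a tuple of (kind, id) pairs with the root first.

```python
# Toy model of Datastore key paths; this is NOT the ndb API.
# A key is modelled as a tuple of (kind, id) pairs, root first.

def make_key(kind, entity_id, parent=None):
    """Build a key path, appending (kind, id) to the parent's path if given."""
    return (parent or ()) + ((kind, entity_id),)

def parent_of(key):
    """Return the parent key path, or None for a root entity."""
    return key[:-1] or None

customer_key = make_key("Customer", 42)
purchase1_key = make_key("Purchase", 1, parent=customer_key)
purchase2_key = make_key("Purchase", 2, parent=customer_key)

print(purchase1_key)             # (('Customer', 42), ('Purchase', 1))
print(parent_of(purchase1_key))  # (('Customer', 42),)
```

Both purchase keys are distinct, yet each one carries the customer's key as a prefix, which is why no separate customer property is needed.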

And in order to query for all the purchases of a given customer:

Purchase.query(ancestor=customer_entity.key).fetch()

The actual structure of the entities doesn't change much; it's mostly the syntax that differs. But ancestor queries are strongly consistent.
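
The difference between the two query styles can be pictured with a deliberately simplified in-memory model. This is only a sketch of the behaviour described in this answer, not how Datastore is implemented; `ToyStore` and all of its methods are made up for illustration, and keys are plain tuples of (kind, id) pairs.

```python
# Toy model of eventual vs. strong consistency; NOT how Datastore is implemented.

class ToyStore:
    def __init__(self):
        self.entities = {}   # key path -> data, updated immediately on put()
        self.index = []      # global index, updated only when apply_index() runs
        self.pending = []    # writes not yet visible in the global index

    def put(self, key, data):
        self.entities[key] = data
        self.pending.append(key)

    def apply_index(self):
        """Simulate the index catching up with earlier writes."""
        self.index.extend(self.pending)
        self.pending = []

    def global_query(self, kind):
        """Non-ancestor query: sees only what the index already contains."""
        return [k for k in self.index if k[-1][0] == kind]

    def ancestor_query(self, kind, ancestor):
        """Ancestor query: reads the entity group directly, so it is current."""
        n = len(ancestor)
        return [k for k in self.entities
                if k[-1][0] == kind and len(k) > n and k[:n] == ancestor]

store = ToyStore()
customer = (("Customer", 42),)
purchase = customer + (("Purchase", 1),)
store.put(customer, {"name": "Ann"})
store.put(purchase, {"price": 10})

print(store.global_query("Purchase"))              # [] (index not updated yet)
print(store.ancestor_query("Purchase", customer))  # [(('Customer', 42), ('Purchase', 1))]
store.apply_index()
print(store.global_query("Purchase"))              # [(('Customer', 42), ('Purchase', 1))]
```

In the real service you don't control when the index catches up, which is exactly why the non-ancestor query may briefly miss a fresh purchase.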

The third option that you kinda describe is not recommended. I'm just including it for completeness. It's a bit confusing, and would go something like this:

class Purchase(ndb.Model):
    # purchase data fields
    price = ndb.IntegerProperty()

class Customer(ndb.Model):
    purchases = ndb.StructuredProperty(Purchase, repeated=True)

This is a special case which uses ndb.StructuredProperty. In this case, you will only have a single Customer entity in the datastore. While there's a class for purchases, your purchases won't get stored as separate entities - they'll just be stored as data within the Customer entity.

There may be a couple of reasons to do this. You're only dealing with one entity, so your data fetch will be fully consistent. You also have reduced write costs when you have to update a bunch of purchases, since you're only writing a single entity. And you can still query on the properties of the Purchase class. However, this was designed for a limited number of repeated objects, not hundreds or thousands. And each entity is limited to a total size of 1MB, so you'll eventually hit that and you won't be able to add more purchases.
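
For a rough feel of where that ceiling sits, here is a back-of-the-envelope estimate. The per-purchase size of 100 bytes is a made-up assumption, 1MB is taken as 1024 * 1024 bytes, and `max_embedded_purchases` is a hypothetical helper; real stored sizes depend on your properties and their encoding.

```python
# Back-of-the-envelope estimate only; the per-purchase size is a made-up assumption.
ENTITY_LIMIT_BYTES = 1024 * 1024  # the 1MB entity limit mentioned above

def max_embedded_purchases(bytes_per_purchase, customer_overhead_bytes=0):
    """Roughly how many embedded purchases fit before the entity limit is hit."""
    return (ENTITY_LIMIT_BYTES - customer_overhead_bytes) // bytes_per_purchase

print(max_embedded_purchases(100))  # 10485
```

So even with small purchases, a busy customer could run out of room, and every write would rewrite the whole growing entity.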

3
votes

(from your personal tags I assume you are a java guy, using GAE+java)

First, don't use the ancestor relationships - this has a special purpose to define the transaction scope (aka Entity Groups). It comes with several limitations and should not be used for normal relationships between entities.

Second, do use an ORM instead of low-level API: my personal favourite is objectify. GAE also offers JDO or JPA.

In GAE relations between entities are simply created by storing a reference (a Key) to an entity inside another entity.

In your case there are two possibilities to create a one-to-many relationship between a Customer and its Purchases.

public class Customer {
    @Id
    public Long customerId;  // 'Long' identifiers are autogenerated

    // first option: parent-to-children references
    public List<Key<Purchase>> purchases; // one-to-many parent-to-child
}

public class Purchase {
    @Id
    public Long purchaseId;

    // option two: child-to-parent reference
    public Key<Customer> customer;
}

Whether you use option 1 or option 2 (or both) depends on how you plan to access the data. The difference is whether you use get or query. The difference between the two is in cost and speed, get being always faster and cheaper.

Note: references in GAE Datastore are manual, there is no referential integrity: deleting one part of a relationship will produce no warning/error from Datastore. When you remove entities it's up to your code to fix references - use transactions to update two entities consistently (hint: no need to use Entity Groups - to update two entities in a transaction you can use XG transactions, enabled by default in objectify).
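
The lack of referential integrity can be illustrated with a toy in-memory model, written in Python like the other examples in this thread. Nothing here is the Objectify or Datastore API: the two dicts stand in for the two kinds, and all function names are hypothetical.

```python
# Toy in-memory model; NOT the Objectify or Datastore API.
customers = {1: {"name": "Ann"}}
purchases = {
    10: {"price": 5, "customer": 1},
    11: {"price": 7, "customer": 1},
}

def delete_customer_naive(customer_id):
    """Delete only the customer; purchases keep pointing at the missing id."""
    del customers[customer_id]

def dangling_purchases():
    """Find purchases whose customer reference no longer resolves."""
    return sorted(pid for pid, p in purchases.items()
                  if p["customer"] not in customers)

def delete_customer_and_purchases(customer_id):
    """Application-level cleanup: remove the customer and everything referencing it."""
    for pid in [pid for pid, p in purchases.items() if p["customer"] == customer_id]:
        del purchases[pid]
    customers.pop(customer_id, None)

delete_customer_naive(1)
print(dangling_purchases())  # [10, 11]
delete_customer_and_purchases(1)
print(dangling_purchases())  # []
```

Nothing complains after the naive delete; the stale references are only found because the code goes looking for them.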

0
votes

I think the best approach in this specific case would be to use a parent structure.

class Customer(ndb.Model):
    pass

class Purchase(ndb.Model):
    pass

customer = Customer()
customer_key = customer.put()

purchase = Purchase(parent=customer_key)

You could then get all purchases of a customer using

purchases = Purchase.query(ancestor=customer_key)

or get the customer who made the purchase using

customer = purchase.key.parent().get()

If you use the purchase count a lot, it might indeed be a good idea to keep track of it on the customer. You could do that using a _pre_put_hook or _post_put_hook:

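class Customer(ndb.Model):
    # default=0 so that "count += 1" works for a customer with no purchases yet
    count = ndb.IntegerProperty(default=0)

class Purchase(ndb.Model):
    def _post_put_hook(self, future):
        # ndb passes the put's future to this hook
        # TODO check whether this is a new entity.
        customer = self.key.parent().get()
        customer.count += 1
        customer.put()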

It would also be good practice to do this in a transaction, so that the count update is rolled back when putting the purchase fails, and the other way around.

@ndb.transactional
def save_purchase(purchase):
    purchase.put()
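
The all-or-nothing behaviour a transaction gives you can be pictured with a plain-Python toy. This is not ndb's transaction mechanism: `save_purchase_atomically` and the `state` dict are hypothetical, and the rollback is simulated with a snapshot.

```python
import copy

# Toy all-or-nothing update; NOT ndb's transaction mechanism.
state = {"purchases": [], "count": 0}

def save_purchase_atomically(state, purchase, fail_after_put=False):
    """Apply both changes, or restore the snapshot if anything fails."""
    snapshot = copy.deepcopy(state)
    try:
        state["purchases"].append(purchase)
        if fail_after_put:
            raise RuntimeError("simulated failure before the counter update")
        state["count"] += 1
    except Exception:
        # roll back to the state before this call, then report the error
        state.clear()
        state.update(snapshot)
        raise

save_purchase_atomically(state, {"price": 5})
try:
    save_purchase_atomically(state, {"price": 7}, fail_after_put=True)
except RuntimeError:
    pass

print(state)  # {'purchases': [{'price': 5}], 'count': 1}
```

The failed second call leaves neither a stray purchase nor a wrong count behind, which is the guarantee you want from the transactional version.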