
I am building a graph and initially chose the only partition key that seemed logical for the data. However, the number of vertices and edges turned out to be too large for a single partition. I have not created a partitioned collection yet; I created a single 10GB collection and, because I wasn't sure how many vertices and edges I would have, I filled it up. The data is a set of categories, each with a varying number of subcategories (and subcategories of those, down to an arbitrary depth). Each category has an id, a name, and the market to which it applies. The partition key is currently the market. Within a single market, the category/subcat/subcat/... hierarchy exhausted the 10GB partition for that market.

If all I have is a unique category id, a category name, and a market (as a vertex), plus a parentOf edge that connects a parent category to its children, then what else would make sense as a partition key? Say I have a parent category (vertex) with an id of 1 and a market of 'US', and it has 100 subcategories, each with its own id, and the corresponding 100 parentOf edges, all with the same market of 'US'. The only option I have for a partition key other than the market is the category id. The issue is, how efficient would lookups and traversals be if the children, the children of those children, and the edges are in other partitions?

How do you build a very large graph with a scenario like this?

Given an arbitrary category id, what would the performance be like to find all of its children and then walk the parentOf edges down to find every descendant in the hierarchy?

What would the partition key attribute for the edges need to be? The same partition key as the parent vertex or the same partition key as the child vertex?

Am I thinking about this wrong?


1 Answer


My recommendation for any non-trivial graph implementation is to add a generic property that every document must include, named (quite literally) partitionKey. You are then free to store the market in that field where it makes sense, and some other value where that better supports a different query pattern.
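As a minimal sketch of that idea, the helper below builds a category vertex as a plain Python dict with a generic partitionKey property. The function name, the field layout, and the composite value "US|2" are all hypothetical and only illustrate the pattern; this is not the format Cosmos uses to store vertices.

```python
from typing import Optional


def make_category_vertex(
    category_id: str,
    name: str,
    market: str,
    partition_value: Optional[str] = None,
) -> dict:
    """Build a category vertex with a generic partitionKey property.

    Hypothetical layout: market stays an ordinary property so it can still
    be filtered on, while partitionKey defaults to the market and can be
    overridden where another value suits the query pattern better.
    """
    return {
        "id": category_id,
        "label": "category",
        "name": name,
        "market": market,
        "partitionKey": market if partition_value is None else partition_value,
    }


# A small hierarchy can simply live in its market's partition.
small = make_category_vertex("1", "Electronics", "US")

# A large branch gets its own (assumed) composite value instead.
large = make_category_vertex("2", "Books", "US", partition_value="US|2")
```

Keeping market as a separate property means the choice of partitionKey value can change later without losing the ability to filter by market.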

The important thing to understand is that queries spanning multiple partitions are going to be slower than queries that stay within one. So, as far as possible, tailor your partition key to strike the best balance between reads and writes.
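To make the trade-off concrete, here is a toy in-memory sketch that counts how many distinct partition key values a "find all descendants" walk would touch under three candidate keys. The six-node hierarchy, the function names, and the "market|branch" scheme are assumptions made up for illustration; it models only key assignment, not Cosmos itself or its actual query cost.

```python
from collections import deque

# Hypothetical hierarchy: parent category id -> child category ids.
CHILDREN = {"1": ["2", "3"], "2": ["4", "5"], "3": ["6"]}
PARENT = {child: parent for parent, kids in CHILDREN.items() for child in kids}
MARKET = "US"  # every category in this toy example is in one market


def key_by_market(category_id: str) -> str:
    """Everything in a market shares one partition (the original design)."""
    return MARKET


def key_by_id(category_id: str) -> str:
    """Every category gets its own partition key value."""
    return category_id


def key_by_branch(category_id: str) -> str:
    """Assumed scheme: market plus the ancestor directly under the root."""
    node = category_id
    # Climb while the current node's parent is not the root.
    while node in PARENT and PARENT[node] in PARENT:
        node = PARENT[node]
    return f"{MARKET}|{node}"


def partitions_touched(root: str, key_fn) -> set:
    """Walk from root down to every descendant, collecting the keys visited."""
    seen = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        seen.add(key_fn(node))
        queue.extend(CHILDREN.get(node, []))
    return seen


for key_fn in (key_by_market, key_by_id, key_by_branch):
    print(key_fn.__name__, sorted(partitions_touched("2", key_fn)))
```

In this toy setup, walking down from category 2 stays in one partition under the market key or the branch key, but touches one partition per category when the key is the category id. The branch key spreads a market's data across several values while keeping each subtree together, which is one possible way to balance the two concerns.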

Ask yourself "What queries will I need to perform against this data most often?" and then adjust the partitionKey for the various documents accordingly.

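As for edges, when you add an edge between two vertices using Gremlin, Cosmos automatically places the edge document in the same partition as the out vertex. For a parentOf edge that runs from parent to child, that means the edge lands in the parent's partition.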
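The placement rule can be modelled in a few lines. This is a toy sketch only: the dict fields (outV, inV, partitionKey) and the place_edge helper are hypothetical names chosen for illustration, not the document format Cosmos uses internally.

```python
def place_edge(label: str, out_vertex: dict, in_vertex: dict) -> dict:
    """Model an edge that is stored alongside its out (source) vertex.

    The edge copies the partitionKey of the out vertex; the in vertex may
    live in a different partition.
    """
    return {
        "label": label,
        "outV": out_vertex["id"],
        "inV": in_vertex["id"],
        "partitionKey": out_vertex["partitionKey"],
    }


# Hypothetical parent and child that live in different partitions.
parent = {"id": "1", "partitionKey": "US|1"}
child = {"id": "2", "partitionKey": "US|2"}

edge = place_edge("parentOf", parent, child)
print(edge["partitionKey"])  # same value as the parent's partitionKey
```

Under this model, reading a parent's outgoing parentOf edges stays within the parent's partition, while fetching the child vertices themselves may cross partitions if they carry a different key.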