I am building a graph and initially chose what seems like the only logical partition key for this data set. However, the number of vertices and edges turns out to be too large for a single partition. I have not created a partitioned collection yet; I created a single 10GB collection and filled it up, because I wasn't sure how many vertices and edges I would have. The data is a set of categories with a varying number of subcategories (and subcategories of those subcategories, down to an arbitrary depth). Each category has an id, a name, and the market to which it applies. The partition key is currently the market. Within a given market, the category/subcat/subcat/... hierarchy is large enough that it exhausted the 10GB partition for that market.
If each vertex has only a unique category id, a category name, and a market, and a parentOf edge connects each parent category to its children, then what else would make sense as a partition key? Suppose a parent category (vertex) has an id of 1 and a market of 'US', and it has 100 subcategories, each with its own id, plus the corresponding 100 parentOf edges, all with the same market of 'US'. Then the only option I have for a partition key other than the market is the category id. The issue is, how efficient would lookups and traversals be if the children, the children of those children, and the edges between them are in other partitions?
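To make the trade-off concrete, here is a minimal Python sketch of the example above (one parent plus 100 subcategories, all in market 'US'). It only models where vertices would land under each partition-key choice. The `NUM_PARTITIONS` value and the MD5-based `partition_of` function are assumptions for illustration, not how any particular database assigns partitions.

```python
from collections import Counter
import hashlib

NUM_PARTITIONS = 4  # hypothetical number of physical partitions


def partition_of(key: str) -> int:
    """Map a partition-key value to a partition with a stable hash (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS


# One parent category (id 1) and its 100 subcategories (ids 2..101), all in 'US'.
vertices = [
    {"id": str(i), "name": f"category-{i}", "market": "US"}
    for i in range(1, 102)
]

# Count how many vertices land in each partition under each key choice.
by_market = Counter(partition_of(v["market"]) for v in vertices)
by_id = Counter(partition_of(v["id"]) for v in vertices)

print("partition key = market:", dict(by_market))
print("partition key = id:    ", dict(by_id))
```

With the market as the key, every vertex hashes to the same partition, which is the situation that filled my collection. With the category id as the key, the vertices spread out, but a parent and its children are no longer guaranteed to be stored together.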
How do you build a very large graph with a scenario like this?
Given an arbitrary category id, how would a traversal perform that finds its children and then keeps walking the parentOf edges down to find every descendant in the hierarchy?
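The traversal I have in mind looks roughly like the breadth-first walk below. This is an in-memory sketch, not a real graph query: the `descendants` function, the `children_of` adjacency map, and the tiny six-category hierarchy are all made up for illustration. It counts how many distinct partition-key values the walk touches under each key choice, which is the cost I am worried about.

```python
from collections import deque


def descendants(root, children_of, key_of):
    """Walk parentOf edges breadth-first from root.

    Returns (descendant ids in visit order, set of partition-key values touched).
    """
    found = []
    keys_touched = {key_of(root)}
    visited = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children_of.get(current, []):
            if child not in visited:
                visited.add(child)
                found.append(child)
                keys_touched.add(key_of(child))
                queue.append(child)
    return found, keys_touched


# Hypothetical hierarchy: 1 -> 2, 3; 2 -> 4; 3 -> 5, 6. All categories are in 'US'.
children_of = {"1": ["2", "3"], "2": ["4"], "3": ["5", "6"]}
market_of = {cid: "US" for cid in ["1", "2", "3", "4", "5", "6"]}

ids, keys_by_market = descendants("1", children_of, lambda cid: market_of[cid])
_, keys_by_id = descendants("1", children_of, lambda cid: cid)

print(ids)
print(len(keys_by_market), "partition-key value(s) touched with market as the key")
print(len(keys_by_id), "partition-key value(s) touched with category id as the key")
```

In this toy case the whole walk stays within one partition-key value when the key is the market, but touches a different value for every vertex when the key is the category id.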
What would the partition key attribute for the edges need to be? The same partition key as the parent vertex or the same partition key as the child vertex?
Am I thinking about this wrong?