
I would like to represent millions of products that belong to one or more of dozens of categories.

I'm contemplating a few approaches:

  1. Indexed Category Nodes - Create nodes for each category and create an auto_index on category_name. Then create "isCategoryOf" relationships between each of my product nodes and their respective category nodes.

  2. Individual Category Relationship Types - Create a separate relationship type per category ("isCategoryGames", "isCategoryFood", "isCategoryLifestyle", etc.) between each product and the root node.

  3. Storing Categories as a Property of One Relationship Type - Create "isCategory" relationships between product nodes and the root node and store each category in a property of the relationship, e.g. relationship "isCategory" { categoryName: "food" }.

Which of these approaches is the most efficient and/or scalable? Are there limits or performance implications to having almost every node in the database connected to the root node?


1 Answer


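If you attach millions of nodes to the root node, you make the root node a supernode, i.e. a node with a very large number of relationships. This can be problematic, because any traversal that passes through it may have to sift through all of those relationships.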

The general concept of Option 1 shows promise. If you were modeling food, you might have nodes with a name property like "Nuts", "Dairy Products", "Desserts", "Produce" and a type property of "Category". You would then have other nodes with a name property like "Cherry Cheesecake" with outgoing "category" edges to the "Dairy Products" and "Desserts" nodes.
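To make that shape concrete, here is a minimal in-memory sketch in Python. It is not Neo4j code: the `Graph` class, its method names, and the "Product" type value are hypothetical, and only the node names and the "category" edge label come from the example above.

```python
from collections import defaultdict


class Graph:
    """Toy property graph: nodes keyed by name, directed labeled edges."""

    def __init__(self):
        self.nodes = {}                     # name -> property dict
        self.out_edges = defaultdict(list)  # name -> [(label, target name)]
        self.in_edges = defaultdict(list)   # name -> [(label, source name)]

    def add_node(self, name, **props):
        self.nodes[name] = {"name": name, **props}

    def add_edge(self, source, label, target):
        self.out_edges[source].append((label, target))
        self.in_edges[target].append((label, source))

    def categories_of(self, product):
        # Follow outgoing "category" edges from a product node.
        return [t for label, t in self.out_edges[product] if label == "category"]

    def products_in(self, category):
        # Follow incoming "category" edges into a category node.
        return [s for label, s in self.in_edges[category] if label == "category"]


graph = Graph()
for category in ("Nuts", "Dairy Products", "Desserts", "Produce"):
    graph.add_node(category, type="Category")
graph.add_node("Cherry Cheesecake", type="Product")  # "Product" is an assumed type value
graph.add_edge("Cherry Cheesecake", "category", "Dairy Products")
graph.add_edge("Cherry Cheesecake", "category", "Desserts")

print(graph.categories_of("Cherry Cheesecake"))
print(graph.products_in("Desserts"))
```

Note that each category node only collects edges from its own products, so no single node has to be connected to everything the way the root node would be in Options 2 and 3.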

Whether this structure is going to be performant enough depends on your queries. If you have broad categories like 'food', you could end up with a supernode, and finding a node with a given property then means a linear scan through everything connected to it. A linear scan over thousands of things might be fast enough for your purposes, but a scan over 1M things might not.
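The cost of that linear scan can be illustrated with a small Python sketch. This is only a stand-in for what a graph database does internally, with hypothetical names (`find_in_category`, `build_category`) and product names invented for the demo; it simply counts how many connected nodes are examined before the wanted one turns up.

```python
def find_in_category(products_by_category, category, wanted_name):
    """Scan a category's products one by one; return (match, number examined)."""
    examined = 0
    for name in products_by_category[category]:
        examined += 1
        if name == wanted_name:
            return name, examined
    return None, examined


def build_category(size):
    # Hypothetical product names: product-0, product-1, ...
    return [f"product-{i}" for i in range(size)]


for size in (1_000, 1_000_000):
    catalog = {"food": build_category(size)}
    # Worst case: the product we want is the last one attached to the category.
    _, examined = find_in_category(catalog, "food", f"product-{size - 1}")
    print(f"{size} products in 'food': examined {examined} connected nodes")
```

The work grows in step with the number of products attached to the category, which is why a broad category behaves very differently from a narrow one.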

To find out, I would recommend creating a quick prototype where you generate some random product and category nodes, then connect each product node to a random number of category nodes. Indexing the product and category nodes by name will help you find individual products or categories, but it's the traversals that will cause performance problems if you hit supernodes. Experiment with a few of the Gremlin traversals or Cypher queries that you think might be most problematic. Try scaling the number of nodes up through 1K, 10K, 100K, and 1M, with a proportionate number of edges. How do your traversal / query times change?
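As a rough outline of such a prototype, the sketch below generates the random data and times a scan in plain Python. It is an assumption-laden stand-in: the function names, the 24 categories, the limit of 3 categories per product, and the `product % 7` filter are all invented for illustration, and a real test should load the same data into your database and time the actual Gremlin traversals or Cypher queries.

```python
import random
import time


def build_graph(num_products, num_categories=24, max_categories_per_product=3, seed=42):
    """Return {category index: [product ids]} with random product-category edges."""
    rng = random.Random(seed)
    products_in = {c: [] for c in range(num_categories)}
    for product in range(num_products):
        k = rng.randint(1, max_categories_per_product)
        for category in rng.sample(range(num_categories), k):
            products_in[category].append(product)
    return products_in


def time_traversal(products_in, category):
    """Time a full scan over one category's products, standing in for a real query."""
    start = time.perf_counter()
    matches = sum(1 for product in products_in[category] if product % 7 == 0)
    return time.perf_counter() - start, matches


def run_benchmark(sizes):
    for num_products in sizes:
        graph = build_graph(num_products)
        elapsed, _ = time_traversal(graph, category=0)
        print(f"{num_products:>9} products, {len(graph[0]):>7} in category 0: "
              f"{elapsed * 1000:.2f} ms")


# Add 1_000_000 once the smaller runs finish in a reasonable time.
run_benchmark((1_000, 10_000, 100_000))
```

If the time per query grows roughly in line with the number of products in the category, the traversal is scanning the supernode, and that is the query to redesign or index around.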