1
votes

I have a site written in PHP. It currently uses MySQL for all of its database needs (I'm open to additional DB technologies).

The system's content is interrelated. These relationships can be represented as a graph where vertices are pieces of content and edges are the relationships. I need to be able to traverse that graph. In particular I need to be able to:

  • Get the child count at a given depth (e.g. how many grandchildrean does an item have)
  • Get the cumulative child count at a given depth (e.g. how many children and grandchildren does an item have)
  • Get the maximum depth for a given root (e.g. what is the longest path from this item)
  • Get the children at a given depth (e.g. who are the grandchildren of this item)
  • Get the parents at a given depth (e.g. who are the grandparents of this item)
  • Look up which statuses (such as "hidden" or "locked") have been inherited from parents.

Because it is a graph on a dynamic system and not a tree or traditional hierarchy, there are some intricacies that I think rule out the usual SQL-based tricks (e.g. Adjacency List and Path Enumeration).

The Main Intricacies:

  • Content can have more than one child.

  • Content can have more than one parent.

  • An item's relationship graph might look different for each user. For instance, certain content might be hidden for one person but not another.

  • Items can appear more than once on a graph tree, and can appear at different path lengths (e.g. item 50 could be an immediate child while also being a 3rd generation child).

  • Graphs can contain hundreds of thousands of items.

Some Additional Intricacies:

  • Different types of content can be related (as in, a poll could be related to a forum post, or a user could be related to a community)

  • There are several different types of relationship (as in, parent/child relationship, ownership relationship, peer relationship)

  • Depending on the type of relationship, permissions and restrictions may or may not be passed from parent to child (e.g. if a parent is hidden, the child will be hidden as well, but if a peer item is hidden that status isn't passed along)

My Naive (slow) "Solutions"

Currently I am taking the naive approach using SQL. I have a single "Relationships" table with these columns:

item1ID (int)
item1TypeID (int)
item2ID (int)
item2TypeID (int)
relationshipTypeID (int)

In PHP, I dynamically generate queries full of inner self joins to look up the maximum depth, and then once that is figured out I generate a single query which traverses the hierarchy and retrieves whatever information I need. This is already too slow, even with proper indexing.

My second naive approach was going to be moving that traversal and depth lookup into stored procedures. I have no idea if this would actually create a significant speed improvement. I was also thinking of incorporating some sort of caching mechanism so I could avoid looking up maximum depths as often, but that seems like it simply avoiding the real problem.

My Question

There has to be a better way. What is it? I know there are a lot of questions and answers already on StackOverflow regarding the issue of hierarchical information in SQL, but this is not quite hierarchy -- it is a full blown graph.

Since I have strong models I can mix in another DB technology to handle the relationship side of things without ruining the existing code base. I have been looking into NoSQL solutions but I know practically nothing about them. I have also heard of "Graph Databases" (such as Neo4J) which, based on the name and the various powerpoint slides I've seen, sounds like exactly what I need. However, I don't know which ones are actually robust enough to stick around or which ones would play well with PHP.

Help me StackOverflow, you're my only hope.

1

1 Answers

1
votes

From your description, Neo4j should indeed be a very good match to the problems you are facing. For example, the relationship type support should prove useful here. There's an active community, which increases chances this graphdb will survive into the future. It has also been in production for a long time.

The PHP side of Neo4j isn't that shiny so far, but I think the REST API opens up for some interesting scenarios. There's a PHP REST client (quick intro here) being developed. Then there's an experiment with the PHP/Java bridge (I haven't tried that one myself).

Note that some of your requirements are simply put very hard problems, which can't be easily solved using any technology. For example, finding the maximum depth may be an extremely expensive operation depending on the layout of the graph. In some cases it can work out well to take a bigger hit on inserts and store for instance the "children count" on each node.

Regarding RDBMS, I've struggled with similar issues in a PHP/MySQL based system. Using stored procedures helped out regarding structuring the project, but performance actually got a little worse (this was at the time stored procedures was a new feature in MySQL). PostgreSQL performs better with complex queries in my experience, but writing real graph queries for it isn't really possible (read here and here for why this is so!)

Disclaimer: I'm part of the Neo4j team