I want to store a Property Graph in HBase. A Property Graph is a graph in which nodes and edges have properties, and multiple edges can link the same pair of nodes as long as the edges belong to different types.

My query pattern will be either asking for properties and neighborhood or traversing the graph. An example is: Vertex[name=claudio]=>OutgoingEdge[knows]=>Vertex[gender=female], which gives me all the female people that claudio knows.

I know that a graph database does just this, but graph databases usually don't scale across multiple nodes when the dataset is huge. So I want to implement this on a NoSQL column store (HBase, Cassandra, ...).

My data model follows.

Vertices Table:
key: vertexid (uuid)
Family "Properties:": <property name>=><property value>, ...
Family "OutgoingEdges:": <edge key>=><other vertexid>, ...
Family "IncomingEdges:": same as outgoing edges...

This table allows me to quickly fetch the properties of a vertex and its adjacency list. I can't use the other endpoint's vertexid alone as the column qualifier, because multiple edges (with different types) can connect the same two vertices.
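To make the model concrete, here is a minimal sketch of how writing one such vertex row might look with the HBase Java client (1.0+ style API). The table name "Vertices", the colon-less family names and the example values are illustrative, not part of the model above:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class VertexWriter {
        public static void main(String[] args) throws IOException {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table vertices = conn.getTable(TableName.valueOf("Vertices"))) {

                String vertexId = "vertexid1"; // a uuid in practice
                Put put = new Put(Bytes.toBytes(vertexId));

                // Family "Properties": one column per property name
                put.addColumn(Bytes.toBytes("Properties"), Bytes.toBytes("name"), Bytes.toBytes("claudio"));

                // Family "OutgoingEdges": composite edge key -> other endpoint's vertexid
                String edgeKey = "vertexid1_vertexid2_knows";
                put.addColumn(Bytes.toBytes("OutgoingEdges"), Bytes.toBytes(edgeKey), Bytes.toBytes("vertexid2"));

                vertices.put(put);
            }
        }
    }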

Edges Table:
key: edge key (composite(<source vertexid>, <destination vertexid>, <edge typename>)) (e.g. vertexid1_vertexid2_knows)
Family "Properties:": <property name>=><property value>, ...

This table allows me to quickly fetch the properties of an edge.
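A hedged sketch of that lookup with the same client, assuming "_" as the separator of the composite key; the class name, the method name and the "edges" table handle are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EdgeReader {
        // Fetch all properties of one edge addressed by its composite row key.
        static Result edgeProperties(Table edges, String srcId, String dstId, String type) throws IOException {
            Get get = new Get(Bytes.toBytes(srcId + "_" + dstId + "_" + type));
            get.addFamily(Bytes.toBytes("Properties"));
            return edges.get(get);
        }
    }

Because the edge key is the row key, this is a single point Get, independent of how many edges the graph contains.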

Edge Types Table:
key: composite(<source vertexid>, "out|in", <edge typename>) (e.g. vertexid1_out_knows)
Family "Neighbor:": <destination vertexid>=>null,...

This table allows me to search/scan for edges that are incoming to or outgoing from a vertex and belong to a specific type. It would be the core of the traversing ability of the API, so I want it to be as fast as possible both in terms of network I/O (RPCs) and disk I/O (seeks). It should also "scale" with the size of the graph, meaning that as the graph grows the cost of this operation should depend on the number of edges outgoing from the vertex, not on the total number of vertices and edges.

In the example above, taking vertexid1 as the source vertex with property name:claudio, I'd scan vertexid1_out_knows and receive the list of connected vertices. After that I can read the column "Properties:gender" of those vertices and keep the ones whose value is "female".
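A hedged sketch of that two-step traversal (the name=claudio => knows => gender=female example), assuming an HBase 2.x style Java client; the table names "EdgeTypes" and "Vertices", the helper name and the "_" key separator are illustrative:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class Traversal {
        static List<String> femaleKnownBy(Table edgeTypes, Table vertices, String srcId) throws IOException {
            // Step 1: scan the row(s) with key prefix "<srcId>_out_knows"; each column
            // qualifier in the "Neighbor" family is the id of a connected vertex.
            byte[] prefix = Bytes.toBytes(srcId + "_out_knows");
            Scan scan = new Scan();
            scan.withStartRow(prefix);               // bound the scan so it doesn't start at the table beginning
            scan.setFilter(new PrefixFilter(prefix));

            List<Get> neighbourGets = new ArrayList<>();
            try (ResultScanner rows = edgeTypes.getScanner(scan)) {
                for (Result row : rows) {
                    for (byte[] neighbourId : row.getFamilyMap(Bytes.toBytes("Neighbor")).keySet()) {
                        Get g = new Get(neighbourId);
                        g.addColumn(Bytes.toBytes("Properties"), Bytes.toBytes("gender"));
                        neighbourGets.add(g);
                    }
                }
            }

            // Step 2: batch-fetch Properties:gender for every neighbour and keep the "female" ones.
            List<String> matches = new ArrayList<>();
            for (Result r : vertices.get(neighbourGets)) {
                byte[] gender = r.getValue(Bytes.toBytes("Properties"), Bytes.toBytes("gender"));
                if (gender != null && "female".equals(Bytes.toString(gender))) {
                    matches.add(Bytes.toString(r.getRow()));
                }
            }
            return matches;
        }
    }

The second step is a multi-Get on the neighbour ids, so its cost grows with the out-degree of the source vertex rather than with the total size of the graph, which matches the scaling requirement above.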

Questions:

1) General: do you see a better data model for my operations?
2) Can I fit everything into one table where, for certain keys, some families would be empty (e.g. the "OutgoingEdges:" family would not make sense for edge rows)? I'd like that because, as you can see, all the keys are prefixed by the vertexid uuid, so they would be very compact and would mostly land on the same region server.
3) I guess that for the scanning I'd make extensive use of Filters, and that the regexp Filter will be my friend. Do you have concerns about the performance of filters applied to this data model?
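On question 3: for value comparisons like Properties:gender = "female" a column-value filter may be enough, and HBase also ships a RegexStringComparator for real regular-expression matching, although prefix-bounded scans on the row key are usually cheaper than a regex evaluated on every row. A minimal sketch assuming the HBase 2.x Java client, with the family and column names taken from the Vertices table above:

    import org.apache.hadoop.hbase.CompareOperator;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GenderFilterExample {
        // Build a scan over the Vertices table that only returns rows where Properties:gender == "female".
        static Scan femaleVerticesScan() {
            SingleColumnValueFilter genderFilter = new SingleColumnValueFilter(
                    Bytes.toBytes("Properties"),
                    Bytes.toBytes("gender"),
                    CompareOperator.EQUAL,
                    Bytes.toBytes("female"));
            genderFilter.setFilterIfMissing(true); // skip vertices that have no gender property at all

            Scan scan = new Scan();
            scan.setFilter(genderFilter);
            return scan;
        }
    }

Keep in mind that such a filter is still evaluated server-side on every row the scan touches, so it mainly saves RPC traffic rather than seeks; for the traversal above, narrowing the read set to the neighbour ids first (as in the earlier sketch) is what keeps the cost proportional to the out-degree.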

1 Answer


This type of model looks like a sensible starting point for Cassandra (I don't know much about HBase), but for any distributed store you will run up against problems when traversing, because traversals will cross multiple nodes.

This is why dedicated graph databases such as Neo4J use a single-node design, and try to keep all data in RAM.

Looking up properties of particular nodes or edges should work well and scale horizontally - Twitter's FlockDB (now apparently abandoned) was a notable example of this.

You also need to consider whether you need lookups other than by ID (i.e. whether you need any indexes).