I'm willing to store a Property Graph into HBase. A Property Graph is a graph nodes and edges have properties and multiple edges can link the same tuple of nodes as long as the edges belong to different types.
My query pattern will be either asking for properties and neighborhood or traversing the graph. An example is: Vertex[name=claudio]=>OutgoingEdge[knows]=>Vertex[gender=female], which will give me all the female people that claudio likes.
I know that a graph database does just this, but they usually don't scale on multiple nodes in case of a huge dataset. So I'm willing to implement this on a NoSQL ColumnStore (HBase, Cassandra...)
My datamodel follows.
Vertices Table:
key: vertexid (uuid)
Family "Properties:": <property name>=><property value>, ...
Family "OutgoingEdges:": <edge key>=><other vertexid>, ...
Family "IncomingEdges:": same as outgoing edges...
This table allows me to fetch quickly the properties of a vertex and its adjacency list. I can't use the vertexid as the other endpoint because multiple edges (with different types) can connect the same two vertices.
Edges Table:
key: edge key (composite(<source vertexid>, <destination vertexid>,
<edge typename>)) (i.e. vertexid1_vertexid2_knows)
Family "Properties:": <property name>=><property value>, ...
This table allows me to fetch quickly the properties of an edge.
Edges Types:
key: composite(<source vertexid>, "out|in", <edge typename>) (i.e.
vertexid1_out_knows)
Family "Neighbor:": <destination vertexid>=>null,...
This table allows me to search/scan for edges that are either incoming or outgoing from a vertex and belong to specific type and would be the core of the traversing ability of the API (so i want it to be as fast as possible both in terms of network I/O (RPCs), disk I/O (seek)). It should also "scale" on the size of the graph, meaning that with the growth of the graph the cost of this type of operation should depend on the number of edges outgoing from the vertex and not on the total number of vertices and edges. The example above i'd be considering vertexid1 the source vertex with property name:claudio i'd scan vertexid1_out_knows and receive the list of vertices connected. After that i can scan on the column "Properties:gender" on these vertices and look for those having the "female" value.
Questions:
1) General: do you see a better data model for my operations?
2) Can i fit everything in one table where for certain keys some
families would be empty (i.e. the "OutgoingEdges:" family would not make
sense for the edges)? I'd like that because as you can see all the keys
are composed by the vertexid uuid prefix, so they would be very compact
and fit mostly on the same regionserver.
3) I guess that for the scanning I'd make extensive use of Filters. I
guess regexp Filter will be my friend. Do you have concerns about
performance of filters applied to this data model?