0
votes

I have a 25TB Titan graph DB, hosted on an HBase table.

The graph holds data of my users, such as interests, friendships etc. I also keep all of the data on an SQL relational DB.

I am working on a new feature that requires me to change the schema of the User vertex, splitting it into multiple smaller vertices and edges.

How should I handle such a case? What is the best practice for such a huge change to Titan data? Should I think about re-building the graph from the SQL data, or should I migrate the existing data (billions of vertices and edges)?

1
Why not migrate from Titan to OrientDB via the GraphML or GraphSON format? - Lvca
It is an option. Can you explain why OrientDB is more suitable for my needs? - imriqwe
OrientDB and Titan are both TinkerPop compatible, so it's merely an export/import through the GraphML format. Furthermore, OrientDB has the concept of vertex and edge types, with support for inheritance to design complex domains. Transactions? In OrientDB they are ACID; the DBMS is not eventually consistent. Furthermore, we store edges separately, so you don't have to store them in an index that you cannot drop. - Lvca
What are the technologies used underneath? Can it use HBase as a backend? How is the performance of OrientDB compared to Titan? - imriqwe
About performance, try it yourself. About the backend, it's proprietary and optimized for index-free adjacency. - Lvca

1 Answer

3
votes

By and large, the approach to these sorts of very large schema changes is independent of the database technology. Unless you can afford to take the whole system offline while you make the change, you'll need to migrate the data over time, which means you'll have two versions of the data around at the same time. Without looking at the details of your suggested change, it's hard to say what your best strategy is.

Assuming your plan is "just" to take each user vertex and split it into several smaller interconnected vertices, I'll also assume that in both cases you still have a canonical user vertex you can find in a search; e.g. user 5 will be represented by either one "big vertex" or one "small vertex" connected to other vertices.

Create a process which creates the "small vertex" copy of each "big vertex", but keep the "big vertex" around, too. This will take time to run, but it will eventually finish. Edits to vertices will have to update both "big" and "small". Do your searches on just the "big" ones, since they'll still be around.
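To make the backfill-plus-dual-write phase concrete, here is a hypothetical in-memory Python sketch (not Titan/Gremlin code; the `big`, `small`, and `split_user` names and the property layout are all illustrative assumptions, not anything from the question):

```python
# Toy stand-ins for the two schemas: "big" holds one flat property dict
# per user, "small" holds the split representation (a canonical user
# vertex plus satellite pieces for interests and friendships).
big = {
    5: {"name": "alice", "interests": ["chess"], "friends": [7]},
}
small = {}

def split_user(props):
    """Build the 'small vertex' representation of one big vertex."""
    return {
        "user": {"name": props["name"]},        # canonical vertex
        "interests": list(props["interests"]),  # satellite piece
        "friends": list(props["friends"]),      # satellite piece
    }

def backfill():
    """One-off job: copy every big vertex that has no small copy yet."""
    for uid, props in big.items():
        if uid not in small:
            small[uid] = split_user(props)

def update_user(uid, key, value):
    """Dual-write: during migration, every edit touches both schemas."""
    big[uid][key] = value
    if uid in small:
        small[uid] = split_user(big[uid])

backfill()
update_user(5, "name", "alice2")
```

The point of the sketch is the shape of the process, not the data model: the backfill is idempotent (it skips users already copied), so it can be restarted, and every write path goes through one function that keeps both copies consistent.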

After some time, you will have a "small" vertex for every "big" vertex. Then you can deploy code which only does searches for the "small" vertices. After that is proven successful, you can retire the code which simultaneously edits both, and then of course run another job which deletes all the "big" vertices.

It's a pain, but when you have a reasonable amount of data in a live system, it's the only approach you can take.