5
votes

I am trying to get my head out of relational database design and into Azure Storage Tables for a Big Data pet project I am about to embark upon.

My first question is around recording version histories of my entities.

Say I have a table called Members and I want to be able to see previous versions of the members details could I set the PartitionKey as their member number and then the RowKey of the datetime stamp that it was updated?

Would this be an effective / recommended approach to this version history problem?

Thanks.

Update: Or would I be better to store the latest entry in the members table and then have a members "history" table that records the changes?

1
Will you always be querying this "Members" table of yours for a particular member? Will there be a scenario where you would want to fetch the details about all the members from this table?Gaurav Mantri
I doubt I would ever want to get a complete collection of members. I might want to get details of a "set" of members, such as "members I am following", etc.Mark Stokes

1 Answers

7
votes

You can take both approaches. Just thinking out loud, here are some of the possible approaches:

Approach 1: Keeping member information and history in a single table (1)

The way this would work is you would keep the PartitionKey as member's unique identifier and the RowKey as the timestamp from when the member's information was updated. Since you would want to get the latest information about the member, I would recommend you keep the RowKey in reverse chronological order using something like:

var rowKey = (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d20");

To get the latest information about a member, you would fetch just one row for a given PartitionKey (i.e. member ID). To get complete history about a member, you would just fetch all rows.

One problem with this approach is that you can't fetch current details about any members without knowing their member IDs.

Approach 2: Keeping member information and history in separate tables

In this approach you will maintain two tables—one for member information (let's call it Member) and other for member history (let's call it MemberHistory). The PartitionKey in Member table could be member ID and row key could be any arbitrary value (or even null value). The PartitionKey in the MemberHistory table would be member ID and the RowKey would be the timestamp when the information was updated. Again, you would want to store them in reverse chronological order.

The way this would work is that you always perform an InsertOrUpdate (or InsertOrMerge) operation on entities in Members table, while you would always perform Insert operation on entities in MembersHistory table.

While this approach is much neater and gives you the capability of extracting current information about all members, the problem you will run into is that you can't take the advantage of the transaction feature in Azure Table Storage since you're dealing with two tables. So it may happen that your main table gets updated but your history table fails to update, thus you have to cover for that scenario as well.

Approach 3: Keeping member information and history in a single table (2)

In this approach, for the main record and history record, you would keep the PartitionKey as member ID but keep the RowKey as empty for the main record and RowKey as the timestamp (again in reverse chronological order) for the history tables.

To query a member's current record, you would query something like (pseudo code):

PartitionKey == 'Member ID' && RowKey == ''

To query a member's history record, you would query something like (pseudo code):

PartitionKey == 'Member ID' && RowKey != ''

To query all members for their current record, you would query something like (pseudo code):

RowKey == ''

Please note that the query above will do a full table scan so be prepared for continuation tokens and some performance degradation.