I'm investigating the different types of NoSQL databases and I'm trying to wrap my head around the data model of column-family stores, such as Bigtable, HBase and Cassandra.
First model
Some people describe a column family as a collection of rows, where each row contains columns [1], [2]. An example of this model (column families are uppercased):
{
  "USER":
  {
    "codinghorror": { "name": "Jeff", "blog": "http://codinghorror.com/" },
    "jonskeet": { "name": "Jon Skeet", "email": "[email protected]" }
  },
  "BOOKMARK":
  {
    "codinghorror":
    {
      "http://codinghorror.com/": "My awesome blog",
      "http://unicorns.com/": "Weaponized ponies"
    },
    "jonskeet":
    {
      "http://msmvps.com/blogs/jon_skeet/": "Coding Blog",
      "http://manning.com/skeet2/": "C# in Depth, Second Edition"
    }
  }
}
Second model
Other sites describe a column family as a group of related columns within a row [3], [4]. Data from the previous example, modeled in this fashion:
{
  "codinghorror":
  {
    "USER": { "name": "Jeff", "blog": "http://codinghorror.com/" },
    "BOOKMARK":
    {
      "http://codinghorror.com/": "My awesome blog",
      "http://unicorns.com/": "Weaponized ponies"
    }
  },
  "jonskeet":
  {
    "USER": { "name": "Jon Skeet", "email": "[email protected]" },
    "BOOKMARK":
    {
      "http://msmvps.com/blogs/jon_skeet/": "Coding Blog",
      "http://manning.com/skeet2/": "C# in Depth, Second Edition"
    }
  }
}
A possible rationale behind the first model is that not all column families have a relation like USER and BOOKMARK do. This implies that not all column families contain identical keys. Placing the column families at the outer level feels more natural from this point of view.
The name 'column family' implies a group of columns. This is exactly how column families are presented in the second model.
Both models are valid representations of the data. I realize that these representations are solely for communicating the data to humans; applications don't 'think' of data in such a way.
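As a rough illustration of why both models hold the same information, here is a minimal Python sketch that treats each model as a plain nested dictionary and swaps the two outer levels. The helper name `pivot` and the trimmed-down sample data are assumptions made for this example only; none of this reflects how Bigtable, HBase or Cassandra store or expose data.

```python
def pivot(nested):
    """Swap the two outer nesting levels of a dict of dicts (hypothetical helper)."""
    result = {}
    for outer_key, inner in nested.items():
        for inner_key, value in inner.items():
            # Create the new outer level on demand, then file the value under it.
            result.setdefault(inner_key, {})[outer_key] = value
    return result


# First model: column family -> row key -> columns.
# Sample data is a reduced version of the example above; "jonskeet" has no
# USER entry here, to show that column families need not share the same keys.
by_family = {
    "USER": {
        "codinghorror": {"name": "Jeff", "blog": "http://codinghorror.com/"},
    },
    "BOOKMARK": {
        "codinghorror": {"http://codinghorror.com/": "My awesome blog"},
        "jonskeet": {"http://manning.com/skeet2/": "C# in Depth, Second Edition"},
    },
}

# Second model: row key -> column family -> columns.
by_row = pivot(by_family)

# Pivoting twice returns the original structure, so no information is lost.
assert pivot(by_row) == by_family
```

In this sketch the round trip only holds when no column family is empty, since an empty inner dictionary leaves no trace after pivoting. Under that assumption, the two models differ only in which key comes first.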
Question
What is the 'standard' definition of a column family? Is it a collection of rows, or a group of related columns within a row?
I have to write a paper on the subject, so I'm also interested in how people usually explain the 'column family' concept to others. The two models seem to contradict each other. I'd like to use the 'correct' or generally accepted model to describe column-family stores.
Update
I have settled on the second model for explaining the data model in my paper. I'm still interested in how you explain the data model of column-family stores to other people.