HBase and HFiles: how are column families stored?

Question

If you have a column family, are all the columns for a rowkey in the same HFile? Could data for one rowkey and the same column family be spread across different HFiles? I ask because I thought they were sorted, but I read in a book:

"Data from a single column family for a single row need not be stored in the same HFile." Is that because a row could be too wide to fit in a single HFile?

"The only requirement is that within an HFile, data for a row’s column family is stored together." This seems a little contradictory to me.

Note: I have been reading a little about the topic. HBase uses an LSM tree. Suppose I have a rowkey with all of its data in one HFile. Later I add some new data, which is kept in memory; when the memory is full, HBase writes that data to a new HFile. So I could have qualifiers for one rowkey in two HFiles, and a get or scan on that rowkey would have to seek in both files. Over time, HBase will run a major compaction, which merges the two old HFiles into a single new HFile and deletes the old ones afterwards. After that, looking up the rowkey needs only one search. Am I right? I also don't understand why there are both minor and major compactions, because they seem to do the same thing.

Curious Curious · Accepted Answer · 2014-03-30T00:38:39

On disk, a column family is stored as a collection of HFiles within each region. If you look at the directory structure of a table, it looks like this:

/table/region-id/column-family1/[list of HFiles]
/table/region-id/column-family2/[list of HFiles]

These HFiles are immutable and sorted. When reading, the scanner (which reads the data) takes all of the HFiles into account when it reads the data for a row key and a given column family.
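To make the read path concrete, here is a minimal Python sketch of how a scanner could combine several sorted, immutable files into one view of a row. It is a toy model, not HBase's API: `Cell`, `make_hfile` and `read_row` are hypothetical names, each "HFile" is just a sorted tuple, and deletes, multiple versions and the in-memory store are ignored.

```python
import heapq
from typing import NamedTuple


class Cell(NamedTuple):
    row: str
    qualifier: str
    timestamp: int
    value: str


def sort_key(cell):
    # Row and qualifier ascending, newest timestamp first.
    return (cell.row, cell.qualifier, -cell.timestamp)


def make_hfile(cells):
    # Hypothetical stand-in for an HFile: sorted once, then never modified.
    return tuple(sorted(cells, key=sort_key))


def read_row(hfiles, row):
    # Merge all sorted files lazily, then keep the newest value per qualifier.
    latest = {}
    for cell in heapq.merge(*hfiles, key=sort_key):
        if cell.row == row and cell.qualifier not in latest:
            latest[cell.qualifier] = cell.value
    return latest


# The same row and column family, written at two different times,
# ends up in two different files.
older_hfile = make_hfile([
    Cell("row1", "name", 1, "alice"),
    Cell("row1", "city", 1, "paris"),
    Cell("row2", "name", 1, "bob"),
])
newer_hfile = make_hfile([
    Cell("row1", "email", 2, "alice@example.com"),
    Cell("row1", "city", 2, "berlin"),
])

result = read_row([older_hfile, newer_hfile], "row1")
print(result)
```

The row comes back complete even though its cells live in two files, and the newer `city` value shadows the older one because it sorts first in the merge.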

So the first statement, that data from a single column family for a single row need not be stored in the same HFile, is true: the cells of a row can be spread over several HFiles of that column family.

The second statement from the book follows from the fact that the data in an HFile is sorted, so within a given HFile, the data related to a row key is stored together.
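The contiguity argument can be checked with a small Python sketch. It assumes a single file modeled as a sorted list of `(row, qualifier, value)` tuples; `row_slice` is a hypothetical helper, not something HBase exposes.

```python
from bisect import bisect_left, bisect_right

# One toy "HFile": cells are sorted by row key, then by qualifier.
hfile = sorted([
    ("row2", "name", "bob"),
    ("row1", "name", "alice"),
    ("row3", "name", "carol"),
    ("row1", "city", "paris"),
])


def row_slice(hfile, row):
    # Because the file is sorted, all cells of one row form a single
    # contiguous range that two binary searches can locate.
    rows = [cell[0] for cell in hfile]
    start = bisect_left(rows, row)
    end = bisect_right(rows, row)
    return hfile[start:end]


cells = row_slice(hfile, "row1")
print(cells)
```

Sorting is what guarantees that the two `row1` cells sit next to each other inside this file, regardless of the order in which they were written.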

