8
votes

If a table has a column family, are all the columns for a given row key stored in the same HFile? Or can data for one row key and one column family be spread across different HFiles? I assumed it would all be in one file because the data is sorted, but I read this in a book:

"Data from a single column family for a single row need not be stored in the same HFile." Is that because a row could be too wide to fit in a single HFile?

"The only requirement is that within an HFile, data for a row’s column family is stored together." The two statements seem a little contradictory to me.

Note: I have been reading a little more about the topic. HBase uses an LSM tree. Say I have a row key with all of its data in one HFile. Later I add some new data, which is held in memory; when the memory is full, HBase writes that data to a new HFile. So I could have qualifiers for one row key in two HFiles, and a get or scan on that row key would have to seek in both files. Over time, HBase runs a major compaction, which merges the two old HFiles into a single new HFile and deletes the old ones afterwards. After that, looking up that row key needs only one search. Am I right? I also don't understand why there are both minor and major compactions, because they seem to do the same thing.

3 Answers

10
votes

A column family is a collection of HFiles. If you look at the directory structure of a table, it looks like this:

  1. /table/region-id/column-family1/[list of HFiles]
  2. /table/region-id/column-family2/[list of HFiles]

These HFiles are immutable and sorted. When reading, the scanner (which reads the data) takes all HFiles into account when it reads the data for a row key and a given column family.

So the first statement, that data from a single column family for a single row need not be stored in the same HFile, is true.

The second statement, that within an HFile the data for a row's column family is stored together, follows from the fact that the data in an HFile is sorted: in any given HFile, the data belonging to a row key sits together.
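The read path described in this answer can be sketched with a toy model. This is only an illustration, not HBase code: the cell layout `(row_key, qualifier, timestamp, value)` and the helper names `flush` and `get_row` are assumptions made up for the example. It shows one row's data for one column family living in two sorted "HFiles" and a read that merges them.

```python
import heapq

# Hypothetical cell layout: (row_key, qualifier, timestamp, value).
# Cells are ordered by row, then qualifier, then newest timestamp first.
def sort_key(cell):
    return (cell[0], cell[1], -cell[2])

def flush(memstore_cells):
    """Turn in-memory cells into a sorted, immutable 'HFile' (here: a tuple)."""
    return tuple(sorted(memstore_cells, key=sort_key))

def get_row(hfiles, row_key):
    """Merge every HFile of one column family; keep the newest value per qualifier."""
    result = {}
    for row, qualifier, _ts, value in heapq.merge(*hfiles, key=sort_key):
        if row == row_key and qualifier not in result:
            result[qualifier] = value
    return result

# First flush: row1 and row2 land in the same file, each row's cells together.
hfile_1 = flush([("row2", "name", 1, "bob"), ("row1", "name", 1, "alice")])
# Later flush: more data for row1 ends up in a second file.
hfile_2 = flush([("row1", "name", 3, "alicia"), ("row1", "email", 2, "a@example.com")])

print(get_row([hfile_1, hfile_2], "row1"))
```

Within each tuple the cells of `row1` are adjacent because the file is sorted, yet a complete read of `row1` still has to consult both files.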

1
votes

Yes, that is correct. The difference is:

Minor compactions are designed to minimally harm HBase performance, so there is an upper limit on the number of HFiles involved. These are relatively lightweight and happen more frequently. Major compactions are the only chance HBase has to clean up deleted records. Resolving a delete requires removing both the deleted record and the deletion marker. There’s no guarantee that both the record and marker are in the same HFile.

Also, minor compactions are triggered each time a memstore is flushed and merge only some of the store files, while major compactions run about every 24 hours and merge all store files into one. The 24-hour interval is adjusted by a random margin of up to 20% so that many major compactions do not happen at the same time. Major compactions can also be triggered manually, via the API or the shell.
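As a rough sketch of the random margin, here is how such a jittered interval could be computed. The function name and parameters are hypothetical, and the sketch assumes the 20% margin applies in both directions around the 24-hour period; it is not taken from the HBase source.

```python
import random

DAY_MS = 24 * 60 * 60 * 1000  # the 24-hour period, in milliseconds

def next_major_compaction_delay_ms(period_ms=DAY_MS, jitter=0.2, rng=random):
    """Return the period shifted by a random offset of at most jitter * period.

    Assumption: the offset may be negative or positive, so the result lies
    between (1 - jitter) * period and (1 + jitter) * period.
    """
    max_offset = period_ms * jitter
    return period_ms + rng.uniform(-max_offset, max_offset)

delay_hours = next_major_compaction_delay_ms() / (60 * 60 * 1000)
print(f"next major compaction in about {delay_hours:.1f} hours")
```

Because each store draws its own offset, the major compactions of many stores drift apart instead of all firing at the same moment.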

There is another difference between minor and major compactions: major compactions process delete markers, max versions, etc., while minor compactions don’t.
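The delete-marker difference can be illustrated with a simplified model. Everything here is an assumption for the sake of the example: the `DELETE` sentinel, the cell layout `(row_key, qualifier, timestamp, value)`, and the two functions are hypothetical, and max-versions handling is left out. A delete marker is taken to mask every cell of the same column with an equal or older timestamp.

```python
DELETE = object()  # hypothetical tombstone value standing in for a delete marker

def sort_key(cell):
    row, qualifier, ts, value = cell
    # Newest first; at equal timestamps the delete marker sorts before the put.
    return (row, qualifier, -ts, value is not DELETE)

def minor_compact(hfiles):
    """Merge the given HFiles into one, keeping every cell and every marker."""
    return tuple(sorted((c for f in hfiles for c in f), key=sort_key))

def major_compact(all_hfiles):
    """Merge all HFiles into one, dropping markers and the cells they mask."""
    kept = []
    deleted_at = {}  # (row, qualifier) -> timestamp of the newest delete marker
    for row, qualifier, ts, value in minor_compact(all_hfiles):
        column = (row, qualifier)
        if value is DELETE:
            deleted_at.setdefault(column, ts)
            continue  # the marker itself is not written out
        if column in deleted_at and ts <= deleted_at[column]:
            continue  # masked by a delete marker
        kept.append((row, qualifier, ts, value))
    return tuple(kept)

old_file = (("row1", "name", 1, "alice"),)
new_file = (("row1", "email", 3, "a@example.com"), ("row1", "name", 2, DELETE))

print(len(minor_compact([old_file, new_file])))  # all three cells survive
print(major_compact([old_file, new_file]))       # only the email cell remains
```

The deleted record and its marker sit in different files here, which is why only a compaction that sees all files can safely drop both.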

0
votes

Column families are stored in separate HFiles, so each column family has its own set of HFiles. This also means that the row key is duplicated across those different HFiles, which is why it is officially recommended to keep the number of column families as low as possible (<= 3 per table).