
Why is “1 HDFS block per HDFS file” described as an optimized read setup in the official Parquet documentation?


EDIT:

[Figure: layout of a Parquet file, showing the file split into row groups]

As the figure above shows, a Parquet file is made up of row groups. With 1 GB row groups and a 1 GB HDFS block size, each row group fits inside a single HDFS block, so no column chunk falls outside its block and no data has to be transferred between nodes to read a record. Given that, what is “1 HDFS block per HDFS file” for?
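
For reference, this is roughly the setup I have in mind, sketched in PySpark (the Spark-based approach, the 1 GB sizes, and the HDFS paths are assumptions for illustration, not something taken from the documentation):

    from pyspark.sql import SparkSession

    ONE_GB = 1024 * 1024 * 1024

    spark = SparkSession.builder.appName("parquet-block-alignment").getOrCreate()

    # Align the Parquet row-group size with the HDFS block size (both 1 GB),
    # so each row group maps onto a single HDFS block.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("parquet.block.size", str(ONE_GB))  # Parquet row-group size
    hadoop_conf.set("dfs.blocksize", str(ONE_GB))       # block size for newly written HDFS files

    df = spark.read.parquet("hdfs:///data/input")       # hypothetical input path
    df.write.mode("overwrite").parquet("hdfs:///data/output")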

I stumbled today upon the same official Parquet documentation and couldn't find an answer to the same question myself. I come to the same conclusion as you: as long as the row-group size equals the HDFS block size, why should it matter at all to have only 1 block per HDFS file? Did you find an answer? – wobu

1 Answer


This is essentially because Parquet is a columnar storage format. Say you have stored a 3 GB file with a block size of 1 GB. To read a whole record you need to reconstruct it from its column chunks, and if the data for each column is not in a single block (which is likely given the columnar layout), one machine has to reassemble the record, which requires transferring data from the other nodes.

EDIT:

Regarding the image below, which compares row storage with column storage: imagine that the cost column doesn't fit within your block size. That column will spill outside the block and a new block will be created. If you then want to read one complete row, the data for the cost column has to be sent from one node to another, which is not efficient. I hope that makes sense.

[Figure: comparison of row-oriented and column-oriented storage layouts]
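
If you want to see whether a column chunk in one of your own files actually crosses a block boundary, here is a small inspection sketch (it assumes pyarrow is installed, a hypothetical local copy of one file named part-00000.parquet, and a 1 GB block size):

    import pyarrow.parquet as pq

    BLOCK_SIZE = 1024 * 1024 * 1024  # assumed HDFS block size: 1 GB

    # Walk the file metadata and report the byte range of every column chunk,
    # flagging chunks whose bytes straddle a 1 GB block boundary.
    meta = pq.ParquetFile("part-00000.parquet").metadata
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            start = chunk.file_offset
            end = start + chunk.total_compressed_size
            crosses = start // BLOCK_SIZE != (end - 1) // BLOCK_SIZE
            flag = "  <-- spans a block boundary" if crosses else ""
            print(f"row group {rg}, column {chunk.path_in_schema}: "
                  f"bytes {start}-{end}{flag}")

Any chunk flagged here has its bytes in two HDFS blocks, which is exactly the case where reconstructing a record may need data from more than one datanode.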