I'm currently writing Parquet files with MapReduce. I configured the row group size to 256 MB, and the HDFS block size to 256 MB as well. Each output file is around 1 GB.

So I expected about 4 row groups in each generated file. But when I run:

parquet-tools meta path/to/my/file | grep "row group"

It reports 63 row groups with different sizes and row counts:

row group 1:                      RC:69816 TS:244168913
row group 2:                      RC:35111 TS:117407826
row group 3:                      RC:18488 TS:60107388
row group 4:                      RC:10357 TS:33260415
row group 5:                      RC:7905 TS:24956045
row group 6:                      RC:4754 TS:15149122
row group 7:                      RC:3862 TS:12476651
row group 8:                      RC:2738 TS:9001631
row group 9:                      RC:2104 TS:7120040
row group 10:                     RC:1910 TS:6398391
row group 11:                     RC:1508 TS:5219072
row group 12:                     RC:1386 TS:4676154
row group 13:                     RC:1124 TS:3950635
row group 14:                     RC:999 TS:3518545
row group 15:                     RC:865 TS:3121657
row group 16:                     RC:774 TS:2801614
row group 17:                     RC:678 TS:2490904
row group 18:                     RC:511 TS:1996167
row group 19:                     RC:69808 TS:243894989
row group 20:                     RC:30176 TS:99585195
row group 21:                     RC:20678 TS:67779524
row group 22:                     RC:10743 TS:34547874
row group 23:                     RC:8258 TS:26080110
row group 24:                     RC:5227 TS:16456577
row group 25:                     RC:4136 TS:13321721
row group 26:                     RC:3207 TS:10272043
row group 27:                     RC:2437 TS:8107932
row group 28:                     RC:1945 TS:6563867
row group 29:                     RC:1561 TS:5320028
row group 30:                     RC:1389 TS:4809485
row group 31:                     RC:1206 TS:4251584
row group 32:                     RC:996 TS:3581746
row group 33:                     RC:895 TS:3203224
row group 34:                     RC:757 TS:2869939
row group 35:                     RC:653 TS:2550716
row group 36:                     RC:531 TS:2008746
row group 37:                     RC:69706 TS:244420245
row group 38:                     RC:32703 TS:109391929
row group 39:                     RC:18640 TS:60918458
row group 40:                     RC:10737 TS:34272225
row group 41:                     RC:7812 TS:24814707
row group 42:                     RC:5176 TS:16206655
row group 43:                     RC:4123 TS:13224377
row group 44:                     RC:3391 TS:10946649
row group 45:                     RC:2138 TS:7248145
row group 46:                     RC:1960 TS:6566944
row group 47:                     RC:1538 TS:5294523
row group 48:                     RC:1355 TS:4744634
row group 49:                     RC:1225 TS:4194625
row group 50:                     RC:1026 TS:3587484
row group 51:                     RC:877 TS:3134267
row group 52:                     RC:785 TS:2846718
row group 53:                     RC:675 TS:2546836
row group 54:                     RC:538 TS:2016450
row group 55:                     RC:69762 TS:244915809
row group 56:                     RC:32390 TS:108310300
row group 57:                     RC:18095 TS:58754777
row group 58:                     RC:10759 TS:34405301
row group 59:                     RC:8195 TS:26029310
row group 60:                     RC:5286 TS:16597963
row group 61:                     RC:4231 TS:13415076
row group 62:                     RC:3538 TS:11465640
row group 63:                     RC:135 TS:688850

There is a repeating pattern in the row groups: the sizes start near 244 MB, shrink steadily, and then jump back up. Does anyone know why Parquet does not honor my configured row group size (256 MB)?

Can you share how you are creating the file? - hlagos
If you look at the TS attribute of each row group, you get exactly what you asked for: none of them is larger than 256,000,000 bytes (256 MB). Maybe compression is disabled? - Joha
I experienced something quite similar. When I write a wide table as a Parquet file, I get a lot of very small row groups. parquet.block.size is 128 MB, but I get 2 MB row groups. Do you have any experience with this, or any suggestions? - Joha
How many columns does the file have? - Joha

1 Answer

This is an unresolved issue when writing Parquet files with Parquet-MR. The algorithm that decides when to close a row group does not take compression into account, so it creates more row groups than expected.

You can find more info about it here: https://issues.apache.org/jira/browse/PARQUET-1337
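
To see the repeating pattern more clearly, you can group the row groups into runs and total the TS values per run. The following is a rough Python sketch, not part of parquet-tools: the function names are made up for illustration, it assumes each line looks like the `row group N: RC:... TS:...` lines in the question, and it assumes a new run starts whenever TS increases compared with the previous row group. The sample is only a short excerpt of the output posted above.

```python
import re

# Matches lines such as: "row group 1:   RC:69816 TS:244168913"
ROW_GROUP_RE = re.compile(r"row group (\d+):\s+RC:(\d+)\s+TS:(\d+)")


def parse_row_groups(text):
    """Return (index, row_count, total_size) tuples found in the text."""
    return [tuple(int(g) for g in m.groups()) for m in ROW_GROUP_RE.finditer(text)]


def split_into_runs(groups):
    """Heuristic (an assumption): a run ends when TS grows again."""
    runs = []
    for group in groups:
        if not runs or group[2] > runs[-1][-1][2]:
            runs.append([])
        runs[-1].append(group)
    return runs


def summarize(runs):
    """Return (number_of_row_groups, summed_TS) for each run."""
    return [(len(run), sum(ts for _, _, ts in run)) for run in runs]


# Excerpt of the output from the question (row groups 1-3 and 19-20).
sample = """\
row group 1:                      RC:69816 TS:244168913
row group 2:                      RC:35111 TS:117407826
row group 3:                      RC:18488 TS:60107388
row group 19:                     RC:69808 TS:243894989
row group 20:                     RC:30176 TS:99585195
"""

for count, total in summarize(split_into_runs(parse_row_groups(sample))):
    print(f"{count} row groups, TS sum {total / 1e6:.1f} MB")
```

Feeding it the full `parquet-tools meta ... | grep "row group"` output should show how many runs the file contains and how much data by TS each run covers, which you can then compare against the HDFS block size and the configured row group size.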