I'm currently writing Parquet files with MapReduce. I configured the row group size to be 256 MB, and the HDFS block size to be 256 MB as well. The output files are around 1 GB each.
So I expected about 4 row groups in each generated file. But when I run:
parquet-tools meta path/to/my/file | grep "row group"
It gives me 63 row groups with different sizes and row counts:
row group 1: RC:69816 TS:244168913
row group 2: RC:35111 TS:117407826
row group 3: RC:18488 TS:60107388
row group 4: RC:10357 TS:33260415
row group 5: RC:7905 TS:24956045
row group 6: RC:4754 TS:15149122
row group 7: RC:3862 TS:12476651
row group 8: RC:2738 TS:9001631
row group 9: RC:2104 TS:7120040
row group 10: RC:1910 TS:6398391
row group 11: RC:1508 TS:5219072
row group 12: RC:1386 TS:4676154
row group 13: RC:1124 TS:3950635
row group 14: RC:999 TS:3518545
row group 15: RC:865 TS:3121657
row group 16: RC:774 TS:2801614
row group 17: RC:678 TS:2490904
row group 18: RC:511 TS:1996167
row group 19: RC:69808 TS:243894989
row group 20: RC:30176 TS:99585195
row group 21: RC:20678 TS:67779524
row group 22: RC:10743 TS:34547874
row group 23: RC:8258 TS:26080110
row group 24: RC:5227 TS:16456577
row group 25: RC:4136 TS:13321721
row group 26: RC:3207 TS:10272043
row group 27: RC:2437 TS:8107932
row group 28: RC:1945 TS:6563867
row group 29: RC:1561 TS:5320028
row group 30: RC:1389 TS:4809485
row group 31: RC:1206 TS:4251584
row group 32: RC:996 TS:3581746
row group 33: RC:895 TS:3203224
row group 34: RC:757 TS:2869939
row group 35: RC:653 TS:2550716
row group 36: RC:531 TS:2008746
row group 37: RC:69706 TS:244420245
row group 38: RC:32703 TS:109391929
row group 39: RC:18640 TS:60918458
row group 40: RC:10737 TS:34272225
row group 41: RC:7812 TS:24814707
row group 42: RC:5176 TS:16206655
row group 43: RC:4123 TS:13224377
row group 44: RC:3391 TS:10946649
row group 45: RC:2138 TS:7248145
row group 46: RC:1960 TS:6566944
row group 47: RC:1538 TS:5294523
row group 48: RC:1355 TS:4744634
row group 49: RC:1225 TS:4194625
row group 50: RC:1026 TS:3587484
row group 51: RC:877 TS:3134267
row group 52: RC:785 TS:2846718
row group 53: RC:675 TS:2546836
row group 54: RC:538 TS:2016450
row group 55: RC:69762 TS:244915809
row group 56: RC:32390 TS:108310300
row group 57: RC:18095 TS:58754777
row group 58: RC:10759 TS:34405301
row group 59: RC:8195 TS:26029310
row group 60: RC:5286 TS:16597963
row group 61: RC:4231 TS:13415076
row group 62: RC:3538 TS:11465640
row group 63: RC:135 TS:688850
There is a repeating pattern in the row groups. Does anyone know why Parquet does not honor my configured row group size (256 MB)?
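To make the pattern in the listing above explicit, here is a minimal sketch that parses `parquet-tools meta` lines and starts a new "run" whenever TS jumps back up. The helper names (`parse_row_groups`, `split_into_runs`) are made up for illustration, and the sample is only a few lines copied from the listing, not the full output.

```python
import re

# A few lines copied from the parquet-tools listing in the question (not all 63).
SAMPLE = """\
row group 17: RC:678 TS:2490904
row group 18: RC:511 TS:1996167
row group 19: RC:69808 TS:243894989
row group 20: RC:30176 TS:99585195
row group 21: RC:20678 TS:67779524
"""

LINE_RE = re.compile(r"row group (\d+): RC:(\d+) TS:(\d+)")


def parse_row_groups(text):
    """Return a list of (index, row_count, total_size) tuples."""
    return [tuple(int(g) for g in m.groups()) for m in LINE_RE.finditer(text)]


def split_into_runs(groups):
    """Start a new run whenever TS is larger than in the previous row group."""
    runs = []
    for group in groups:
        if not runs or group[2] > runs[-1][-1][2]:
            runs.append([group])
        else:
            runs[-1].append(group)
    return runs


runs = split_into_runs(parse_row_groups(SAMPLE))
for run in runs:
    print(f"run starts at row group {run[0][0]}, length {len(run)}")
```

Piping the full `parquet-tools meta ... | grep "row group"` output through the same functions should show TS jumping back up at row groups 19, 37 and 55, which is the cycle visible in the listing.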
[...] TS: attribute of each row group, you get exactly what you are asking for. None of them is bigger than 256,000,000 bytes (256 MB). Maybe compression is disabled? – Joha
[...] parquet.block.size is 128 MB but I get 2 MB row groups... Do you have any experience or suggestions? – Joha
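The size claim in the first comment is easy to check by hand. A rough sketch, assuming "256M" means 256 MiB and taking "around 1G" as exactly 1 GiB (both are assumptions, the question does not say); the TS values are the first row group of each cycle, copied from the listing:

```python
import math

MIB = 1024 * 1024
# Assumption: the configured "256M" is 256 MiB = 268,435,456 bytes.
configured_row_group_size = 256 * MIB

# TS of row groups 1, 19, 37 and 55, the largest ones in the listing.
largest_ts = [244168913, 243894989, 244420245, 244915809]

for ts in largest_ts:
    within_limit = ts <= configured_row_group_size
    print(f"{ts:,d} bytes = {ts / MIB:.1f} MiB, within limit: {within_limit}")

# Expected row group count if every row group were filled to the limit.
approx_file_size = 1024 * MIB  # assumption: "around 1G" taken as 1 GiB
expected_row_groups = math.ceil(approx_file_size / configured_row_group_size)
print(expected_row_groups)
```

So the largest row groups do stay under the configured limit. What this does not explain is the long tail of much smaller row groups that follows each large one.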