My question has two parts:
how can I set (fine-tune) advanced ORC parameters using Spark?
Various posts show that there might be issues (Spark Small ORC Stripes, How to set ORC stripe size in Spark). I am currently using Spark 2.2 on an HDP 2.6.4 platform, so according to https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/td-p/189844 this should already be solved. Nonetheless, it is unclear to me how to set these parameters when executing:
df.write.orc("/path/to/file")
Perhaps it is just a:
df.write.options(Map("key" -> "value")).orc("/path/to/file")
However, I am also not quite sure which keys I would need here.
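The candidate keys come from the ORC library itself (org.apache.orc.OrcConf), not from Spark, so any orc.* property the writer understands could in principle be passed this way. A minimal sketch of such an options map, under the assumption that the native writer forwards orc.* data source options to the underlying ORC writer (the key names and defaults below are taken from OrcConf in ORC 1.4; the column names are hypothetical):

```python
# Sketch: candidate ORC writer properties as defined in org.apache.orc.OrcConf
# (ORC 1.4). Whether a given Spark build forwards these data source options to
# the writer is an assumption to verify on your cluster.
orc_options = {
    "orc.compress": "ZLIB",                    # codec; ZLIB is the gzip-style DEFLATE codec
    "orc.stripe.size": str(64 * 1024 * 1024),  # in-memory stripe target, default 64 MB
    "orc.row.index.stride": "10000",           # rows between row-index entries
    "orc.dictionary.key.threshold": "0.8",     # max distinct/total ratio for dictionary encoding
    "orc.bloom.filter.columns": "c1,c3",       # hypothetical column names
    "orc.bloom.filter.fpp": "0.05",            # bloom filter false-positive probability
}

# Hypothetical usage (PySpark; not executed here):
# df.write.options(**orc_options).orc("/path/to/file")
```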
NOTE: the 1.4
nativeversion of ORC is used..set("spark.sql.orc.impl", "native") .set("spark.sql.hive.convertMetastoreOrc", "true")
choosing the right parameters:
My data set is repartitioned and sorted using
df.repartition(number, c1, c2, ...).sortWithinPartitions("c1", "c2", "c3", ...)
i.e. a secondary sort. The order of the sort columns is chosen by the cardinality of the costly (long string) columns: the highest cardinality goes first.
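Given (approximate) per-column distinct counts, e.g. from approx_count_distinct or from the dictionary sizes that orc-tools reports, deriving that sort order is a one-liner. The cardinalities below are illustrative only (they happen to match the DICTIONARY_V2 sizes shown further down):

```python
# Sketch: order the secondary-sort columns by cardinality, highest first,
# as described above. The numbers are illustrative dictionary sizes.
cardinalities = {"c1": 16914, "c2": 214, "c3": 72863}

sort_order = sorted(cardinalities, key=cardinalities.get, reverse=True)
print(sort_order)  # highest-cardinality column first

# Hypothetical usage (PySpark; not executed here):
# df.repartition(number, "c1", "c2").sortWithinPartitions(*sort_order)
```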
file size
I want to write gzip-compressed ORC files to HDFS. I am aware of the small-files problem and obviously want to prevent it - but what about the other direction? For example, one of my data sets will generate 800 MB gzipped ORC files (single files inside a partition) if repartitioned accordingly. Are these 800 MB already considered too large? Should I try to size them at roughly 300 MB? Or 400 MB? Please keep in mind that they are already gzipped.
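One angle on this: unlike a gzipped text file, an ORC file is splittable at stripe boundaries, so a large file does not by itself limit read parallelism; what matters is how many splits the file yields relative to the HDFS block size. A back-of-the-envelope sketch, assuming a 128 MB block size (check dfs.blocksize on your cluster):

```python
import math

# Rough sketch: how many HDFS blocks (roughly, potential read splits) a single
# ORC file spans. ORC is splittable per stripe, so a large file is still read
# in parallel as long as stripes are reasonably sized.
block_size_mb = 128  # assumed dfs.blocksize
for file_size_mb in (300, 400, 800):
    splits = math.ceil(file_size_mb / block_size_mb)
    print(file_size_mb, "MB ->", splits, "potential splits")
```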
stripe size
Currently, I observe from:
java -jar orc-tools meta foo.orc
that for this file (but also for other ones) Spark seems to create stripes of roughly 16 MB each, i.e. 49 stripes in this particular case.
Here is an example of the output for the first Stripe:
Stripe 1:
Column 0: count: 3845120 hasNull: false
Column 1: count: 3845120 hasNull: false min: a max: b sum: 246087680
Column 2: count: 3845120 hasNull: false min: aa max: bb sum: 30288860
Column 3: count: 3845120 hasNull: false min: aaa max: bbb sum: 89174415
Column 4: count: 3845120 hasNull: false
Column 5: count: 3845120 hasNull: false min: 2019-09-24 00:00:00.0 max: 2019-09-24 23:45:00.0 min UTC: 2019-09-24 02:00:00.0 max UTC: 2019-09-25 01:45:00.0
Column 6: count: 3845120 hasNull: false min: 2019-09-24 00:15:00.0 max: 2019-09-25 00:00:00.0 min UTC: 2019-09-24 02:15:00.0 max UTC: 2019-09-25 02:00:00.0
Column 7: count: 3845120 hasNull: false min: 1 max: 36680 sum: 36262602
And in the detailed output after listing all the stripes (again for the first stripe):
Stripes:
Stripe: offset: 3 data: 17106250 rows: 3845120 tail: 185 index: 51578
Stream: column 0 section ROW_INDEX start: 3 length 55
Stream: column 1 section ROW_INDEX start: 58 length 21324
Stream: column 2 section ROW_INDEX start: 21382 length 3944
Stream: column 3 section ROW_INDEX start: 25326 length 12157
Stream: column 4 section ROW_INDEX start: 37483 length 55
Stream: column 5 section ROW_INDEX start: 37538 length 4581
Stream: column 6 section ROW_INDEX start: 42119 length 4581
Stream: column 7 section ROW_INDEX start: 46700 length 4881
Stream: column 1 section DATA start: 51581 length 57693
Stream: column 1 section LENGTH start: 109274 length 16
Stream: column 1 section DICTIONARY_DATA start: 109290 length 623365
Stream: column 2 section DATA start: 732655 length 447898
Stream: column 2 section LENGTH start: 1180553 length 148
Stream: column 2 section DICTIONARY_DATA start: 1180701 length 968
Stream: column 3 section DATA start: 1181669 length 2449521
Stream: column 3 section LENGTH start: 3631190 length 6138
Stream: column 3 section DICTIONARY_DATA start: 3637328 length 303255
Stream: column 5 section DATA start: 3940583 length 5329298
Stream: column 5 section SECONDARY start: 9269881 length 172
Stream: column 6 section DATA start: 9270053 length 5334123
Stream: column 6 section SECONDARY start: 14604176 length 172
Stream: column 7 section DATA start: 14604348 length 2553483
Encoding column 0: DIRECT
Encoding column 1: DICTIONARY_V2[16914]
Encoding column 2: DICTIONARY_V2[214]
Encoding column 3: DICTIONARY_V2[72863]
Encoding column 4: DIRECT
Encoding column 5: DIRECT_V2
Encoding column 6: DIRECT_V2
Encoding column 7: DIRECT_V2
What is recommended here? Hive's defaults seem to mention 256 MB, but this is a completely different value range from what Spark computes. What is the rationale here?
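One plausible reconciliation (an assumption, not verified against the writer internals): orc.stripe.size limits the uncompressed, in-memory stripe buffer, while orc-tools reports compressed on-disk bytes, so with ZLIB the on-disk stripes come out several times smaller. The arithmetic below checks that the observed numbers are at least consistent with the 64 MB default of ORC 1.4:

```python
# Sketch: reconcile the observed on-disk stripe size with a 64 MB default,
# assuming orc.stripe.size caps the *uncompressed* in-memory buffer while
# orc-tools reports *compressed* on-disk bytes.
stripe_data_bytes = 17_106_250            # "data:" field of Stripe 1 above
on_disk_mb = stripe_data_bytes / 1024 / 1024
print(round(on_disk_mb, 1), "MB on disk per stripe")

default_stripe_mb = 64                    # orc.stripe.size default (67108864 bytes)
implied_ratio = default_stripe_mb / on_disk_mb
print(round(implied_ratio, 1), "x implied compression ratio")  # plausible for ZLIB

n_stripes = 49
print(round(n_stripes * on_disk_mb), "MB total")  # consistent with the ~800 MB file
```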
So why does:
spark.conf.get("orc.dictionary.key.threshold")
java.util.NoSuchElementException: orc.dictionary.key.threshold
fail, even though the output clearly shows that dictionary encodings are in use? Looking at Spark's code base, I cannot identify this property being set anywhere: https://github.com/apache/spark/search?q=orc.dictionary.key.threshold&unscoped_q=orc.dictionary.key.threshold
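A likely explanation (an assumption on my side): orc.dictionary.key.threshold is an ORC writer property with a built-in default of 0.8, read from the Hadoop/writer configuration, not a Spark SQL conf, so spark.conf.get has nothing to return even though the writer applies the default. The decision rule itself can be sketched: a string column gets dictionary encoding when its distinct/total ratio stays at or below the threshold. Plugging in the Stripe 1 numbers from above:

```python
# Sketch of ORC's dictionary decision (default orc.dictionary.key.threshold = 0.8):
# a string column is dictionary-encoded when distinct_keys / rows <= threshold.
def uses_dictionary(distinct, rows, threshold=0.8):
    return distinct / rows <= threshold

rows = 3_845_120  # row count of Stripe 1 above
for col, distinct in {"column 1": 16914, "column 2": 214, "column 3": 72863}.items():
    # ratios are tiny (e.g. 16914 / 3845120 ~ 0.004), so all three columns
    # fall well under the default threshold -> DICTIONARY_V2, as observed
    print(col, uses_dictionary(distinct, rows))
```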
orc goodies
Recent versions of ORC introduced bloom filters and indexes. Can these also be used from Spark?
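The corresponding writer properties are orc.bloom.filter.columns and orc.bloom.filter.fpp (both real OrcConf keys; whether your Spark build forwards them as data source options is again something to verify on the cluster). Independently of ORC, the standard bloom filter sizing formula gives a feel for what a chosen fpp costs in space per stripe:

```python
import math

# Sketch: bits per element for a bloom filter at false-positive probability p,
# using the standard formula m/n = -ln(p) / ln(2)^2 (independent of ORC).
def bits_per_element(p):
    return -math.log(p) / (math.log(2) ** 2)

for p in (0.05, 0.01):
    # a lower fpp costs more bits per distinct value stored in the filter
    print(p, "->", round(bits_per_element(p), 1), "bits/element")
```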
other tuning tips
Please share any other tips for tuning with me.

