
My question has two parts:

How can I set (fine-tune) advanced ORC parameters using Spark?

Various posts show that there might be issues (Spark Small ORC Stripes, How to set ORC stripe size in Spark). I am currently using Spark 2.2 on an HDP 2.6.4 platform, so according to https://community.cloudera.com/t5/Support-Questions/Spark-ORC-Stripe-Size/td-p/189844 this should already be solved. Nonetheless, it is unclear to me how to set these parameters when executing:

    df.write.orc("/path/to/file")

Perhaps it is just a:

    df.write.options(Map("key" -> "value")).orc("/path/to/file")

However, I am also not quite sure which keys I would need here.

NOTE: the native ORC 1.4 implementation is used:

    .set("spark.sql.orc.impl", "native")
    .set("spark.sql.hive.convertMetastoreOrc", "true")

choosing the right parameters:

My data set is repartitioned and sorted with a secondary sort using df.repartition(number, $"c1", $"c2", ...).sortWithinPartitions($"c1", $"c2", $"c3", ...). The order of the sort columns is chosen by the cardinality of the costly (long string) columns: the highest-cardinality ones go first.

file size

I want to write gzipped ORC files to HDFS. The small-file problem is something I am aware of and obviously want to prevent - but what about the other direction? For example, one of my data sets will generate 800 MB gzipped ORC files (single files inside a partition) if repartitioned accordingly. Are these 800 MB already considered too large? Should I try to size them at roughly 300 MB? Or 400 MB? Please keep in mind that they are already gzipped.

stripe size

Currently, I observe from:

java -jar orc-tools meta foo.orc

that for this file (but also for other ones) Spark seems to create stripes of roughly 16 MB in size, i.e. 49 stripes in this particular case.

Here is an example of the output for the first Stripe:

Stripe 1:
    Column 0: count: 3845120 hasNull: false
    Column 1: count: 3845120 hasNull: false min: a max: b sum: 246087680
    Column 2: count: 3845120 hasNull: false min: aa max: bb sum: 30288860
    Column 3: count: 3845120 hasNull: false min: aaa max: bbb sum: 89174415
    Column 4: count: 3845120 hasNull: false
    Column 5: count: 3845120 hasNull: false min: 2019-09-24 00:00:00.0 max: 2019-09-24 23:45:00.0 min UTC: 2019-09-24 02:00:00.0 max UTC: 2019-09-25 01:45:00.0
    Column 6: count: 3845120 hasNull: false min: 2019-09-24 00:15:00.0 max: 2019-09-25 00:00:00.0 min UTC: 2019-09-24 02:15:00.0 max UTC: 2019-09-25 02:00:00.0
    Column 7: count: 3845120 hasNull: false min: 1 max: 36680 sum: 36262602

And in the detailed output after listing all the stripes (again for the first stripe):

Stripes:
  Stripe: offset: 3 data: 17106250 rows: 3845120 tail: 185 index: 51578
    Stream: column 0 section ROW_INDEX start: 3 length 55
    Stream: column 1 section ROW_INDEX start: 58 length 21324
    Stream: column 2 section ROW_INDEX start: 21382 length 3944
    Stream: column 3 section ROW_INDEX start: 25326 length 12157
    Stream: column 4 section ROW_INDEX start: 37483 length 55
    Stream: column 5 section ROW_INDEX start: 37538 length 4581
    Stream: column 6 section ROW_INDEX start: 42119 length 4581
    Stream: column 7 section ROW_INDEX start: 46700 length 4881
    Stream: column 1 section DATA start: 51581 length 57693
    Stream: column 1 section LENGTH start: 109274 length 16
    Stream: column 1 section DICTIONARY_DATA start: 109290 length 623365
    Stream: column 2 section DATA start: 732655 length 447898
    Stream: column 2 section LENGTH start: 1180553 length 148
    Stream: column 2 section DICTIONARY_DATA start: 1180701 length 968
    Stream: column 3 section DATA start: 1181669 length 2449521
    Stream: column 3 section LENGTH start: 3631190 length 6138
    Stream: column 3 section DICTIONARY_DATA start: 3637328 length 303255
    Stream: column 5 section DATA start: 3940583 length 5329298
    Stream: column 5 section SECONDARY start: 9269881 length 172
    Stream: column 6 section DATA start: 9270053 length 5334123
    Stream: column 6 section SECONDARY start: 14604176 length 172
    Stream: column 7 section DATA start: 14604348 length 2553483
    Encoding column 0: DIRECT
    Encoding column 1: DICTIONARY_V2[16914]
    Encoding column 2: DICTIONARY_V2[214]
    Encoding column 3: DICTIONARY_V2[72863]
    Encoding column 4: DIRECT
    Encoding column 5: DIRECT_V2
    Encoding column 6: DIRECT_V2
    Encoding column 7: DIRECT_V2

What is recommended here? The Hive defaults seem to mention 256 MB, but this is a completely different value range from what Spark computes. What is the rationale here?

So why does:

    spark.conf.get("orc.dictionary.key.threshold")
    java.util.NoSuchElementException: orc.dictionary.key.threshold

fail, even though the metadata dump clearly shows that dictionaries are used? Looking at Spark's code base, I cannot find this property being set anywhere: https://github.com/apache/spark/search?q=orc.dictionary.key.threshold&unscoped_q=orc.dictionary.key.threshold

orc goodies

Recent versions of ORC introduced bloom filters and indexes. Can these also be used from Spark?

other tuning tips

Please share any other tips for tuning with me.


1 Answer


incomplete learnings

Several parts of the question are still open. Please feel free to improve this answer.

how to set tuning parameters in spark

for ORC advanced settings:

https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html

usersDF.write.format("orc")
  .option("orc.bloom.filter.columns", "favorite_color")
  .option("orc.dictionary.key.threshold", "1.0")
  .save("users_with_options.orc")

Indeed, they can simply be passed as .option to the writer. In case you want to set these when starting Spark using --conf, make sure to prefix them with spark. (e.g. spark.orc.bloom.filter.columns), as they will be ignored otherwise.
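As a sketch (untested on HDP 2.6.4), the same spark.-prefixed form could also be set once when building the session instead of per write; the column name favorite_color is just the example from the Spark docs:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: spark.-prefixed ORC properties applied at session level,
// equivalent to passing them via --conf on the command line.
val spark = SparkSession.builder()
  .config("spark.orc.bloom.filter.columns", "favorite_color")
  .config("spark.orc.dictionary.key.threshold", "1.0")
  .getOrCreate()
```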

choosing parameters

file size & stripe size

Choosing the right file size is important; larger tends to be better. In fact, I observed a difference of about 1 GB in storage for 5 vs. 10 files of the same data set (5 files required less storage).

https://community.cloudera.com/t5/Community-Articles/ORC-Creation-Best-Practices/ta-p/248963 ORC files are splittable at the stripe level. Stripe size is configurable and should depend on the average length (size) of records and on how many unique values of those sorted fields you can have. If the search-by field is unique (or almost unique), decrease the stripe size; if it is heavily repeated, increase it. While the default is 64 MB, keep the stripe size between ¼ of the block size and 4 block sizes (the default ORC block size is 256 MB).

This translates to: larger stripes are better, but more time-consuming to create during the load process (a tradeoff).
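As a sketch, a larger stripe size could be requested per write via the orc.stripe.size property (value in bytes); whether it is actually honored depends on the Spark/ORC version in use, which is part of what this question is about:

```scala
// Sketch: request 128 MB stripes (the property is specified in bytes).
df.write
  .option("orc.stripe.size", (128 * 1024 * 1024).toString)
  .option("orc.compress", "ZLIB")  // gzip-style compression, as used above
  .orc("/path/to/file")
```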

orc goodies

Indexes are dropped in Hive 3.0 from Hive's side, as their functionality is implemented directly in the ORC file (min-max indexes are very effective for range queries when the data is sorted; bloom filters for equi-join conditions): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Indexing

Additionally, creating a bloom filter makes sense, but comes with a tradeoff in storage and write time. Once bloom filters are created, they can be inspected as outlined before with orc-tools:

Stripes:
  Stripe: offset: 3 data: 20833464 rows: 3475000 tail: 256 index: 3981255
    Stream: column 0 section ROW_INDEX start: 3 length 52
    Stream: column 0 section BLOOM_FILTER start: 55 length 17940
    Stream: column 1 section ROW_INDEX start: 17995 length 31010
    Stream: column 1 section BLOOM_FILTER start: 49005 length 610564
    Stream: column 2 section ROW_INDEX start: 659569 length 4085
    Stream: column 2 section BLOOM_FILTER start: 663654 length 378695
    Stream: column 3 section ROW_INDEX start: 1042349 length 11183
    Stream: column 3 section BLOOM_FILTER start: 1053532 length 1936342

Sorting is crucial (https://community.cloudera.com/t5/Community-Articles/ORC-Creation-Best-Practices/ta-p/248963) and should be performed as a secondary sort (as already outlined in the question).
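The repartition-plus-secondary-sort layout from the question could be sketched as follows (the number of files and the column names are placeholders):

```scala
import org.apache.spark.sql.functions.col

// Sketch: partition by the bucketing columns, then sort within each
// partition so that high-cardinality, costly string columns come first.
val sorted = df
  .repartition(numberOfFiles, col("c1"), col("c2"))
  .sortWithinPartitions(col("c1"), col("c2"), col("c3"))

sorted.write.orc("/path/to/file")
```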

parameters

Parameters which seem useful and do not require very time-intensive fine-tuning:

orc.dictionary.key.threshold=0.95  # force dictionaries (almost) always; seems useful for almost all (non-streaming) use cases
orc.bloom.filter.columns="c1,c2"   # do not use "*": select only the desired columns to save space

Additionally, orc.column.encoding.direct (https://orc.apache.org/specification/ORCv1/, search there for the various encodings) might make sense.
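Put together, these per-write options could look like the following sketch (column names are placeholders):

```scala
// Sketch: dictionary threshold plus bloom filters for selected columns.
df.write
  .option("orc.dictionary.key.threshold", "0.95")
  .option("orc.bloom.filter.columns", "c1,c2")
  .orc("/path/to/file")
```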

Spark suggests (https://spark.apache.org/docs/latest/cloud-integration.html):

spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true
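As a sketch, these settings can also be applied on an existing session at runtime:

```scala
// Sketch: the suggested ORC-related settings applied at runtime.
spark.conf.set("spark.sql.orc.filterPushdown", "true")
spark.conf.set("spark.sql.orc.splits.include.file.footer", "true")
spark.conf.set("spark.sql.orc.cache.stripe.details.size", "10000")
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")
```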

additional good reads

https://www.slideshare.net/BenjaminLeonhardi/hive-loading-data

what remains open

  • more information on choosing the right parameters
  • why are the stripe sizes in Spark so much smaller (16-20 MB vs. the recommended 64 MB)? Maybe I need to tinker with the stride size.
  • why do they stay so small (even when trying to increase them)? Remember: Spark 2.2.x on HDP 2.6.4 with native ORC support, where this should already be fixed.
  • when to use bloom filters, and when are they overkill?