I'm having some difficulties to make sure I'm leveraging sorted data within a Hive table. (Using ORC file format)
I understand we can affect how the data is read from a Hive table, by declaring a DISTRIBUTE BY clause in the create DDL.
CREATE TABLE trades
(
trade_id INT,
name STRING,
contract_type STRING,
ts INT
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (trade_id) SORTED BY (trade_id, time) INTO 8 BUCKETS
STORED AS ORC;
This will mean that every time I make a query to this table, the data will be distributed by trade_id among the various mappers and afterward it will be sorted.
My question is:
I do not want the data to be split into N files (buckets), because the volume is not that much and I would stay with small files.
However, I do want to leverage sorted insertion.
INSERT OVERWRITE TABLE trades
PARTITION (dt)
SELECT trade_id, name, contract_type, ts, dt
FROM raw_trades
DISTRIBUTE BY trade_id
SORT BY trade_id;
Do I really need to use CLUSTERED/SORT in the create DLL statement? Or does Hive/ORC knows how to leverage the fact that the insertion process already ensured that the data is sorted?
Could it make sense to do something like:
CLUSTERED BY (trade_id) SORTED BY (trade_id, time) INTO 1 BUCKETS