Hadoop ORC file - How it works - How to fetch metadata

Question

I am new to ORC file. I went through many blogs, but didn't get clear understanding. Please help and clarify below questions.

Can I fetch schema from ORC file? I know in Avro, schema can fetched.
How it actually provides schema evolution? I know that few columns can be added. But how to do it. The only I know, creating orc file is by loading data into hive table which store data in orc format.
How ORC files index works? What I know is for every stripe index will be maintained. But as file is not sorted how it helps looking up data in list of stripes. How it helps in skipping stripes while looking up for the data?
Is index maintained for every column. If yes, then is it not going to consume more memory?
How columnar format ORC file can fit into hive table, where values of each columns are stored together. whereas hive table is made to fetch record by record. How both will fit together?

Samson Scharfrichter Samson Scharfrichter · Accepted Answer · 2015-08-10T20:59:12

1. and 2. Use Hive and/or HCatalog to create, read, update ORC table structure in the Hive metastore (HCatalog is just a side door than enables Pig/Sqoop/Spark/whatever to access the metastore directly)

2. ALTER TABLE command allows to add/drop columns whatever the storage type, ORC included. But beware of a nasty bug that may crash vectorized reads after that (at least in V0.13 and V0.14)

3. and 4. The term "index" is rather inappropriate. Basically it's just min/max information persisted in the stripe footer at write time, then used at read time for skipping all stripes that are clearly not meeting the WHERE requirements, drastically reducing I/O in some cases (a trick that has become popular in columns stores e.g. InfoBright on MySQL, but also in Oracle Exadata appliances [dubbed "smart scan" by Oracle marketing])

5. Hive works with "row store" formats (Text, SequenceFile, AVRO) and "column store" formats (ORC, Parquet) alike. The optimizer just uses specific strategies and shortcuts on the initial Map phase -- e.g. stripe elimination, vectorized operators -- and of course the serialization/deserialization phases are a bit more elaborate with column stores.

Hadoop ORC file - How it works - How to fetch metadata

2 Answers