3
votes

I have a directory containing ORC files. I am creating a DataFrame with the code below

var data = sqlContext.sql("SELECT * FROM orc.`/directory/containing/orc/files`");

It returns a DataFrame with this schema

[_col0: int, _col1: bigint]

Whereas the expected schema is

[scan_nbr: int, visit_nbr: bigint]

When I query files in Parquet format, I get the correct schema.
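For comparison, a minimal sketch of the two direct-path reads (the Parquet directory path is just an illustrative placeholder); only the Parquet read comes back with real column names:

// Illustrative comparison; the Parquet path below is a placeholder.
val orcDf = sqlContext.sql("SELECT * FROM orc.`/directory/containing/orc/files`")
orcDf.printSchema()      // column names come back as _col0, _col1

val parquetDf = sqlContext.sql("SELECT * FROM parquet.`/directory/containing/parquet/files`")
parquetDf.printSchema()  // column names come back as scan_nbr, visit_nbr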

Am I missing any configuration(s)?

Adding more details

This is Hortonworks Distribution HDP 2.4.2 (Spark 1.6.1, Hadoop 2.7.1, Hive 1.2.1)

We haven't changed the default configurations of HDP, but this is definitely not the same as the plain vanilla version of Hadoop.

Data is written by an upstream Hive job, a simple CTAS (CREATE TABLE sample STORED AS ORC AS SELECT ...).

I tested this on files generated by CTAS with the latest Hive 2.0.0, and it preserves the column names in the ORC files.
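To check what is actually stored in the file footers (essentially what hive --orcfiledump prints), something along these lines should work with the Hive 1.2 ORC reader classes on the classpath; the file name below is hypothetical:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.io.orc.OrcFile
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector
import scala.collection.JavaConverters._

// Open one ORC file and read the struct field names from its footer.
val reader = OrcFile.createReader(
  new Path("/directory/containing/orc/files/000000_0"),
  OrcFile.readerOptions(new Configuration()))
val inspector = reader.getObjectInspector.asInstanceOf[StructObjectInspector]
// Files written by a Hive 1.2.1 CTAS show _col0, _col1; Hive 2.0.0 shows the real names.
inspector.getAllStructFieldRefs.asScala.foreach(f => println(f.getFieldName))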

Works just fine for me. Can you add some detail? Configuration, versions, how you saved data... - zero323
Added more details. - Ramu Malur
My 2 cents: check that all files have the exact same schema; since Hive stores the "theoretical" column names in its MetaStore, it might ignore discrepancies; but since you require Spark to open the files directly, if these files show different column names, it might fall back to column numbers... cf. cwiki.apache.org/confluence/display/Hive/… - Samson Scharfrichter
@SamsonScharfrichter Looking at the ORC file dump, it's clear that the column names in the files are stored as _col0, _col1. The upstream Hive job that produces these files is a CTAS statement. Do we need to pass any additional configurations to Hive while creating these? - Ramu Malur
@SamsonScharfrichter Thanks for the pointer on the ORC file dump. This helped me identify the root cause. - Ramu Malur

5 Answers

7
votes

The problem is the Hive version: 1.2.1 has this bug, HIVE-4243.

This was fixed in Hive 2.0.0.

1
vote

Setting

sqlContext.setConf('spark.sql.hive.convertMetastoreOrc', 'false')

fixes this.
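For reference, the same setting from Scala (it can presumably also be passed at launch with --conf spark.sql.hive.convertMetastoreOrc=false on spark-submit):

// When reading metastore ORC tables, fall back to the Hive SerDe
// instead of Spark's native ORC reader.
sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")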

0
votes

If you have a Parquet version of the data as well, you can just copy the column names over, which is what I did (also, the date column was the partition key for the ORC table, so it had to be moved to the end):

import functools

# tx (Parquet) has the real column names; df (ORC) only has _col0, _col1, ...
tx = sqlContext.table("tx_parquet")
df = sqlContext.table("tx_orc")

tx_cols = tx.schema.names
tx_cols.remove('started_at_date')
tx_cols.append('started_at_date')  # move the partition column to the end

# fix the column names for the ORC DataFrame, renaming position by position
oldColumns = df.schema.names
newColumns = tx_cols
df = functools.reduce(
    lambda df, idx: df.withColumnRenamed(oldColumns[idx], newColumns[idx]),
    range(len(oldColumns)),
    df)
0
votes

We can use:

val df = hiveContext.read.table("tableName")

Then df.schema or df.columns will give the actual column names.
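A short sketch of that approach, assuming the ORC data was created as a Hive table (the table name sample comes from the CTAS in the question), so the names presumably come from the metastore rather than the ORC file footers:

// Read through the Hive metastore instead of the raw file path.
val df = hiveContext.read.table("sample")
df.printSchema()                    // scan_nbr, visit_nbr
println(df.columns.mkString(", "))  // scan_nbr, visit_nbr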

-2
votes

If a version upgrade is not an option, a quick fix could be to rewrite the ORC files using Pig. That seems to work just fine.