3
votes

I have a directory containing ORC files. I am creating a DataFrame with the code below

var data = sqlContext.sql("SELECT * FROM orc.`/directory/containing/orc/files`");

It returns a DataFrame with this schema

[_col0: int, _col1: bigint]

Whereas the expected schema is

[scan_nbr: int, visit_nbr: bigint]

When I query files in Parquet format, I get the correct schema.
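For comparison, a minimal sketch of the two direct-path reads (the Parquet directory path is just an illustrative placeholder); only the Parquet read comes back with real column names:

// Illustrative comparison; the Parquet path below is a placeholder.
val orcDf = sqlContext.sql("SELECT * FROM orc.`/directory/containing/orc/files`")
orcDf.printSchema()      // column names come back as _col0, _col1

val parquetDf = sqlContext.sql("SELECT * FROM parquet.`/directory/containing/parquet/files`")
parquetDf.printSchema()  // column names come back as scan_nbr, visit_nbr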

Am I missing any configuration(s)?

Adding more details

This is Hortonworks Distribution HDP 2.4.2 (Spark 1.6.1, Hadoop 2.7.1, Hive 1.2.1)

We haven't changed the default configurations of HDP, but this is definitely not the same as the plain vanilla version of Hadoop.

Data is written by an upstream Hive job, a simple CTAS (CREATE TABLE sample STORED AS ORC AS SELECT ...).

I tested this on files generated by CTAS with the latest Hive 2.0.0, and it preserves the column names in the ORC files.
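To check what is actually stored in the file footers (essentially what hive --orcfiledump prints), something along these lines should work with the Hive 1.2 ORC reader classes on the classpath; the file name below is hypothetical:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.io.orc.OrcFile
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector
import scala.collection.JavaConverters._

// Open one ORC file and read the struct field names from its footer.
val reader = OrcFile.createReader(
  new Path("/directory/containing/orc/files/000000_0"),
  OrcFile.readerOptions(new Configuration()))
val inspector = reader.getObjectInspector.asInstanceOf[StructObjectInspector]
// Files written by a Hive 1.2.1 CTAS show _col0, _col1; Hive 2.0.0 shows the real names.
inspector.getAllStructFieldRefs.asScala.foreach(f => println(f.getFieldName))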

Works just fine for me. Can you add some detail? Configuration, versions, how you saved data... - zero323
Added more details. - Ramu Malur
My 2 cents: check that all files have the exact same schema; since Hive stores the "theoretical" column names in its MetaStore, it might ignore discrepancies; but since you require Spark to open the files directly, if these files show different column names, it might fall back to column numbers... cf. cwiki.apache.org/confluence/display/Hive/… - Samson Scharfrichter
@SamsonScharfrichter Looking at the ORC file dump, it's clear that the column names in the files are stored as _col0, _col1. The upstream Hive job that produces these files is a CTAS statement. Do we need to pass any additional configurations to Hive while creating these? - Ramu Malur
@SamsonScharfrichter Thanks for the pointer on the ORC file dump. This helped me identify the root cause. - Ramu Malur

5 Answers

7
votes

The problem is the Hive version: 1.2.1 has this bug, HIVE-4243.

This was fixed in Hive 2.0.0.

1
vote

Setting

sqlContext.setConf('spark.sql.hive.convertMetastoreOrc', 'false')

fixes this.
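For reference, the same setting from Scala (it can presumably also be passed at launch with --conf spark.sql.hive.convertMetastoreOrc=false on spark-submit):

// When reading metastore ORC tables, fall back to the Hive SerDe
// instead of Spark's native ORC reader.
sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")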

0
votes

If you have a Parquet version of the data as well, you can just copy the column names over, which is what I did (also, the date column was the partition key for the ORC table, so it had to be moved to the end):

import functools

# tx (Parquet) has the real column names; df (ORC) only has _col0, _col1, ...
tx = sqlContext.table("tx_parquet")
df = sqlContext.table("tx_orc")

tx_cols = tx.schema.names
tx_cols.remove('started_at_date')
tx_cols.append('started_at_date')  # move the partition column to the end

# fix the column names for the ORC DataFrame, renaming position by position
oldColumns = df.schema.names
newColumns = tx_cols
df = functools.reduce(
    lambda df, idx: df.withColumnRenamed(oldColumns[idx], newColumns[idx]),
    range(len(oldColumns)),
    df)
0
votes

We can use:

val df = hiveContext.read.table("tableName")

Then df.schema or df.columns will give the actual column names.
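A short sketch of that approach, assuming the ORC data was created as a Hive table (the table name sample comes from the CTAS in the question), so the names presumably come from the metastore rather than the ORC file footers:

// Read through the Hive metastore instead of the raw file path.
val df = hiveContext.read.table("sample")
df.printSchema()                    // scan_nbr, visit_nbr
println(df.columns.mkString(", "))  // scan_nbr, visit_nbr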

-2
votes

If a version upgrade is not an option, a quick fix could be to rewrite the ORC files using Pig. That seems to work just fine.