
I have to change the schema of an ORC file. The ORC file is stored at an ADLS location.
The original schema in the ORC file has column headers (C1, C2, C3, C4).
I want to override the original schema with a new schema (created from StructType and StructField) whose column headers are (Name, Age, Sex, Time).

The Spark command I am using is: val df2 = spark.read.format("orc").schema(schema).load("path/")

As soon as I run df2.show(2, false),

the data for all the columns becomes null.

When I do not override the existing schema and run

val df2 = spark.read.format("orc").load("path/")

I get the data, but the column headers are C1, C2, C3, and C4.

Could you please tell me how to read the data with the new schema, and why my approach is not working?

Thank you in advance.

The schema comes from the ORC file; you can't control that. If you want to rename the columns, you'll need to use a second dataframe. – Andrew

1 Answer


why it is not working?

Yes, this is expected behavior. Given that your source df has columns C1, C2, etc., the .schema(...) option while reading lets you select certain columns or cast them. It only works if the given columns actually exist in the source. This option is mostly useful for text-based formats like CSV, JSON, and text, where the file itself carries no schema.

Since you are supplying the columns (Name, Age, Sex, Time) and your source does not contain them, every value comes back as null.

Could you please tell me how to read data in the new schema

Read the file normally:

val df = spark.read.format("orc").load("path/")

then explicitly rename the columns:

val df2 = df.withColumnRenamed("C1", "Name").withColumnRenamed("C2", "Age") ...
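A more compact alternative, when you want to replace all column names at once, is Dataset.toDF, which renames columns positionally. This is a sketch assuming the columns appear in the order C1, C2, C3, C4:

```scala
// Read with the schema embedded in the ORC file itself.
val df = spark.read.format("orc").load("path/")

// toDF assigns new names by position, so the number and order of names
// must match the existing columns (C1 -> Name, C2 -> Age, and so on).
val df2 = df.toDF("Name", "Age", "Sex", "Time")

df2.printSchema()  // columns now appear as Name, Age, Sex, Time
```

If you also need to change column types (e.g. cast Time to a timestamp), do that after renaming with .withColumn and cast, since the read-time schema cannot be used for that here.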