Spark (v2.3.2) DataFrame reads all the columns in an ORC file as string type. Is this normal behaviour?

Question

I have a bunch of CSV files that are being loaded into HDFS in ORC format using the ETL tool Informatica. After the load into HDFS, I wanted to extract the metadata (column names, data types) of the ORC files.

But when I load the ORC files into Spark dataframes, all the columns are evaluated as string type.

Sample Data:

ID|Course|Enrol_Date|Credits
123|Biology|21-03-2012 07:34:56|24
908|Linguistics|05-02-2012 11:02:36|15
564|Computer Science|18-03-2012 09:48:09|30
341|Philosophy|23-01-2012 18:12:44|10
487|Math|10-04-2012 17:00:46|20

I'm using the below commands to achieve this:

df = sqlContext.sql("SELECT * FROM orc.`<HDFS_path>`");
df.printSchema()

Sample output:

root
 |-- ID: string (nullable = true)
 |-- Course: string (nullable = true)
 |-- Enrol_Date: string (nullable = true)
 |-- Credits: string (nullable = true)

I'm totally new to Spark and HDFS. I'm trying to understand why every column results in string type. Is this the normal behaviour when creating ORC files from CSV source files (irrespective of which tool is used to do it)? Or am I doing something incorrectly in Spark that is causing this?

Sarath Chandra Vema · Accepted Answer · 2019-10-16T10:03:08

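By default, Spark's CSV reader reads every field as StringType unless you ask it to infer the schema or supply one yourself. You can try the following when reading the source CSV files: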

To infer the schema:

val data = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("<path>.csv")

To provide a custom schema:

import org.apache.spark.sql.types._

val customSchema = StructType(Array(
  StructField("col1", StringType, true),
  StructField("col2", IntegerType, true),
  StructField("col3", DoubleType, true)
))

val data = spark.read.format("csv").option("header", "true").schema(customSchema).load("<path>.csv")
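ORC files carry their own schema, so if printSchema() reports string for every column of an ORC file, the columns were most likely written as strings by the upstream load. If regenerating the files is not an option, one possible workaround is to cast the columns after loading. The following is only a sketch: the target types (integer for ID and Credits, timestamp for Enrol_Date) are assumptions based on the sample data in the question, and <HDFS_path> is the same placeholder the question uses.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}

// Reuses an existing session if there is one (e.g. in spark-shell).
val spark = SparkSession.builder().appName("orc-cast-sketch").getOrCreate()

// Load the ORC files; the schema comes from the files themselves.
val raw = spark.read.orc("<HDFS_path>")

// Cast each string column to the type we assume it should have.
// Values that cannot be converted become null rather than raising an error.
val typed = raw
  .withColumn("ID", col("ID").cast("int"))
  .withColumn("Enrol_Date", to_timestamp(col("Enrol_Date"), "dd-MM-yyyy HH:mm:ss"))
  .withColumn("Credits", col("Credits").cast("int"))

typed.printSchema()
```

Because failed conversions turn into nulls silently, it may be worth comparing null counts per column before and after the cast to confirm that the assumed types and the date pattern actually match the data.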
