2
votes

I have a bunch of CSV files that are being loaded into HDFS in ORC format using the ETL tool Informatica. After the load into HDFS, I wanted to extract the metadata (column names, data types) of the ORC files.

But when I loaded the ORC files into Spark dataframes, all the columns are being evaludated as string type.

Sample Data:

ID|Course|Enrol_Date|Credits
123|Biology|21-03-2012 07:34:56|24
908|Linguistics|05-02-2012 11:02:36|15
564|Computer Science|18-03-2012 09:48:09|30
341|Philosophy|23-01-2012 18:12:44|10
487|Math|10-04-2012 17:00:46|20

I'm using the below commands to achieve this:

df = sqlContext.sql("SELECT * FROM orc.`<HDFS_path>`");
df.printSchema()

Sample output:

root
 |-- ID: string (nullable = true)
 |-- Course: string (nullable = true)
 |-- Enrol_Date: string (nullable = true)
 |-- Credits: string (nullable = true)

I'm totally new to Spark and HDFS. I'm trying to understand why every column is result in string type. Is this the normal behaviour when creating ORCs with csv source files (irrespective of which tool we use to do it)? Or am I not doing something correctly in spark that is causing this?

2

2 Answers

2
votes

By default, spark reads all fields as StringType . You can try below:

For inferring schema,

val data = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("<path>.csv")

For providing custom schema

import org.apache.spark.sql.types._

val customSchema = StructType(Array(
  StructField("col1", StringType, true),
  StructField("col2", IntegerType, true),
  StructField("col3", DoubleType, true))
)

val data = spark.read.format("csv").option("header", "true").schema(customSchema).load("<path>.csv")
0
votes

You should cast or use schema while import CSV file through Informatica. Since Spark ORC format does not infer Schema automatically like Spark CSV format. ORC format take schema from source file schema as it is.

Since you have not used any schema in Informatica, it has written data in default String DataType which is further taken by ORC.

There are two possible way to resolve issue:

  1. Either use Schema in CSV file (transform columns that should have data type other that String) in Informatica/Spark and load into ORC.

  2. Or use Struct or Casting in Spark to change datatype of ORC file for required columns.

Sample Demonstration:

Below is sample demonstration of How spark work with Schema. You can resemble logic of Schema of source CSV file in Informatica same as Spark give as below

Case 1: Default loading CSV file and write into ORC

scala> val df = spark.read.format("csv").option("header","true").load("/spath/stack2.csv")

//Default schema uses by Spark or Informatica for CSV file

scala> df.printSchema
root
 |-- ID: string (nullable = true)
 |-- Course: string (nullable = true)
 |-- Enrol_Date: string (nullable = true)
 |-- Credits: string (nullable = true)

//Have loaded same CSV file into ORC

scala> df.write.format("orc").mode("overwrite").save("/spath/AP_ORC")

scala> val orc = spark.read.format("orc").load("/spath/AP_ORC")

//Schema is same as Source CSV file

scala> orc.printSchema
root
 |-- ID: string (nullable = true)
 |-- Course: string (nullable = true)
 |-- Enrol_Date: string (nullable = true)
 |-- Credits: string (nullable = true)

Case 2:Transformation/inferring Schema datatype for CSV file and write into ORC

//Inferring Schema or Transform/casting of CSV data in Spark or Informatica respectively. 

scala> val df = spark.read.format("csv").option("header","true").option("inferschema", "true").load("/spath/stack2.csv")

//Transformed Schema
scala> df.printSchema
root
 |-- ID: integer (nullable = true)
 |-- Course: string (nullable = true)
 |-- Enrol_Date: string (nullable = true)
 |-- Credits: integer (nullable = true)

//Have loaded same CSV file into ORC

scala> df.write.format("orc").mode("overwrite").save("/spath/AP_ORC")

scala> val orc = spark.read.format("orc").load("/spath/AP_ORC")

 //Schema is same as Source CSV file

scala> orc.printSchema
root
 |-- ID: integer (nullable = true)
 |-- Course: string (nullable = true)
 |-- Enrol_Date: string (nullable = true)
 |-- Credits: integer (nullable = true)