You should cast the columns or apply a schema when importing the CSV file through Informatica. Unlike the Spark CSV reader, the Spark ORC format does not infer a schema on its own; it takes the schema of the source data exactly as it is. Since you did not apply any schema in Informatica, the data was written with the default String data type, and ORC kept that type.
There are two possible ways to resolve the issue:
Either apply a schema to the CSV file in Informatica/Spark (transform the columns that should have a data type other than String) and then load it into ORC.
Or use a Struct (an explicit schema) or casting in Spark to change the data type of the required columns of the ORC file.
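The second option can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the two sample rows are made up and stand in for the all-string data read back from ORC, and the choice of casting ID and Credits to integer is an assumption based on the column names in the demonstration below.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("cast-orc-columns").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical rows standing in for spark.read.format("orc").load(...) on an all-string ORC file.
val orc = Seq(
  ("1", "Maths", "2019-01-10", "4"),
  ("2", "Physics", "2019-02-15", "3")
).toDF("ID", "Course", "Enrol_Date", "Credits")

// Cast only the columns that should not stay strings; the others are left untouched.
val typed = orc
  .withColumn("ID", col("ID").cast(IntegerType))
  .withColumn("Credits", col("Credits").cast(IntegerType))

typed.printSchema()
```

When applying this to a real ORC directory, write the result to a different path than the one being read, since Spark generally refuses to overwrite a path it is reading from. Values that do not parse as integers typically become null, or raise an error when ANSI mode is enabled.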
Sample Demonstration:
Below is a sample demonstration of how Spark works with a schema. You can apply the same schema logic to the source CSV file in Informatica as Spark does below.
Case 1: Loading the CSV file with defaults and writing it into ORC
scala> val df = spark.read.format("csv").option("header","true").load("/spath/stack2.csv")
//Default schema used by Spark or Informatica for a CSV file: every column is a string
scala> df.printSchema
root
|-- ID: string (nullable = true)
|-- Course: string (nullable = true)
|-- Enrol_Date: string (nullable = true)
|-- Credits: string (nullable = true)
//Write the same CSV data into ORC
scala> df.write.format("orc").mode("overwrite").save("/spath/AP_ORC")
scala> val orc = spark.read.format("orc").load("/spath/AP_ORC")
//The ORC schema is the same as that of the source CSV file (all strings)
scala> orc.printSchema
root
|-- ID: string (nullable = true)
|-- Course: string (nullable = true)
|-- Enrol_Date: string (nullable = true)
|-- Credits: string (nullable = true)
Case 2: Transforming/inferring the schema data types for the CSV file and writing it into ORC
//Infer the schema in Spark, or transform/cast the CSV data in Informatica
scala> val df = spark.read.format("csv").option("header","true").option("inferschema", "true").load("/spath/stack2.csv")
//Inferred schema: ID and Credits are now integers
scala> df.printSchema
root
|-- ID: integer (nullable = true)
|-- Course: string (nullable = true)
|-- Enrol_Date: string (nullable = true)
|-- Credits: integer (nullable = true)
//Write the same CSV data into ORC
scala> df.write.format("orc").mode("overwrite").save("/spath/AP_ORC")
scala> val orc = spark.read.format("orc").load("/spath/AP_ORC")
//The ORC schema is the same as the inferred schema of the source CSV data
scala> orc.printSchema
root
|-- ID: integer (nullable = true)
|-- Course: string (nullable = true)
|-- Enrol_Date: string (nullable = true)
|-- Credits: integer (nullable = true)
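Case 2 relies on inference, which requires an extra pass over the data and may guess a type you did not intend. A Struct (explicit schema) can be supplied instead. The following is a minimal sketch under stated assumptions: the CSV content and the temporary paths are made up so that the snippet runs on its own (in practice you would use your own paths, such as /spath/stack2.csv), and Enrol_Date is deliberately left as a string because its date format is not known.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// In spark-shell a session named `spark` already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("csv-schema-to-orc").master("local[*]").getOrCreate()

// Hypothetical CSV file and output directory, created only to keep the sketch self-contained.
val csvPath = Files.createTempFile("stack2", ".csv")
Files.write(csvPath, "ID,Course,Enrol_Date,Credits\n1,Maths,2019-01-10,4\n2,Physics,2019-02-15,3\n".getBytes("UTF-8"))
val orcPath = Files.createTempDirectory("ap_orc").resolve("out").toString

// Explicit schema: only ID and Credits differ from the default string type.
val schema = StructType(Seq(
  StructField("ID", IntegerType, nullable = true),
  StructField("Course", StringType, nullable = true),
  StructField("Enrol_Date", StringType, nullable = true),
  StructField("Credits", IntegerType, nullable = true)
))

// Apply the schema while reading the CSV, then write ORC and read it back.
val df = spark.read.format("csv").option("header", "true").schema(schema).load(csvPath.toString)
df.write.format("orc").mode("overwrite").save(orcPath)

val orc = spark.read.format("orc").load(orcPath)
orc.printSchema()
```

The ORC file read back at the end carries the data types declared in the schema, in the same way that it carried the inferred types in Case 2.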