I have a use case where I have to load millions of JSON-formatted messages into Apache Hive tables. My plan was simple: load them into a DataFrame, write them out as Parquet files, and then create an external table on top of them.
I am using Apache Spark 2.1.0 with Scala 2.11.8.
It so happens that all the messages follow a somewhat flexible schema. For example, a column "amount" can have the value 1.0 or 1.
Since I am transforming data from a semi-structured format to a structured one, but the schema varies slightly, I figured the inferSchema option for data sources like JSON would take care of this:
spark.read.option("inferSchema","true").json(RDD[String])
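For context, the whole job looks roughly like the sketch below; messagesRdd, the app name and the paths are placeholders, not my real names.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("JsonToParquet")        // placeholder name
  .enableHiveSupport()
  .getOrCreate()

// messagesRdd stands in for the RDD[String] of raw JSON messages
val messagesRdd: RDD[String] = spark.sparkContext.textFile("/path/to/raw/json")

// read the JSON and let Spark infer the schema
val df = spark.read.option("inferSchema", "true").json(messagesRdd)

// write out as Parquet; the external Hive table is created on this directory afterwards
df.write.parquet("/path/to/parquet/output")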
When schema inference is enabled while reading the JSON data:
case 1: for smaller datasets, all the Parquet files have amount as double.
case 2: for larger datasets, some Parquet files have amount as double and others have int64.
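Roughly how I checked the types (the part-file names below are placeholders for actual files in the output directory):

// the inferred type of "amount" differs between individual part files
println(spark.read.parquet("/path/to/parquet/output/part-00000").schema("amount").dataType)  // DoubleType
println(spark.read.parquet("/path/to/parquet/output/part-00042").schema("amount").dataType)  // LongType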
While debugging I came across concepts like schema evolution and schema merging, which went over my head and left me with more doubts than answers.
My doubts/questions are:
When Spark infers the schema, does it not enforce the inferred schema on the full dataset?
Since my constraints keep me from enforcing any fixed schema, I am thinking of casting the whole "amount" column to double, since it can hold both integers and decimal numbers (roughly as in the sketch after these questions). Is there a simpler way?
My guess is that, since the data is partitioned, schema inference works per partition and then gives me a general schema, but it does not actually enforce that schema or anything of that sort. Please correct me if I am wrong.
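For reference, this is the kind of cast to double I mentioned above; df is the DataFrame from the first sketch and the path is a placeholder:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// cast "amount" to double right after inference, so every Parquet file
// ends up with the same type regardless of what was inferred for that run
val normalized = df.withColumn("amount", col("amount").cast(DoubleType))
normalized.write.parquet("/path/to/parquet/output")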
Note: The reason I am using schema inference is that the incoming data is too flexible/variable to enforce a case class of my own, even though some of the columns are mandatory (a rough sketch of the explicit schema I decided against is below). If you have a simpler solution, please suggest it.
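For completeness, this is the kind of explicit schema I decided against; apart from amount, the field names here are made up just to show the shape:

import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// a hand-written schema would pin "amount" to double, but the real messages
// have far more (and more variable) fields than I can maintain by hand
val fixedSchema = StructType(Seq(
  StructField("id", StringType, nullable = false),          // made-up mandatory column
  StructField("amount", DoubleType, nullable = true),
  StructField("description", StringType, nullable = true)   // made-up optional column
))

// messagesRdd is the same RDD[String] as in the first sketch
val dfWithFixedSchema = spark.read.schema(fixedSchema).json(messagesRdd)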