DataFrame saveAsTable is persisting all the column values properly but insertInto function is not storing all the columns especially json data is truncated and sub-sequent column in not stored hive table.
Our Environment
- Spark 2.2.0
- EMR 5.10.0
- Scala 2.11.8
The sample data is
a8f11f90-20c9-11e8-b93e-2fc569d27605 efe5bdb3-baac-5d8e-6cae57771c13 Unknown E657F298-2D96-4C7D-8516-E228153FE010 NonDemarcated {"org-id":"efe5bdb3-baac-5d8e-6cae57771c13","nodeid":"N02c00056","parkingzoneid":"E657F298-2D96-4C7D-8516-E228153FE010","site-id":"a8f11f90-20c9-11e8-b93e-2fc569d27605","channel":1,"type":"Park","active":true,"tag":"","configured_date":"2017-10-23
23:29:11.20","vs":[5.0,1.7999999523162842,1.5]}
DF SaveAsTable
val spark = SparkSession.builder().appName("Spark SQL Test").
config("hive.exec.dynamic.partition", "true").
config("hive.exec.dynamic.partition.mode", "nonstrict").
enableHiveSupport().getOrCreate()
val zoneStatus = spark.table("zone_status")
zoneStatus.select(col("site-id"),col("org-id"), col("groupid"), col("zid"), col("type"), lit(0), col("config"), unix_timestamp().alias("ts")).
write.mode(SaveMode.Overwrite).saveAsTable("dwh_zone_status")
Stored data properly in result table:
a8f11f90-20c9-11e8-b93e-2fc569d27605 efe5bdb3-baac-5d8e-6cae57771c13 Unknown E657F298-2D96-4C7D-8516-E228153FE010 NonDemarcated 0 {"org-id":"efe5bdb3-baac-5d8e-6cae57771c13","nodeid":"N02c00056","parkingzoneid":"E657F298-2D96-4C7D-8516-E228153FE010","site-id":"a8f11f90-20c9-11e8-b93e-2fc569d27605","channel":1,"type":"Park","active":true,"tag":"","configured_date":"2017-10-23 23:29:11.20","vs":[5.0,1.7999999523162842,1.5]} 1520453589
DF insertInto
zoneStatus.
select(col("site-id"),col("org-id"), col("groupid"), col("zid"), col("type"), lit(0), col("config"), unix_timestamp().alias("ts")).
write.mode(SaveMode.Overwrite).insertInto("zone_status_insert")
But insertInto is not persisting all the contents. The json string is storing partially and sub-sequent column is not stored.
a8f11f90-20c9-11e8-b93e-2fc569d27605 efe5bdb3-baac-5d8e-6cae57771c13 Unknown E657F298-2D96-4C7D-8516-E228153FE010 NonDemarcated 0 {"org-id":"efe5bdb3-baac-5d8e-6cae57771c13" NULL
We are using insertInto functions in our projects and recently encountered when parsing json data to pull other metrics. We noticed that the config content is not stored fully. Planning to change to saveAsTable but we can avoid the code change, if any workaround available to add in spark configuration.