1 vote

DataFrame saveAsTable persists all column values properly, but the insertInto function does not store all of them: the JSON data is truncated and the subsequent column is not stored in the Hive table at all.

Our Environment

  • Spark 2.2.0
  • EMR 5.10.0
  • Scala 2.11.8

The sample data is:

 a8f11f90-20c9-11e8-b93e-2fc569d27605   efe5bdb3-baac-5d8e-6cae57771c13 Unknown E657F298-2D96-4C7D-8516-E228153FE010    NonDemarcated       {"org-id":"efe5bdb3-baac-5d8e-6cae57771c13","nodeid":"N02c00056","parkingzoneid":"E657F298-2D96-4C7D-8516-E228153FE010","site-id":"a8f11f90-20c9-11e8-b93e-2fc569d27605","channel":1,"type":"Park","active":true,"tag":"","configured_date":"2017-10-23 23:29:11.20","vs":[5.0,1.7999999523162842,1.5]}

DF SaveAsTable

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, lit, unix_timestamp}

val spark = SparkSession.builder().appName("Spark SQL Test").
  config("hive.exec.dynamic.partition", "true").
  config("hive.exec.dynamic.partition.mode", "nonstrict").
  enableHiveSupport().getOrCreate()

val zoneStatus = spark.table("zone_status")

zoneStatus.
  select(col("site-id"), col("org-id"), col("groupid"), col("zid"), col("type"), lit(0), col("config"), unix_timestamp().alias("ts")).
  write.mode(SaveMode.Overwrite).saveAsTable("dwh_zone_status")

saveAsTable stores the data properly in the result table:

a8f11f90-20c9-11e8-b93e-2fc569d27605    efe5bdb3-baac-5d8e-6cae57771c13 Unknown E657F298-2D96-4C7D-8516-E228153FE010    NonDemarcated   0   {"org-id":"efe5bdb3-baac-5d8e-6cae57771c13","nodeid":"N02c00056","parkingzoneid":"E657F298-2D96-4C7D-8516-E228153FE010","site-id":"a8f11f90-20c9-11e8-b93e-2fc569d27605","channel":1,"type":"Park","active":true,"tag":"","configured_date":"2017-10-23 23:29:11.20","vs":[5.0,1.7999999523162842,1.5]} 1520453589

DF insertInto

zoneStatus.
  select(col("site-id"),col("org-id"), col("groupid"), col("zid"), col("type"), lit(0), col("config"), unix_timestamp().alias("ts")).
  write.mode(SaveMode.Overwrite).insertInto("zone_status_insert")

But insertInto does not persist all of the contents: the JSON string is stored only partially and the subsequent column is not stored at all.

a8f11f90-20c9-11e8-b93e-2fc569d27605    efe5bdb3-baac-5d8e-6cae57771c13 Unknown E657F298-2D96-4C7D-8516-E228153FE010    NonDemarcated   0   {"org-id":"efe5bdb3-baac-5d8e-6cae57771c13"  NULL

We use insertInto throughout our projects and only recently ran into this issue while parsing the JSON data to pull out other metrics: the config content is not stored in full. We plan to switch to saveAsTable, but we could avoid the code change if there is a workaround that can be added to the Spark configuration.


2 Answers

0 votes

You can use the following alternative ways of inserting data into the table.

val zoneStatusDF = zoneStatus.
  select(col("site-id"),col("org-id"), col("groupid"), col("zid"), col("type"), lit(0), col("config"), unix_timestamp().alias("ts"))

zoneStatusDF.registerTempTable("zone_status_insert")
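
Note that registering a temporary table only makes the DataFrame queryable through SQL; a follow-up INSERT is still needed to persist the rows into the Hive table. A minimal sketch, assuming the target table zone_status_insert already exists; the view name zone_status_tmp is hypothetical and chosen so it does not shadow the Hive table:

// Register under a distinct temporary name, then insert through Spark SQL.
zoneStatusDF.createOrReplaceTempView("zone_status_tmp")

zoneStatusDF.sqlContext.sql("INSERT OVERWRITE TABLE zone_status_insert SELECT * FROM zone_status_tmp")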

Or

zoneStatus.sqlContext.sql("create table zone_status_insert as select * from zone_status")  
0 votes

The reason is that the table schema was created with

ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE

The config JSON itself contains commas, so with that row format the commas inside the JSON are treated as field delimiters: the config column is cut off at the first comma and the leftover text cannot be parsed into the following column, which comes back as NULL. After removing ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' from the table definition, insertInto is able to save the entire contents.
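
For example, here is a sketch of a replacement table definition, wrapped in spark.sql so it can run from the same session. The column names and types are only inferred from the sample rows (the status and ts names in particular are hypothetical), and Parquet is just one storage format that avoids the comma-delimited text layout:

// Hypothetical DDL sketch: columns inferred from the sample data; adjust to the real schema.
// Any format without FIELDS TERMINATED BY ',' (Parquet shown here) keeps the
// comma-containing JSON value intact in the config column.
spark.sql("""
  CREATE TABLE zone_status_insert (
    `site-id` STRING,
    `org-id`  STRING,
    groupid   STRING,
    zid       STRING,
    type      STRING,
    status    INT,
    config    STRING,
    ts        BIGINT
  )
  STORED AS PARQUET
""")

Also keep in mind that insertInto matches columns by position rather than by name, so the select order in the DataFrame must match the column order of the table.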