Spark: correct schema to load JSON as DataFrame

Question

I have a JSON like

{ 1234 : "blah1", 9807: "blah2", 467: "blah_k", ...}

written to a gzipped file. It is a mapping of one ID space to another where the keys are ints and values are strings.

I want to load it as a DataFrame in Spark.

I loaded it with:

val df = spark.read.format("json").load("my_id_file.json.gz")

By default, Spark loaded it with a schema that looks like

 |-- 1234: string (nullable = true)
 |-- 9807: string (nullable = true)
 |-- 467: string (nullable = true)

Instead, I want my DataFrame to look like

+----+------+
|id1 |id2   |
+----+------+
|1234|blah1 |
|9807|blah2 |
|467 |blah_k|
+----+------+

So, I tried the following.

import org.apache.spark.sql.types._
val idMapSchema = StructType(Array(StructField("id1", IntegerType, true), StructField("id2", StringType, true)))

val df = spark.read.format("json").schema(idMapSchema).load("my_id_file.json.gz")

However, the loaded data frame looks like

scala> df.show
+----+----+
|id1 |id2 |
+----+----+
|null|null|
+----+----+

How can I specify the schema to fix this? Is there a "pure" DataFrame approach (without creating an RDD and then converting it to a DataFrame)?

Prasad Khode · Accepted Answer · 2018-09-12T06:30:05

One way to achieve this is to read the input file with textFile, turn each key/value pair of every JSON object into its own small JSON record inside a flatMap(), and then convert the result to a DataFrame:

// Assumes JSONObject is org.json.JSONObject (the library is not named in this answer).
import org.json.JSONObject
import scala.collection.JavaConverters._

// Each input line holds one JSON object; emit one JSON string per key/value pair.
val rdd = sparkSession.sparkContext.textFile("my_input_file_path")
  .flatMap { row =>
    val inputJson = new JSONObject(row)

    inputJson.keySet().asScala.toSeq.map { key =>
      val resultJson = new JSONObject()
      resultJson.put("col1", key)
      resultJson.put("col2", inputJson.get(key))
      resultJson.toString()
    }
  }

// Let Spark infer the two-column schema from the reshaped records.
val df = sparkSession.read.json(rdd)
df.printSchema()
df.show(false)

output:

root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)

+----+------+
|col1|col2  |
+----+------+
|1234|blah1 |
|467 |blah_k|
|9807|blah2 |
+----+------+
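
The question also asks for a DataFrame-only route. One possible sketch, not part of the original answer, is to read each line as plain text, parse it into a map with from_json, and explode the map into rows. It assumes a Spark version whose from_json accepts a MapType schema, that every line of the file is a single JSON object, and that the keys are quoted strings in the actual file. The in-memory `raw` DataFrame below is a stand-in for `spark.read.text("my_id_file.json.gz")`, and the names `raw` and `idMap` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{IntegerType, MapType, StringType}

// In spark-shell a session named `spark` already exists; this line just makes the sketch self-contained.
val spark = SparkSession.builder().master("local[*]").appName("id-map-sketch").getOrCreate()
import spark.implicits._

// Stand-in for spark.read.text("my_id_file.json.gz"): one JSON object per row,
// in a single string column named "value". Keys are assumed to be quoted.
val raw = Seq("""{"1234": "blah1", "9807": "blah2", "467": "blah_k"}""").toDF("value")

// Parse each line into a map<string,string>, then explode it into one row per entry.
// explode on a map column produces two columns named "key" and "value".
val idMap = raw
  .select(explode(from_json(col("value"), MapType(StringType, StringType))))
  .select(col("key").cast(IntegerType).as("id1"), col("value").as("id2"))

idMap.printSchema()
idMap.show(false)
```

The reason the explicit (id1, id2) schema in the question returns nulls is that the JSON reader matches schema fields against the keys of each object, and no key is literally named "id1" or "id2". Treating the whole object as a map avoids having to know the keys in advance.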
