I have a JSON like
{ 1234 : "blah1", 9807: "blah2", 467: "blah_k", ...}
written to a gzipped file. It is a mapping of one ID space to another where the keys are int
s and values are string
s.
I want to load it as a DataFrame
in Spark.
I loaded it as,
val df = spark.read.format("json").load("my_id_file.json.gz")
By default, Spark loaded it with a schema that looks like
|-- 1234: string (nullable = true)
|-- 9807: string (nullable = true)
|-- 467: string (nullable = true)
Instead, I want to my DataFrame
to look like
+----+------+
|id1 |id2 |
+----+------+
|1234|blah1 |
|9007|blah2 |
|467 |blah_k|
+----+------+
So, I tried the following.
import org.apache.spark.sql.types._
val idMapSchema = StructType(Array(StructField("id1", IntegerType, true), StructField("id2", StringType, true)))
val df = spark.read.format("json").schema(idMapSchema).load("my_id_file.json.gz")
However, the loaded data frame looks like
scala> df.show
+----+----+
|id1 |id2 |
+----+----+
|null|null|
+----+----+
How can I specify the schema to fix this? Is there a "pure" dataframe approach (without creating an RDD and then creating DataFrame)?