2 votes

One of the JSON fields (AGE below) is meant to be a number but is represented as null, and it comes up as a string in the DataFrame's printSchema output.

Input JSON file

{"AGE":null,"NAME":"abc","BATCH":190}
{"AGE":null,"NAME":"abc","BATCH":190}

Spark Code and output

val df = spark.read.json("/home/white/tmp/a.json")
df.printSchema()
df.show()

*********************
OUTPUT
*********************
root
 |-- BATCH: long (nullable = true)
 |-- AGE: string (nullable = true)
 |-- NAME: string (nullable = true)

+-----+----+----+
|BATCH| AGE|NAME|
+-----+----+----+
|  190|null| abc|
|  190|null| abc|
+-----+----+----+

I want AGE to be a long. Currently I achieve this by creating a new StructType with the AGE field as LongType and recreating the DataFrame with df.sqlContext.createDataFrame( df.rdd, newSchema ). Can I get this done directly through the spark.read.json API?
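
For reference, a minimal sketch of that workaround (the newSchema construction here is my own; it works in this case only because every AGE value is null, so no string-to-long data conversion is attempted):

import org.apache.spark.sql.types._

val df = spark.read.json("/home/white/tmp/a.json")

// Rebuild the schema with AGE as LongType, keeping every other field as inferred
val newSchema = StructType(df.schema.map {
  case StructField("AGE", _, nullable, metadata) => StructField("AGE", LongType, nullable, metadata)
  case other => other
})

val fixed = df.sqlContext.createDataFrame(df.rdd, newSchema)
fixed.printSchema()  // AGE is now reported as long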

3
I should add: how do I represent an integer value as null in JSON so Spark can understand that a particular field is an integer? - xstack2000
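
As an aside on the comment above, a minimal sketch (mine, assuming Spark 2.2+ where read.json accepts a Dataset[String]): a single non-null numeric value is enough for the inference to settle on long for the whole column:

import spark.implicits._

// Hypothetical sample data: the second record carries a real number for AGE
val sample = Seq(
  """{"AGE":null,"NAME":"abc","BATCH":190}""",
  """{"AGE":25,"NAME":"def","BATCH":191}"""
).toDS()

spark.read.json(sample).printSchema()
// root
//  |-- AGE: long (nullable = true)
//  |-- BATCH: long (nullable = true)
//  |-- NAME: string (nullable = true)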

3 Answers

1 vote

I think the easiest way to do this is as follows:

import org.apache.spark.sql.types.LongType  // the 'AGE symbol syntax also needs import spark.implicits._ outside the shell

spark.read.json("/home/white/tmp/a.json").withColumn("AGE", 'AGE.cast(LongType))

This produces the following schema:

root
 |-- AGE: long (nullable = true)
 |-- BATCH: long (nullable = true)
 |-- NAME: string (nullable = true)

Spark makes a best guess at types, and it makes sense that it sees the null in the JSON and infers string, since String lies on the nullable AnyRef side of the Scala object hierarchy while Long lies on the non-nullable AnyVal side. You just need to cast the column to make Spark treat your data as you see fit.

Incidentally, why are you using Long rather than Int for ages? Those people must eat very healthy.

0 votes

You can define a case class and map the result of read.json onto it with as[...]. This gives you a Dataset (not a DataFrame).

import spark.implicits._  // provides the Encoder for the case class

case class Person(batch: Long, age: Long, name: String)
val df = spark.read.json("/home/white/tmp/a.json").as[Person]

Reference: http://spark.apache.org/docs/latest/sql-programming-guide.html#creating-datasets

Another option is to write your own input reader instead of using the standard JSON reader. The last option, which you are already doing, is to add an extra step that converts the types after reading.
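
For that last option, a minimal sketch of the conversion step combined with the case class (my own code, not from the answer): the inferred AGE column is a string, and the as[Person] up-cast from string to bigint may be rejected by Spark, so casting first is safer:

import org.apache.spark.sql.types.LongType
import spark.implicits._

case class Person(batch: Long, age: Long, name: String)

// Cast AGE to long first, then map the rows onto the case class
val ds = spark.read.json("/home/white/tmp/a.json")
  .withColumn("AGE", $"AGE".cast(LongType))
  .as[Person]

ds.printSchema()  // AGE: long (nullable = true)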

0 votes

If you already know which types are there, I recommend reading with a predefined schema.

import org.apache.spark.sql.types._
val schema = StructType(List(
    StructField("AGE", LongType, nullable = true),
    StructField("BATCH", LongType, nullable = true),
    StructField("NAME", StringType, nullable = true)
))

spark.read.schema(schema).json("/home/white/tmp/a.json")
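
With the schema declared up front there is no inference pass, so a quick check (the df name is mine) reports AGE as long straight from the reader:

val df = spark.read.schema(schema).json("/home/white/tmp/a.json")
df.printSchema()
// root
//  |-- AGE: long (nullable = true)
//  |-- BATCH: long (nullable = true)
//  |-- NAME: string (nullable = true)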