2 votes

One of the JSON fields (AGE below) is meant to be a number but is represented as null, and it comes up as a string in the DataFrame's printSchema output.

Input JSON file

{"AGE":null,"NAME":"abc","BATCH":190}
{"AGE":null,"NAME":"abc","BATCH":190}

Spark Code and output

val df = spark.read.json("/home/white/tmp/a.json")
df.printSchema()
df.show()

*********************
OUTPUT
*********************
root
 |-- BATCH: long (nullable = true)
 |-- AGE: string (nullable = true)
 |-- NAME: string (nullable = true)

+-----+----+----+
|BATCH| AGE|NAME|
+-----+----+----+
|  190|null| abc|
|  190|null| abc|
+-----+----+----+

I want AGE to be a long. Currently I achieve this by creating a new StructType with the AGE field as LongType and recreating the DataFrame with df.sqlContext.createDataFrame( df.rdd, newSchema ). Can I get this done directly through the spark.read.json API?
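
For reference, a minimal sketch of that workaround (the newSchema construction here is my own; it works in this case only because every AGE value is null, so no string-to-long data conversion is attempted):

import org.apache.spark.sql.types._

val df = spark.read.json("/home/white/tmp/a.json")

// Rebuild the schema with AGE as LongType, keeping every other field as inferred
val newSchema = StructType(df.schema.map {
  case StructField("AGE", _, nullable, metadata) => StructField("AGE", LongType, nullable, metadata)
  case other => other
})

val fixed = df.sqlContext.createDataFrame(df.rdd, newSchema)
fixed.printSchema()  // AGE is now reported as long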

3
I should add: how do I represent an integer value as null in JSON so Spark can understand that a particular field is an integer? - xstack2000
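
As an aside on the comment above, a minimal sketch (mine, assuming Spark 2.2+ where read.json accepts a Dataset[String]): a single non-null numeric value is enough for the inference to settle on long for the whole column:

import spark.implicits._

// Hypothetical sample data: the second record carries a real number for AGE
val sample = Seq(
  """{"AGE":null,"NAME":"abc","BATCH":190}""",
  """{"AGE":25,"NAME":"def","BATCH":191}"""
).toDS()

spark.read.json(sample).printSchema()
// root
//  |-- AGE: long (nullable = true)
//  |-- BATCH: long (nullable = true)
//  |-- NAME: string (nullable = true)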

3 Answers

1 vote

I think the easiest way to do this is as follows:

import org.apache.spark.sql.types.LongType  // the 'AGE symbol syntax also needs import spark.implicits._ outside the shell

spark.read.json("/home/white/tmp/a.json").withColumn("AGE", 'AGE.cast(LongType))

This produces the following schema:

root
 |-- AGE: long (nullable = true)
 |-- BATCH: long (nullable = true)
 |-- NAME: string (nullable = true)

Spark makes a best guess at types, and it makes sense that it sees the null in the JSON and infers string, since String lies on the nullable AnyRef side of the Scala object hierarchy while Long lies on the non-nullable AnyVal side. You just need to cast the column to make Spark treat your data as you see fit.

Incidentally, why are you using Long rather than Int for ages? Those people must eat very healthy.

0 votes

You can define a case class and map the result of read.json onto it with as[...]. This gives you a Dataset (not a DataFrame).

import spark.implicits._  // provides the Encoder for the case class

case class Person(batch: Long, age: Long, name: String)
val df = spark.read.json("/home/white/tmp/a.json").as[Person]

Reference: http://spark.apache.org/docs/latest/sql-programming-guide.html#creating-datasets

Another option is to write your own input reader instead of using the standard JSON reader. The last option, which you are already doing, is to add an extra step that converts the types after reading.
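
For that last option, a minimal sketch of the conversion step combined with the case class (my own code, not from the answer): the inferred AGE column is a string, and the as[Person] up-cast from string to bigint may be rejected by Spark, so casting first is safer:

import org.apache.spark.sql.types.LongType
import spark.implicits._

case class Person(batch: Long, age: Long, name: String)

// Cast AGE to long first, then map the rows onto the case class
val ds = spark.read.json("/home/white/tmp/a.json")
  .withColumn("AGE", $"AGE".cast(LongType))
  .as[Person]

ds.printSchema()  // AGE: long (nullable = true)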

0 votes

If you already know which types are there, I recommend reading with a predefined schema.

import org.apache.spark.sql.types._
val schema = StructType(List(
    StructField("AGE", LongType, nullable = true),
    StructField("BATCH", LongType, nullable = true),
    StructField("NAME", StringType, nullable = true)
))

spark.read.schema(schema).json("/home/white/tmp/a.json")
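
With the schema declared up front there is no inference pass, so a quick check (the df name is mine) reports AGE as long straight from the reader:

val df = spark.read.schema(schema).json("/home/white/tmp/a.json")
df.printSchema()
// root
//  |-- AGE: long (nullable = true)
//  |-- BATCH: long (nullable = true)
//  |-- NAME: string (nullable = true)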