3 votes

I have a JSON file with this type of schema:

{
  "name": "john doe",
  "phone-numbers": {
    "home": ["1111", "222"],
    "country": "England"
  }
}

The home phone numbers array could sometimes be empty.

My Spark application receives a list of these JSONs and does this:

val dataframe = spark.read.json(filePaths: _*)
val result = dataframe.select($"name", 
                               explode(dataframe.col("phone-numbers.home")))

When the 'home' array is empty, I receive the following error when I try to explode it:

org.apache.spark.sql.AnalysisException: cannot resolve 'phone-numbers['home']' due to data type mismatch: argument 2 requires integral type, however, ''home'' is of string type.;;

Is there an elegant way to prevent Spark from exploding this field if it's empty or null?

What's the Spark version? – Jacek Laskowski
What would you want to have as a result of no or empty home numbers? – Jacek Laskowski

2 Answers

2 votes

In Spark there's a class called DataFrameNaFunctions, which is specialized for working with missing data in DataFrames.

This class contains three essential methods: drop, replace, and fill.

To use these methods, call df.na, which returns a DataFrameNaFunctions instance for your DataFrame; then apply one of the three methods, and it returns your DataFrame with the specified operation applied.
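For example (a minimal sketch; the column names, fill values, and replacement map below are illustrative, not from the question):

// drop: remove rows containing null values
val dropped = df.na.drop()

// fill: substitute nulls with a default (0 for numeric columns, "unknown" for string columns)
val filled = df.na.fill(0).na.fill("unknown")

// replace: map specific values to new ones in a given column (hypothetical "country" column)
val replaced = df.na.replace("country", Map("UK" -> "England"))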

To resolve your problem you can use something like this:

val dataframe = spark.read.json(filePaths: _*)
val result = dataframe.na.drop().select($"name",
                           explode(dataframe.col("phone-numbers.home")))
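Note that drop() with no arguments removes every row that has a null in any column; na.drop also accepts a list of column names if you only want to check specific ones (a minimal sketch, using the top-level "name" column for illustration):

val targeted = dataframe.na.drop(Seq("name"))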

Hope this helps. Best regards.

2 votes

The problem is not empty arrays ("home": []) but arrays that are null ("home": null), which do not work with explode.

So either filter out the null values first:

val result = df
   .filter($"phone-numbers.home".isNotNull)
   .select($"name", explode($"phone-numbers.home"))

or replace the null values with an empty array (which I would prefer in your situation):

val nullToEmptyArr = udf(
   // Spark passes array columns into a UDF as a Seq, not an Array;
   // the sample data stores the numbers as JSON strings, hence Seq[String]
   (arr: Seq[String]) => if (arr == null) Seq.empty[String] else arr
)

val result = df
  // withColumn cannot replace a nested field (a dotted name would just create a
  // new top-level column), so materialize a cleaned copy of the field instead
  .withColumn("home", nullToEmptyArr($"phone-numbers.home"))
  .select($"name", explode($"home"))