3 votes

I have a JSON file with this type of schema:

{
  "name": "john doe",
  "phone-numbers": {
    "home": ["1111", "222"],
    "country": "England"
  }
}

The home phone numbers array could sometimes be empty.

My Spark application receives a list of these JSONs and does this:

val dataframe = spark.read.json(filePaths: _*)
val result = dataframe.select($"name", 
                               explode(dataframe.col("phone-numbers.home")))

When the 'home' array is empty, I receive the following error when I try to explode it:

org.apache.spark.sql.AnalysisException: cannot resolve 'phone-numbers['home']' due to data type mismatch: argument 2 requires integral type, however, ''home'' is of string type.;;

Is there an elegant way to prevent Spark from exploding this field if it's empty or null?

What's the Spark version? – Jacek Laskowski
What would you want to have as a result of no or empty home numbers? – Jacek Laskowski

2 Answers

2 votes

In Spark there's a class called DataFrameNaFunctions, which is specialized for working with missing data in DataFrames.

This class contains three essential methods: drop, replace, and fill.

To use these methods, call df.na, which returns a DataFrameNaFunctions instance for your DataFrame; then apply one of the three methods, and it returns your DataFrame with the specified operation applied.
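For example (a minimal sketch; the column names, fill values, and replacement map below are illustrative, not from the question):

// drop: remove rows containing null values
val dropped = df.na.drop()

// fill: substitute nulls with a default (0 for numeric columns, "unknown" for string columns)
val filled = df.na.fill(0).na.fill("unknown")

// replace: map specific values to new ones in a given column (hypothetical "country" column)
val replaced = df.na.replace("country", Map("UK" -> "England"))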

To resolve your problem you can use something like this:

val dataframe = spark.read.json(filePaths: _*)
val result = dataframe.na.drop().select($"name",
                           explode(dataframe.col("phone-numbers.home")))
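Note that drop() with no arguments removes every row that has a null in any column; na.drop also accepts a list of column names if you only want to check specific ones (a minimal sketch, using the top-level "name" column for illustration):

val targeted = dataframe.na.drop(Seq("name"))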

Hope this helps. Best regards.

2 votes

The problem is not empty arrays ("home": []) but arrays that are null ("home": null), which do not work with explode.

So either filter out the null values first:

val result = df
   .filter($"phone-numbers.home".isNotNull)
   .select($"name", explode($"phone-numbers.home"))

or replace the null values with an empty array (which I would prefer in your situation):

val nullToEmptyArr = udf(
   // Spark passes array columns into a UDF as a Seq, not an Array;
   // the sample data stores the numbers as JSON strings, hence Seq[String]
   (arr: Seq[String]) => if (arr == null) Seq.empty[String] else arr
)

val result = df
  // withColumn cannot replace a nested field (a dotted name would just create a
  // new top-level column), so materialize a cleaned copy of the field instead
  .withColumn("home", nullToEmptyArr($"phone-numbers.home"))
  .select($"name", explode($"home"))