
Is there a general way to change the nullable property of every element in a given StructType? The StructType may be nested.

@eliasah marked this as a duplicate of "Spark Dataframe column nullable property change", but the two questions are different: that answer handles only one level and does not work for a hierarchical (nested) StructType.

For example, given this schema:

 root
 |-- user_id: string (nullable = false)
 |-- name: string (nullable = false)
 |-- system_process: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- timestamp: long (nullable = false)
 |    |    |-- process: string (nullable = false)
 |-- type: string (nullable = false)
 |-- user_process: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- timestamp: long (nullable = false)
 |    |    |-- process: string (nullable = false)

I want to change nullable to true for all elements, so the result should be:

 root
 |-- user_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- system_process: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- timestamp: long (nullable = true)
 |    |    |-- process: string (nullable = true)
 |-- type: string (nullable = true)
 |-- user_process: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- timestamp: long (nullable = true)
 |    |    |-- process: string (nullable = true)

Here is a sample StructType schema as JSON, for convenient testing:

import org.apache.spark.sql.types.{DataType, StructType}

val jsonSchema="""{"type":"struct","fields":[{"name":"user_id","type":"string","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":false,"metadata":{}},{"name":"system_process","type":{"type":"array","elementType":{"type":"struct","fields":[{"name":"timestamp","type":"long","nullable":false,"metadata":{}},{"name":"process_id","type":"string","nullable":false,"metadata":{}}]},"containsNull":false},"nullable":false,"metadata":{}},{"name":"type","type":"string","nullable":false,"metadata":{}},{"name":"user_process","type":{"type":"array","elementType":{"type":"struct","fields":[{"name":"timestamp","type":"long","nullable":false,"metadata":{}},{"name":"process_id","type":"string","nullable":false,"metadata":{}}]},"containsNull":false},"nullable":false,"metadata":{}}]}"""
DataType.fromJson(jsonSchema).asInstanceOf[StructType].printTreeString()

1 Answer


I finally figured out two solutions:

  1. A string trick: rewrite the nullability flags in the schema's JSON, then create a StructType instance from the modified JSON string

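    // Also flips containsNull / valueContainsNull so that array and map elements become nullable
    DataType.fromJson(schema.json.replaceAll("\"(nullable|containsNull|valueContainsNull)\":false", "\"$1\":true")).asInstanceOf[StructType]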
    
  2. Recursive approach

      import org.apache.spark.sql.types.{ArrayType, StructType}

      def updateFieldsToNullable(structType: StructType): StructType = {
        StructType(structType.map(f => f.dataType match {
          case d: ArrayType =>
            // Recurse into struct elements; other element types are kept as they are
            val element = d.elementType match {
              case s: StructType => updateFieldsToNullable(s)
              case other => other
            }
            // containsNull = true so that the array elements themselves may be null
            f.copy(nullable = true, dataType = ArrayType(element, containsNull = true))
          case s: StructType => f.copy(nullable = true, dataType = updateFieldsToNullable(s))
          case _ => f.copy(nullable = true)
        }))
      }
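
The recursion above only descends into structs and arrays of structs, so a map column or an array of arrays would keep its inner flags. If your schema has those, a variant that recurses on DataType instead of StructType may help. This is only a sketch: `toNullable` is a name I made up for illustration, and the small `strict` schema is a cut-down stand-in for the one in the question. It can be pasted into spark-shell.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper (not part of Spark's public API): returns a copy of the
// given type with every nullability flag reachable from it set to true.
def toNullable(dt: DataType): DataType = dt match {
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = toNullable(f.dataType), nullable = true)))
  case ArrayType(elementType, _) =>
    ArrayType(toNullable(elementType), containsNull = true)
  case MapType(keyType, valueType, _) =>
    // Map keys cannot be null in Spark, so only the value flag is relaxed.
    MapType(toNullable(keyType), toNullable(valueType), valueContainsNull = true)
  case other => other // primitive types carry no nullability flag of their own
}

// Cut-down version of the question's schema, used only for illustration.
val strict = StructType(Seq(
  StructField("user_id", StringType, nullable = false),
  StructField("system_process", ArrayType(
    StructType(Seq(StructField("timestamp", LongType, nullable = false))),
    containsNull = false), nullable = false)
))

val relaxed = toNullable(strict).asInstanceOf[StructType]
relaxed.printTreeString()
```

To apply the relaxed schema to an existing DataFrame `df`, one option is `spark.createDataFrame(df.rdd, relaxed)`, assuming `spark` is your SparkSession.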