I am new to Spark and Scala. I have a JSON array struct as input, similar to the schema below.
root
|-- entity: struct (nullable = true)
| |-- email: string (nullable = true)
| |-- primaryAddresses: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- postalCode: string (nullable = true)
| | | |-- streetAddress: struct (nullable = true)
| | | | |-- line1: string (nullable = true)
I flattened the array struct to the sample DataFrame below:
+-------------+--------------------------------------+--------------------------------------+
|entity.email |entity.primaryAddresses[0].postalCode |entity.primaryAddresses[1].postalCode |....
+-------------+--------------------------------------+--------------------------------------+
|[email protected] | | |
|[email protected] | |12345 |
|[email protected] |12345 | |
|[email protected] |0 |0 |
+-------------+--------------------------------------+--------------------------------------+
My end goal is to calculate presence/absence/zero counts for each of the columns for data quality metrics. But before I calculate the data quality metrics, I am looking for an approach to derive one new column for each of the array column elements, such that:
- if all values of a particular array element are empty, then the derived column is empty for that element
- if at least one value is present for an array element, the element presence is marked as 1
- if all values of an array element are zero, then I mark the element as zero (I calibrate this as presence = 1 and zero = 1 when I calculate data quality later)
Below is a sample intermediate DataFrame that I am trying to achieve, with one column derived for each of the array elements. The original array-element columns are dropped.
+-------------+--------------------------------------+
|entity.email |entity.primaryAddresses.postalCode |.....
+-------------+--------------------------------------+
|[email protected] | |
|[email protected] |1 |
|[email protected] |1 |
|[email protected] |0 |
+-------------+--------------------------------------+
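For reference, this is roughly how I picture the rule for a single element group. It is only a sketch: deriveElement is just a name I made up, and I am assuming "empty" means null or a blank string.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Rule for one element group, given the flattened columns for that element
// (e.g. entity.primaryAddresses[0].postalCode, entity.primaryAddresses[1].postalCode, ...)
def deriveElement(cols: Seq[Column]): Column = {
  // a value is "present" when it is neither null nor blank
  val present    = cols.map(c => c.isNotNull && trim(c) =!= "")
  val anyPresent = present.reduce(_ || _)
  // every present value is literally "0"
  val allZero    = cols.zip(present).map { case (c, p) => !p || (c === "0") }.reduce(_ && _)

  when(!anyPresent, lit(null).cast("string")) // all values empty -> derived column is empty
    .when(allZero, lit("0"))                  // all present values are zero -> 0
    .otherwise(lit("1"))                      // at least one real value -> 1
}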
The elements of the input JSON records are dynamic and can change. To derive a column for each array element I build a Scala map whose key is the column name without the array index (example: entity.primaryAddresses.postalCode) and whose value is the list of array-element columns to run the rules on for that key. I am looking for an approach to achieve the above intermediate DataFrame.
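Something along these lines is what I have in mind for applying that map, reusing the deriveElement sketch above and assuming elementMap is the map described here and flat is the flattened DataFrame (the backticks are needed because the flattened column names contain dots and brackets):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// key: derived column name, value: the flattened array-element columns it covers
val elementMap: Map[String, Seq[String]] = Map(
  "entity.primaryAddresses.postalCode" -> Seq(
    "entity.primaryAddresses[0].postalCode",
    "entity.primaryAddresses[1].postalCode"))

def deriveAll(flat: DataFrame): DataFrame = {
  val derived = elementMap.toSeq.map { case (newName, srcCols) =>
    deriveElement(srcCols.map(c => col(s"`$c`"))).as(newName)
  }
  // keep the non-array columns and replace the indexed columns with the derived ones
  flat.select(col("`entity.email`") +: derived: _*)
}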
One concern is that for certain input files, after I flatten the DataFrame the column count exceeds 70k. And since the record count is expected to be in the millions, I am wondering whether, instead of flattening the JSON, I should explode each of the elements for better performance.
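A rough sketch of the explode alternative I am considering, assuming the raw (unflattened) DataFrame is raw, email (or some record id) identifies a record, and only postalCode is shown:

import org.apache.spark.sql.functions._

// One row per address instead of one column per index, so the column count stays small.
val exploded = raw.select(
  col("entity.email").as("email"),
  explode_outer(col("entity.primaryAddresses")).as("addr"))

val flagged = exploded.select(
  col("email"),
  when(col("addr.postalCode").isNotNull && trim(col("addr.postalCode")) =!= "", 1)
    .otherwise(0).as("present"),
  when(col("addr.postalCode") === "0", 1).otherwise(0).as("is_zero"))

// Aggregate back to one row per record to rebuild the derived column.
val derived = flagged.groupBy("email").agg(
  when(max(col("present")) === 0, lit(null).cast("string"))                // all empty
    .when(min(when(col("present") === 1, col("is_zero"))) === 1, lit("0")) // all present values zero
    .otherwise(lit("1"))
    .as("entity.primaryAddresses.postalCode"))

This would keep the schema at a fixed size no matter how many addresses a record has, at the cost of the shuffle introduced by the groupBy.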
Appreciate any ideas. Thank you.