I want to filter a Spark sql.DataFrame, keeping only the wanted array elements, without any knowledge of the whole schema beforehand (I don't want to hardcode it). Schema:
root
|-- callstartcelllabel: string (nullable = true)
|-- calltargetcelllabel: string (nullable = true)
|-- measurements: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- enodeb: string (nullable = true)
| | |-- label: string (nullable = true)
| | |-- ltecelloid: long (nullable = true)
|-- networkcode: long (nullable = true)
|-- ocode: long (nullable = true)
|-- startcelllabel: string (nullable = true)
|-- startcelloid: long (nullable = true)
|-- targetcelllabel: string (nullable = true)
|-- targetcelloid: long (nullable = true)
|-- timestamp: long (nullable = true)
I want each whole root row with only the matching measurements, and a row must keep at least one measurement after filtering.
I have a DataFrame with this schema, and a second, one-column DataFrame of filtering values.
For example: I only know that my root contains a measurements array and that its elements have a label field. I want each whole root row, with its measurements array reduced to the elements whose label is in ("label1", "label2").
My last attempt with explode and collect_list fails with: grouping expressions sequence is empty, and 'callstartcelllabel' is not an aggregate function... Is it even possible to generalize such a filtering case? I don't know yet what such a generic UDAF should look like.
I'm new to Spark.
EDIT:
The current solution I've come to is:
explode the array -> filter out rows with unwanted array members -> group by everything but the array member -> agg(collect_list(col("measurements")))
Would it be faster with a UDF? I can't figure out how to write a generic UDF that filters a generic array, knowing only the filtering values...