
I have the following schema -

[name: StringType, grades: ArrayType(StructType(StructField(subject_grades, ArrayType(StructType(StructField(subject, StringType, false), StructField(grade, LongType, false))))))]

I want to group by the subject field inside the subject_grades array, which is itself inside the grades array.

I tried

sql.sql("select ... from grades_table group by grades.subject_grades.subject") 

but I get

org.apache.spark.sql.AnalysisException: cannot resolve 'grades.subject_grades[subject]' due to data type mismatch: argument 2 requires integral type, however, 'subject' is of string type.;

I understand why I get this error; however, I was hoping I could avoid exploding the entire thing in order to group by the inner field.


1 Answer


Arrays are (relatively) hard to work with and call for explode (or flatMap) whenever the main query needs the elements inside them, e.g. for grouping.

Something I learnt from the question is that, with subject_grades being of type ArrayType, the following clause is translated into an array lookup with subject as the index, hence the requirement for an integral type.

group by grades.subject_grades.subject

I see no other way but to use explode (or flatMap) to "destructure" the subject_grades array and do the grouping.