I have a Spark DataFrame where one of the columns holds a Seq[(String, String, String)] per row. I'm trying to do some kind of flatMap over it, but anything I try ends up throwing
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple3
I can take a single row or multiple rows out of the DF just fine:
df.map{ r => r.getSeq[Feature](1)}.first
returns
Seq[(String, String, String)] = WrappedArray([ancient,jj,o], [olympia_greece,nn,location] .....
and the type of the resulting RDD looks correct:
org.apache.spark.rdd.RDD[Seq[(String, String, String)]]
The schema of the DF is:
root
|-- article_id: long (nullable = true)
|-- content_processed: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- lemma: string (nullable = true)
| | |-- pos_tag: string (nullable = true)
| | |-- ne_tag: string (nullable = true)
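If a self-contained example helps, a frame with the same schema can be built by hand roughly like this (a sketch in spark-shell terms, so sc and sqlContext are assumed to be in scope; the data is illustrative, and the Feature alias in my snippets is just the tuple type):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// my alias for the tuple type used in the snippets in this question
type Feature = (String, String, String)

// hand-built frame with the same schema as my real DF (illustrative data)
val schema = StructType(Seq(
  StructField("article_id", LongType),
  StructField("content_processed", ArrayType(StructType(Seq(
    StructField("lemma", StringType),
    StructField("pos_tag", StringType),
    StructField("ne_tag", StringType)))))))

val df = sqlContext.createDataFrame(
  sc.parallelize(Seq(
    Row(1L, Seq(Row("ancient", "jj", "o"), Row("olympia_greece", "nn", "location"))))),
  schema)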
I know this problem is related to Spark SQL handing back the array elements as org.apache.spark.sql.Row even though the reported type claims they are Seq[(String, String, String)]. There's a related question (linked below), but its answer doesn't work for me, and I'm not familiar enough with Spark to turn it into a working solution.
Are the rows Row[Seq[(String, String, String)]], Row[(String, String, String)], Seq[Row[(String, String, String)]], or something even crazier?
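One thing I can do is probe the runtime class of an array element (untested sketch); I suspect it reports GenericRowWithSchema rather than a Tuple3, which would line up with the exception:
// probe the runtime class of the second array element
df.map(r => r.getSeq[Any](1)(1).getClass.getName).first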
I'm trying to do something like
df.map{ r => r.getSeq[Feature](1)}.map(_(1)._1)
which appears to work, but only because nothing has actually been evaluated yet; forcing it with an action, e.g.
df.map{ r => r.getSeq[Feature](1)}.map(_(1)._1).first
throws the error above. So how am I supposed to (for instance) get the first element of the second tuple in each row?
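Is the intended approach something like the sketch below, treating each array element as a Row and rebuilding the tuple by field position? (Untested guesswork on my part; pulling the fields with getString is just my assumption from the schema above.)
import org.apache.spark.sql.Row

// sketch: read each struct element as a Row and rebuild the tuple by position
df.map { r =>
  r.getSeq[Row](1).map(f => (f.getString(0), f.getString(1), f.getString(2)))
}.map(_(1)._1).first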
Also, why has Spark been designed to behave this way? It seems wrong to claim that something is of one type when it is in fact something else that cannot be converted to the claimed type.
Related question: GenericRowWithSchema exception in casting ArrayBuffer to HashSet in DataFrame to RDD from Hive table
Related bug report: http://search-hadoop.com/m/q3RTt2bvwy19Dxuq1&subj=ClassCastException+when+extracting+and+collecting+DF+array+column+type