I am trying to read an Elasticsearch index that has millions of docs, each with a variable number of fields. I have a schema with thousands of fields, each with its own name and type.
When I create an RDD through the ES-Hadoop connector and later convert it into a DataFrame by specifying the schema, it fails with:
Input row doesn't have expected number of values required by the schema
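Roughly, the pipeline looks like this. It is a simplified sketch rather than my exact code: the nodes, index name and the two schema fields are placeholders, and I build each row straight from whatever fields the document happens to have:

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{MapWritable, Text}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.elasticsearch.hadoop.mr.EsInputFormat

// ES-Hadoop configuration (placeholder values)
val conf = new Configuration()
conf.set("es.nodes", "localhost:9200")
conf.set("es.resource", "my_index/my_type")

// Key = document id (Text), value = document fields (MapWritable)
val esRDD = sc.newAPIHadoopRDD(
  conf,
  classOf[EsInputFormat[Text, MapWritable]],
  classOf[Text],
  classOf[MapWritable])

// My real schema has thousands of fields; only two shown here
val schema = StructType(Seq(
  StructField("field_a", StringType, nullable = true),
  StructField("field_b", LongType,   nullable = true)
  // ... thousands more
))

// Naive conversion: one Row per doc containing only the fields that
// doc actually has, so row lengths vary from document to document
val rowRDD = esRDD.map { case (_, doc) =>
  Row.fromSeq(doc.values().asScala.map(_.toString).toSeq)
}

// Fails with the error above whenever a doc has fewer fields than the schema
val df = sqlContext.createDataFrame(rowRDD, schema)
```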
I have a few questions:

1. Is it possible to have an RDD/DataFrame whose rows contain a variable number of fields? If not, what is the alternative, other than filling every missing field with a null value?
2. I see that by default Spark converts everything into `StringType`, since I use the `sc.newAPIHadoopRDD()` call. How can I cast the values to the correct types based on the field names I have in my schema? Some kind of mapping? (See the sketch after this list for what I have in mind.)
3. I want to write this out in Parquet format with the schema embedded in the file. What happens to the fields that are missing relative to the schema, which has thousands of fields?
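To make questions 1 and 2 concrete, this is the kind of null-filling and type mapping I have in mind, continuing from the sketch above. Again it is only a sketch: the `cast` helper, the handled types and the output path are placeholders, and there is no error handling:

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Cast a raw string value to the type declared in my schema
// (only a few types shown; the real schema has many more)
def cast(value: String, dataType: DataType): Any = dataType match {
  case IntegerType => value.toInt
  case LongType    => value.toLong
  case DoubleType  => value.toDouble
  case BooleanType => value.toBoolean
  case _           => value            // fall back to the raw string
}

val fields = schema.fields             // the full schema from above

// Align every document to the full schema: null for absent fields,
// a casted value for present ones, so every Row has the same length
val alignedRDD = esRDD.map { case (_, doc) =>
  val byName = doc.entrySet().asScala
    .map(e => e.getKey.toString -> e.getValue.toString)
    .toMap
  Row.fromSeq(fields.map { f =>
    byName.get(f.name) match {
      case Some(v) => cast(v, f.dataType)
      case None    => null             // missing field -> null in that column
    }
  }.toSeq)
}

val df = sqlContext.createDataFrame(alignedRDD, schema)

// Parquet output carries the full schema; missing fields become nulls
df.write.parquet("/path/to/output")
```

Is this alignment-and-cast approach the right way to go, or is there something built into Spark or ES-Hadoop that handles sparse documents against a wide schema?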