I'm using Spark 1.5.2 with Java, and I'm attempting to read in a Parquet file that contains data that originated from a JSON file. I'm having difficulty figuring out how to read a field that originally contained nested JSON but is now a WrappedArray<WrappedArray<String>>. I've looked through the Spark documentation on reading Parquet files, but none of the examples matched what I was looking for. I did some searching and found things that were close, but they were specific to Scala.
Here is an example of the original JSON:
{"page_number":1,"id_groups":[{"ids":["60537"]},{"ids":["65766","7368815"]}]}
The field I'm having trouble reading is id_groups. I read the Parquet file in and called show(). The schema for that field looks like this:
StructField(id_groups,ArrayType(StructType(StructField(ids,ArrayType(StringType,true),true)),true),true)
I'm guessing that I need to create a schema for that field, but I can't figure out how to do that using the Spark Java API.
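For what it's worth, here is a sketch of how that schema could be built explicitly with the Java API, using the DataTypes factory methods. The LongType for page_number is an assumption (Spark's JSON reader infers integers as longs); adjust if your data differs.

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class IdGroupsSchema {
    public static StructType build() {
        // ids: array<string>
        StructField ids = DataTypes.createStructField(
            "ids", DataTypes.createArrayType(DataTypes.StringType, true), true);

        // each element of id_groups is a struct<ids: array<string>>
        StructType idGroup = DataTypes.createStructType(new StructField[] { ids });

        // id_groups: array<struct<ids: array<string>>>
        StructField idGroups = DataTypes.createStructField(
            "id_groups", DataTypes.createArrayType(idGroup, true), true);

        // assumption: page_number inferred as a long by the JSON reader
        StructField pageNumber = DataTypes.createStructField(
            "page_number", DataTypes.LongType, true);

        return DataTypes.createStructType(new StructField[] { pageNumber, idGroups });
    }
}
```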
This post seemed promising (it shows Scala code creating a schema for nested data), but I don't know how to replicate something similar in Java:
spark-specifying-schema-for-nested-json
Any suggestions on how to read the id_groups data from the parquet file?
While stepping through the code in IntelliJ, I can see that the id_groups field is a WrappedArray<WrappedArray<String>>.
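In case it helps frame the question, here is a sketch of the two approaches I've been experimenting with, assuming an existing SQLContext named sqlContext and a placeholder file path "pages.parquet": flattening the array with explode, and walking the rows with the Java-friendly Row.getList getter (which converts the WrappedArray to a java.util.List). Neither requires building a schema by hand, since Parquet carries its own schema.

```java
import java.util.List;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

// "pages.parquet" is a placeholder path; sqlContext is an existing SQLContext
DataFrame df = sqlContext.read().parquet("pages.parquet");

// Option 1: flatten id_groups with explode, then select the inner array
DataFrame ids = df.select(explode(col("id_groups")).as("group"))
                  .select(col("group.ids"));
ids.show();

// Option 2: walk the rows; getList converts each WrappedArray to a java.util.List
for (Row row : df.select("id_groups").collectAsList()) {
    List<Row> groups = row.getList(0);       // each element is a struct row
    for (Row group : groups) {
        List<String> innerIds = group.getList(0); // the nested array<string>
        System.out.println(innerIds);
    }
}
```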