I am trying to load a Parquet file in Spark as a DataFrame:
val df = spark.read.parquet(path)
I am getting:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 12, 10.250.2.32): java.lang.UnsupportedOperationException: Complex types not supported.
While going through the code, I realized there is a check in Spark's VectorizedParquetRecordReader.java (initializeInternal):
Type t = requestedSchema.getFields().get(i);
if (!t.isPrimitive() || t.isRepetition(Type.Repetition.REPEATED)) {
throw new UnsupportedOperationException("Complex types not supported.");
}
So I think it is failing on that isPrimitive / isRepetition check. Can anybody suggest a way to solve this issue?
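One workaround I can try (a sketch, assuming a Spark 2.x SparkSession named spark; I am not sure it addresses the root cause) is to disable the vectorized reader so this check is skipped and the row-based Parquet reader is used instead:

// Fall back to the non-vectorized Parquet reader, which handles complex types.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
val df = spark.read.parquet(path)
df.show(5) // force an actual read to see whether the exception goes away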
My Parquet data looks like this:
Key1 = value1
Key2 = value1
Key3 = value1
Key4:
.list:
..element:
...key5:
....list:
.....element:
......certificateSerialNumber = dfsdfdsf45345
......issuerName = CN=Microsoft Windows Verification PCA, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
......subjectName = CN=Microsoft Windows, OU=MOPR, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
......thumbprintAlgorithm = Sha1
......thumbprintContent = sfdasf42dsfsdfsdfsd
......validFrom = 2009-12-07 21:57:44.000000
......validTo = 2011-03-07 21:57:44.000000
....list:
.....element:
......certificateSerialNumber = dsafdsafsdf435345
......issuerName = CN=Microsoft Root Certificate Authority, DC=microsoft, DC=com
......subjectName = CN=Microsoft Windows Verification PCA, O=Microsoft Corporation, L=Redmond, S=Washington, C=US
......thumbprintAlgorithm = Sha1
......thumbprintContent = sdfsdfdsf43543
......validFrom = 2005-09-15 21:55:41.000000
......validTo = 2016-03-15 22:05:41.000000
I suspect Key4 may be raising the issue because of the nested tree. The input data is JSON, so maybe Parquet doesn't understand those deeply nested levels the way JSON does.
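To check how Spark actually interprets Key4, I can print the inferred schema without running a job (a quick sketch; printSchema only needs the footer metadata, so it does not go through the vectorized reader):

// If the file is read correctly, Key4 should appear as a nested array of structs.
val df = spark.read.parquet(path)
df.printSchema()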
I found a related bug report, https://issues.apache.org/jira/browse/HIVE-13744, but it describes a Hive complex-type issue. I am not sure whether that also covers this Parquet problem.
Update 1: After exploring the Parquet files further, I concluded the following:
spark.write created 5 Parquet part files. Two of them are empty, so in those files the schema for a column that was supposed to be ArrayType comes out as StringType, and when I try to read the whole directory at once I see the above exception.
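Here is how I can confirm the per-file schema mismatch and read around the bad files (a sketch; the column name Key4 and the filtering logic are assumptions based on my data, not a confirmed fix):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.types.ArrayType

// List the part files Spark wrote under `path`.
val fs = new Path(path).getFileSystem(spark.sparkContext.hadoopConfiguration)
val partFiles = fs.listStatus(new Path(path))
  .map(_.getPath.toString)
  .filter(_.endsWith(".parquet"))

// Print each file's own schema; in my case the empty files report
// the ArrayType column as string.
partFiles.foreach { f =>
  println(s"=== $f ===")
  spark.read.parquet(f).printSchema()
}

// Keep only the files whose Key4 column really is an array, then read those.
val goodFiles = partFiles.filter { f =>
  spark.read.parquet(f).schema.fields.exists(field =>
    field.name == "Key4" && field.dataType.isInstanceOf[ArrayType])
}
val df = spark.read.parquet(goodFiles: _*)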