13
votes

After repartitioning a DataFrame in Spark 1.3.0, I get a Parquet exception when saving to Amazon's S3.

logsForDate
    .repartition(10)
    .saveAsParquetFile(destination) // <-- Exception here

The exception I receive is:

java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: COLUMN
at parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:137)
at parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:129)
at parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:173)
at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:152)
at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:635)
at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:649)
at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:649)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I would like to know what the problem is and how to solve it.

Do you get the error every time or just sometimes? Do you also get it for smaller files? Do you only get it on S3 or other file systems as well? Have you tried Apache Spark 1.3.1? Its release notes mention some Parquet-related fixes. - Daniel Darabos
I get the error all the time when working above a certain file size. I have only tried S3. I have tried 1.3.0.d. - Interfector
I am able to reproduce this error with Spark 1.3.1 on EMR, writing to S3. Using the old Parquet api (sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")) does not help. Writing to HDFS works fine. - Eric Eijkelenboom
Did you try using a bucket in the us-west-1 region, or using EMRFS? - Gaurav Shah
@Interfector were you able to solve this problem? I have the same issue. - User12345

3 Answers

4
votes

I can actually reproduce this problem with Spark 1.3.1 on EMR, when saving to S3.

However, saving to HDFS works fine. You could save to HDFS first, and then use e.g. s3distcp to move the files to S3.
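
A rough sketch of that workaround (the HDFS staging path and bucket name below are hypothetical, and s3-dist-cp is assumed to be available on the EMR master):

logsForDate
    .repartition(10)
    .saveAsParquetFile("hdfs:///tmp/logsForDate.parquet") // write to HDFS first

// then, from the shell on the EMR master (not Scala), copy the output to S3:
//   s3-dist-cp --src hdfs:///tmp/logsForDate.parquet --dest s3://my-bucket/logsForDate.parquet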

1
vote

I faced this error when calling saveAsParquetFile into HDFS. It was caused by the datanode socket write timeout, so I changed it to a longer value in the Hadoop settings:

<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>3000000</value>
</property>
<property>
  <name>dfs.socket.timeout</name>
  <value>3000000</value>
</property> 

Hope this helps, if you can set a similar timeout for S3.
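
If editing the Hadoop config files is not an option, here is a sketch of setting the same properties programmatically on the Hadoop configuration that Spark uses (assuming sc is your SparkContext; I have not verified whether this helps when writing to S3):

// apply the longer socket timeouts (in milliseconds) without touching hdfs-site.xml
sc.hadoopConfiguration.set("dfs.datanode.socket.write.timeout", "3000000")
sc.hadoopConfiguration.set("dfs.socket.timeout", "3000000")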

1
vote

Are you sure this is not due to SPARK-6351 ("Wrong FS" upon saving parquet to S3)? If it is, then it has nothing to do with repartitioning, and it has been fixed in Spark 1.3.1. If, however, like me, you are stuck with Spark 1.3.0 because you are using CDH 5.4.0, last night I figured out a way to work around it directly from the code (no config-file change):

spark.hadoopConfiguration.set("fs.defaultFS", "s3n://mybucket")

After that, I could save parquet files to S3 without problem.
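
For context, a minimal sketch of how this fits around the code from the question (assuming sc is your SparkContext; the bucket and destination path are hypothetical):

sc.hadoopConfiguration.set("fs.defaultFS", "s3n://mybucket") // set before writing

logsForDate
    .repartition(10)
    .saveAsParquetFile("s3n://mybucket/path/to/destination")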

Note, however, that this has several drawbacks. I think (though I didn't try) that it will then fail to write to any filesystem other than S3, and perhaps also to another bucket. It might also force Spark to write temporary files to S3 rather than locally, but I haven't checked that either.