part-r-xxxxx files in Spark

Question

If I use Spark to write data out to S3 (or HDFS), I get a bunch of part files named like this:

part-r-xxxxx-uuid.snappy.parquet

I understand the xxxxx is a map/reduce task number and generally starts at zero and counts upwards.

Is there any valid, non-error scenario where there would be a part-r-00001 output file but no part-r-00000 output file? Or a part-r-00002 output file but no part-r-00001 file?

I have a Spark job that does multiple append writes to an S3/HDFS directory. I can see two part-r-00002 files but only a single part-r-00001 file. Does this mean that there is an error? Or could that be a completely valid scenario?

One guess is that the data may be split into partitions 0, 1, 2, and that some of those partitions hold no data and so never generate a corresponding output file. Is that true?

EDIT: Here is a specific example. Note how the index numbers go 0,1,31,32. Is this S3 directory listing evidence of a bug? Is there some proof that this is a bug?

2016-10-28 14:22:14    6521048 part-r-00000-a597e173-4e27-4c1a-88c2-2b02150b07fe.avro
2016-10-28 14:16:39 2486221729 part-r-00001-a597e173-4e27-4c1a-88c2-2b02150b07fe.avro
2016-10-28 16:39:24    7044366 part-r-00031-a597e173-4e27-4c1a-88c2-2b02150b07fe.avro
2016-10-28 16:33:50 2460258711 part-r-00032-a597e173-4e27-4c1a-88c2-2b02150b07fe.avro

Tim Tim · Accepted Answer · 2016-11-03T01:37:24

Spark will generally generate a part-r-${taskIndex} file for each task, even when that task's iterator is empty.

Spark creates an empty file called _SUCCESS when it finishes writing. If that file is not there, then something went wrong in the write step. This file is in the same directory as the part-r-xxxxx files.
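As a rough illustration of that check, here is a minimal sketch that looks for the marker in an output directory. It assumes the output lives on a local filesystem (for HDFS or S3 you would go through Hadoop's FileSystem API instead), and hasSuccessMarker is a hypothetical helper name, not something Spark provides.

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper: true if a _SUCCESS marker file exists
// directly inside the given output directory.
def hasSuccessMarker(outputDir: Path): Boolean =
  Files.isRegularFile(outputDir.resolve("_SUCCESS"))

// A throwaway temp directory stands in for a real output path.
val dir = Files.createTempDirectory("pqt_test")
val before = hasSuccessMarker(dir)          // no marker yet
Files.createFile(dir.resolve("_SUCCESS"))   // simulate a finished write
val after = hasSuccessMarker(dir)
println(s"before=$before after=$after")
```

Only the presence of the marker is tested here; the part-r index numbers are not looked at.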

EDIT: I didn't realize you were using write.partitionBy. I just tested this myself:

scala> case class MyData(key: String, value: String)
scala> val rdd = sc.parallelize(Range(0, 100000)).map(x => MyData((x / 1000).toString, "foo"))
scala> rdd.toDF().write.partitionBy("key").parquet("file:///.../pqt_test")

When I inspected the resulting directory structure, I got task files separated by key, just like you:

pqt_test/key=87/part-r-00228-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=87/part-r-00227-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=87/part-r-00226-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=78/part-r-00203-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=78/part-r-00205-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=78/part-r-00202-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=78/part-r-00204-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=71/part-r-00184-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=71/part-r-00187-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=71/part-r-00185-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=71/part-r-00186-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=40/part-r-00105-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=40/part-r-00104-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=40/part-r-00106-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=33/part-r-00085-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=33/part-r-00088-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=33/part-r-00086-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=33/part-r-00087-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=65/part-r-00169-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=65/part-r-00170-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=65/part-r-00171-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=12/part-r-00033-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=12/part-r-00032-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=12/part-r-00031-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=19/part-r-00051-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=19/part-r-00050-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=19/part-r-00049-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=39/part-r-00103-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=39/part-r-00102-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=39/part-r-00101-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=58/part-r-00153-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=58/part-r-00152-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=58/part-r-00150-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
pqt_test/key=58/part-r-00151-5c1a24f5-09cb-4faf-99a6-eeb568cd9018.gz.parquet
...

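Conclusion: this is fine. With partitionBy, each task only writes a file into the key=... folders for which it actually has rows, so any single folder shows just a subset of the task indices and gaps in the numbering are expected. As long as you have a _SUCCESS file in the same directory as the key=... folders, your write was successful.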
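To see the same thing from a directory listing, the sketch below groups the task indices by folder and checks for the marker. It works on plain path strings copied from the listing above; taskIndicesByDir is a hypothetical helper, and the regex assumes the five-digit part-r-xxxxx naming shown in this thread.

```scala
// Matches names like part-r-00031-<uuid>.gz.parquet and captures the task index.
val PartFile = """part-r-(\d{5})-.*""".r

// Hypothetical helper: sorted task indices found in each directory of a listing.
def taskIndicesByDir(paths: Seq[String]): Map[String, Seq[Int]] =
  paths
    .map { p =>
      val slash = p.lastIndexOf('/')
      (p.substring(0, math.max(slash, 0)), p.substring(slash + 1))
    }
    .collect { case (dir, PartFile(idx)) => dir -> idx.toInt }
    .groupBy(_._1)
    .map { case (dir, pairs) => dir -> pairs.map(_._2).sorted }

val uuid = "5c1a24f5-09cb-4faf-99a6-eeb568cd9018"
val listing = Seq(
  s"pqt_test/key=12/part-r-00033-${uuid}.gz.parquet",
  s"pqt_test/key=12/part-r-00032-${uuid}.gz.parquet",
  s"pqt_test/key=12/part-r-00031-${uuid}.gz.parquet",
  s"pqt_test/key=58/part-r-00153-${uuid}.gz.parquet",
  s"pqt_test/key=58/part-r-00152-${uuid}.gz.parquet",
  s"pqt_test/key=58/part-r-00150-${uuid}.gz.parquet",
  s"pqt_test/key=58/part-r-00151-${uuid}.gz.parquet",
  "pqt_test/_SUCCESS" // assumed present, as after a successful write
)

val indices = taskIndicesByDir(listing)
val succeeded = listing.contains("pqt_test/_SUCCESS")
indices.toSeq.sortBy(_._1).foreach { case (dir, idxs) =>
  println(s"$dir -> ${idxs.mkString(", ")}")
}
println(s"_SUCCESS present: $succeeded")
```

No folder starts at index 0 here, and that is not treated as an error; the only signal used for success is the _SUCCESS marker.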
