If I use Spark to write data out to S3 (or HDFS), I get a set of part files named like:
part-r-xxxxx-uuid.snappy.parquet
I understand that xxxxx is the map/reduce task (i.e. partition) number, which generally starts at zero and counts upward.
Is there any valid, non-error scenario where there would be a part-r-00001 output file but no part-r-00000 output file? Or a part-r-00002 output file but no part-r-00001 file?
I have a Spark job that performs multiple append writes to an S3/HDFS directory. I can see two part-r-00002 files but only a single part-r-00001 file. Does this indicate an error, or could it be a completely valid scenario?
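For context, here is a minimal sketch of the append pattern I mean (this is not my actual job; the path and names are made up). My understanding is that each append is a separate write job with its own UUID suffix, so two jobs could each emit a file with the same index:

```scala
import org.apache.spark.sql.SparkSession

object AppendSketch extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("append-sketch")
    .getOrCreate()
  import spark.implicits._

  val out = "/tmp/append-sketch" // stand-in for the real S3/HDFS directory

  // Two separate append jobs writing to the same directory.
  (1 to 10).toDF("v").repartition(3).write.mode("append").parquet(out)
  (11 to 20).toDF("v").repartition(3).write.mode("append").parquet(out)

  // Listing `out` shows two files per index, each pair differing only in
  // the per-job UUID (the exact file-name pattern varies by Spark version).
  spark.stop()
}
```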
One guess is that the data is split across partitions 0, 1, 2, and so on, and that a partition which happens to contain no data produces no corresponding output file. Is that true?
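Here is a rough way I could test that guess locally (again my own sketch, not the production job; whether empty partitions are skipped may depend on the output format, committer, and Spark version):

```scala
import org.apache.spark.sql.SparkSession

object EmptyPartitionSketch extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("empty-partition-sketch")
    .getOrCreate()
  import spark.implicits._

  // Build 4 partitions, then drop every row from partitions 1 and 2,
  // leaving only partitions 0 and 3 with data.
  val rdd = spark.sparkContext
    .parallelize(1 to 100, numSlices = 4)
    .mapPartitionsWithIndex { (idx, it) =>
      if (idx == 1 || idx == 2) Iterator.empty else it
    }

  rdd.toDF("value").write.parquet("/tmp/empty-partition-sketch")

  // Now list /tmp/empty-partition-sketch: if empty partitions produce no
  // part files, the surviving indexes should be sparse (0 and 3 only).
  spark.stop()
}
```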
EDIT: Here is a specific example. Note how the index numbers jump from 0 and 1 straight to 31 and 32. Is this S3 directory listing evidence of a bug? Is there any way to prove that it is (or is not)?
2016-10-28 14:22:14 6521048 part-r-00000-a597e173-4e27-4c1a-88c2-2b02150b07fe.avro
2016-10-28 14:16:39 2486221729 part-r-00001-a597e173-4e27-4c1a-88c2-2b02150b07fe.avro
2016-10-28 16:39:24 7044366 part-r-00031-a597e173-4e27-4c1a-88c2-2b02150b07fe.avro
2016-10-28 16:33:50 2460258711 part-r-00032-a597e173-4e27-4c1a-88c2-2b02150b07fe.avro
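One check I can think of (my own idea, with a hypothetical path): read the whole directory back and compare the row count against what the job should have produced, since a gap in the part numbers alone doesn't prove records were lost:

```scala
import org.apache.spark.sql.SparkSession

object CountCheck extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("count-check")
    .getOrCreate()

  // Hypothetical path; the "avro" format may need the separate spark-avro
  // package on older Spark versions (it is built in from Spark 2.4 onward).
  val readBack = spark.read.format("avro").load("s3a://my-bucket/output/")
  println(s"total rows: ${readBack.count()}")
  spark.stop()
}
```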