I'm trying to execute a map-reduce job using Spark 1.6 (spark-1.6.0-bin-hadoop2.4.tgz) that reads input from and writes output to S3.
The reads are working just fine with: sc.textFile("s3n://bucket/path/to/file/file.gz")
However, I'm having a bunch of trouble getting the writes to work. I'm using the same bucket for the output files: outputRDD.saveAsTextFile("s3n://bucket/path/to/output/")
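For reference, this is roughly what the job looks like end to end (the bucket, paths, and the pass-through map are placeholders standing in for the real logic; credentials are supplied through the standard Hadoop fs.s3n.* properties):

    import org.apache.spark.{SparkConf, SparkContext}

    object S3WriteJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("S3WriteJob"))

        // Credentials for the s3n:// (native) filesystem, read from the
        // environment here purely for illustration.
        sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
        sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

        // Read the gzipped input (path is a placeholder).
        val inputRDD = sc.textFile("s3n://bucket/path/to/file/file.gz")

        // Stand-in for the actual map-reduce logic in the real job.
        val outputRDD = inputRDD.map(line => line)

        // One part-NNNNN file is written per partition of outputRDD.
        outputRDD.saveAsTextFile("s3n://bucket/path/to/output/")

        sc.stop()
      }
    }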
When my input is extremely small (< 100 records), this seems to work fine. I'm seeing a part-NNNNN file written per partition, with some of those files having 0 bytes and the rest being under 1 KB. Spot-checking the non-empty files shows the correctly formatted map-reduce output. When I move to a slightly bigger input (~500 records), I'm seeing the same number of part-NNNNN files (my number of partitions is constant for these experiments), but each one is empty.
When I was experimenting with much bigger data sets (millions of records), my thought was that I was exceeding some S3 limit, which was causing this problem. However, 500 records (which amounts to ~65 KB gzipped) is still a trivially small amount of data that I would think Spark and S3 should handle easily.
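To narrow down where the records disappear, one check is to count records per partition immediately before the write; non-zero counts paired with empty part-NNNNN files would point at the S3 write itself rather than the job. A rough sketch (outputRDD is the RDD being saved, as in the snippet above):

    // Count records in each partition just before saveAsTextFile: non-zero
    // counts combined with empty part-NNNNN files implicate the S3 write.
    val perPartitionCounts = outputRDD
      .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
      .collect()

    perPartitionCounts.foreach { case (idx, count) =>
      println(s"partition $idx: $count records")
    }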
I've tried using the S3 Block FileSystem instead of the S3 Native FileSystem as outlined here, but I get the same results. I've turned on logging for my S3 bucket, but I can't seem to find a smoking gun there.
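Switching between the two is mostly a matter of changing the URI scheme and the corresponding credential properties (the block filesystem reads fs.s3.* rather than fs.s3n.*); a sketch, assuming sc is the same SparkContext as above:

    // The S3 block filesystem (s3:// scheme) reads its credentials from the
    // fs.s3.* properties rather than fs.s3n.*.
    sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Same job as before, only the URI scheme changes.
    val blockOutputRDD = sc.textFile("s3://bucket/path/to/file/file.gz").map(line => line)
    blockOutputRDD.saveAsTextFile("s3://bucket/path/to/output/")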
Has anyone else experienced this, or can anyone otherwise give me a clue as to what might be going wrong?
Do you repartition(n) before saveAsTextFile? If so, you have your answer. If not, please post code which can be used to reproduce the problem. See on-topic and How to create a Minimal, Complete, and Verifiable example. – zero323
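For context on that comment: saveAsTextFile writes one part-NNNNN file per partition of the RDD being saved, and a partition that holds no records comes out as a 0-byte file. A minimal sketch (the output path is a placeholder, as in the question):

    // The number of part-NNNNN files matches the number of partitions of the
    // RDD being saved; a partition that holds no records becomes a 0-byte file.
    println(s"writing ${outputRDD.partitions.length} part files")

    // For a tiny result, coalescing down to a single partition gives one
    // non-empty part-00000 file instead of many mostly-empty ones.
    outputRDD.coalesce(1).saveAsTextFile("s3n://bucket/path/to/output/")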