2 votes

I'm trying to execute a map-reduce job using Spark 1.6 (spark-1.6.0-bin-hadoop2.4.tgz) that reads input from and writes output to S3.

The reads are working just fine with: sc.textFile("s3n://bucket/path/to/file/file.gz")

However, I'm having a bunch of trouble getting the writes to work. I'm using the same bucket to output the files: outputRDD.saveAsTextFile("s3n://bucket/path/to/output/")
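To give a sense of the shape of the job, here's a simplified sketch (the split/filter/join steps are stand-ins for my actual parsing and business logic):

val input = sc.textFile("s3n://bucket/path/to/file/file.gz")
// placeholder transformations standing in for my real map-reduce logic
val parsed    = input.map(line => line.split("\t"))
val filtered  = parsed.filter(fields => fields.nonEmpty)
val outputRDD = filtered.map(fields => fields.mkString(","))
outputRDD.saveAsTextFile("s3n://bucket/path/to/output/")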

When my input is extremely small (< 100 records), this seems to work fine. I see one part-NNNNN file written per partition, with some of those files having 0 bytes and the rest being under 1 KB. Spot-checking the non-empty files shows the correctly formatted map-reduce output. When I move to a slightly bigger input (~500 records), I see the same number of part-NNNNN files (the number of partitions is constant across these experiments), but each one is empty.

When I was experimenting with much bigger data sets (millions of records), my first thought was that I was hitting some S3 limit and that this was causing the problem. However, 500 records (which amounts to ~65 KB zipped) is still a trivially small amount of data that I would expect Spark and S3 to handle easily.

I've tried using the S3 Block FileSystem instead of the S3 Native FileSystem as outlined here, but I get the same results. I've also turned on logging for my S3 bucket, but I can't find a smoking gun there.
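For reference, switching to the block filesystem amounted to something like this (a sketch; the credential keys are the standard Hadoop ones and the bucket/path are placeholders):

// s3:// is the S3 Block FileSystem, s3n:// is the S3 Native FileSystem
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "<ACCESS_KEY>")
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "<SECRET_KEY>")
outputRDD.saveAsTextFile("s3://bucket/path/to/output/")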

Has anyone else experienced this? Or can otherwise give me a clue as to what might be going wrong?

Typically, empty output means that the partitions are empty. Does the problem persist when you use repartition(n) before saveAsTextFile? If not, you have your answer. If it does, please post code that can be used to reproduce the problem. See on-topic and How to create a Minimal, Complete, and Verifiable example. – zero323
Yes, that turned out to be the answer. I was filtering out all my records in my business logic, which seems obvious now that I think about it. Thanks for the response. – Tim K

2 Answers

0 votes

Turns out I was working on this too late at night. This morning, I took a step back and found a bug in my map-reduce logic that was effectively filtering out all the results.
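In case it helps anyone else: a cheap sanity check before the save makes this kind of over-filtering obvious (a sketch; outputRDD is whatever RDD you're about to write):

// count()/isEmpty() force the pipeline to run, so an unexpected zero here
// points at the map/filter logic rather than at S3 (only worth the extra
// recomputation while debugging)
println(s"records to write: ${outputRDD.count()}")
if (outputRDD.isEmpty()) {
  println("WARNING: result is empty; only empty part-NNNNN files will be written")
}
outputRDD.saveAsTextFile("s3n://bucket/path/to/output/")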

-3 votes

You should use coalesce before saveAsTextFile.

From the Spark programming guide:

Decrease the number of partitions in the RDD to numPartitions. Useful
for running operations more efficiently after filtering down a large
dataset.

e.g.:

outputRDD.coalesce(100).saveAsTextFile("s3n://bucket/path/to/output/")