I have been running my Spark job on a local cluster, which has HDFS from where the input is read and to where the output is written. Now I have set up an AWS EMR cluster and an S3 bucket that holds my input, and I want the output to be written to S3 as well.
The error:
User class threw exception: java.lang.IllegalArgumentException: Wrong FS: s3://something/input, expected: hdfs://ip-some-numbers.eu-west-1.compute.internal:8020
I searched for this issue and found several questions about it. Some answers suggested that it only concerns the output, but even when I disable the output I get the same error.
Another suggestion is that there is something wrong with how FileSystem is used in my code. Here are all of the occurrences of input/output in my program:
The first occurrence is in my custom FileInputFormat, in getSplits(JobContext job), which I have not actually modified myself but could:
FileSystem fs = path.getFileSystem(job.getConfiguration());
A similar case is in my custom RecordReader, which I also have not modified myself:
final FileSystem fs = file.getFileSystem(job);
In nextKeyValue() of my custom RecordReader, which I have written myself, I use:
FileSystem fs = FileSystem.get(jc);
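From what I have read, FileSystem.get(conf) always returns the filesystem configured as fs.defaultFS, which on EMR is HDFS, and that would explain the "Wrong FS" error for an s3:// path. My guess is that I should resolve the filesystem from the split's path instead, something like this (assuming file is the Path of the current split, as in the snippet above), but I am not sure it is the right approach:

// resolve the filesystem from the path's scheme (s3:// or hdfs://) rather than the default FS
FileSystem fs = file.getFileSystem(jc);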
And finally, when I want to count the number of files in a folder, I use:
val fs = FileSystem.get(sc.hadoopConfiguration)
val status = fs.listStatus(new Path(path))
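I suspect this has the same problem: FileSystem.get(sc.hadoopConfiguration) returns the default (HDFS) filesystem even when path points to S3. Would resolving the filesystem from the path itself, as below, be the right fix here?

import org.apache.hadoop.fs.Path

// let the scheme of `path` (s3:// vs hdfs://) decide which FileSystem implementation is used
val fs = new Path(path).getFileSystem(sc.hadoopConfiguration)
val status = fs.listStatus(new Path(path))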
I assume the issue is in my code, but how can I modify the FileSystem calls to support input and output from S3?
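If it helps, my current guess at the general pattern is to stop asking for the default filesystem and instead derive it from the URI of the path being read or written, along these lines (a rough sketch with a made-up bucket name, not yet tested on EMR):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// hypothetical example path; the real one comes from the job arguments
val inputPath = "s3://my-bucket/input"

// FileSystem.get(conf) alone would return HDFS here; passing the URI selects the S3 filesystem
val fs = FileSystem.get(new URI(inputPath), sc.hadoopConfiguration)
val status = fs.listStatus(new Path(inputPath))

Is this the right direction, or am I missing something?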