0 votes

I'm a beginner in Spark. I have a scenario where multiple sources of data arrive at different points in time for an analysis. Can two Spark jobs use a single HDFS/S3 store at the same time? One job would write the latest data to S3/HDFS, and the other would read it, along with input data from another source, for analysis.

2
Your title says: "Can 2 Spark jobs use a single HDFS/S3 storage simultaneously?" but your description references multiple sources. Is your question about accessing one data source from two jobs, or [something else]? - Matt Andruff

2 Answers

0 votes

Yes, you can write to and read from the same data source. Data only becomes visible to readers once a write is complete (this holds for both HDFS and S3).
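A minimal sketch of that pattern in PySpark, assuming a placeholder bucket my-bucket, Parquet files, and a join key named id (all illustrative, not from the question):

    from pyspark.sql import SparkSession

    # Writer job: lands the latest batch at a shared S3 path.
    spark = SparkSession.builder.appName("writer-job").getOrCreate()

    latest = spark.read.json("s3a://my-bucket/incoming/latest.json")
    # Appending new files means a concurrent reader sees either the old
    # file listing or the new one, never a partially written file.
    latest.write.mode("append").parquet("s3a://my-bucket/shared/latest/")

    # Reader job (run separately): joins the shared data with a second source.
    spark = SparkSession.builder.appName("reader-job").getOrCreate()

    shared = spark.read.parquet("s3a://my-bucket/shared/latest/")
    other = spark.read.parquet("hdfs:///data/other_source/")
    shared.join(other, on="id", how="inner") \
        .write.mode("overwrite").parquet("s3a://my-bucket/output/analysis/")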

0 votes

To use both file systems in one job, include the protocol (URI scheme) in each file path.

e.g. spark.read.load("s3a://bucket/file") and/or df.write.save("hdfs:///tmp/data")
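Put together, a single job can read from one store and write to the other; the bucket and paths below are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mixed-fs").getOrCreate()

    # Read from S3 via the s3a connector, write the result to HDFS.
    # load/save use the default source format (Parquet) when none is given.
    df = spark.read.load("s3a://bucket/file")
    df.write.save("hdfs:///tmp/data")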

Alternatively, you can use S3 directly in place of HDFS by setting fs.defaultFS to your bucket.
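A minimal sketch of that setting, assuming a placeholder bucket my-bucket (credentials and the hadoop-aws dependency are out of scope here):

    from pyspark.sql import SparkSession

    # spark.hadoop.* entries are passed through to the Hadoop configuration,
    # so this sets fs.defaultFS for the job.
    spark = (
        SparkSession.builder
        .appName("s3-default-fs")
        .config("spark.hadoop.fs.defaultFS", "s3a://my-bucket")
        .getOrCreate()
    )

    # With no scheme in the path, it now resolves against the S3 bucket,
    # i.e. s3a://my-bucket/shared/latest/
    df = spark.read.parquet("/shared/latest/")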