I want to merge multiple files generated by a Spark job into one file. Usually I'd do something like:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// merge all part files under srcPath into a single file at dstPath
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
val deleteSrcFiles = true
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), deleteSrcFiles, hadoopConfig, null)
This runs fine locally, using paths like /tmp/some/path/to.csv, but results in an exception when executed on the Dataproc cluster my-cluster:
Wrong FS: gs://myBucket/path/to/result.csv, expected: hdfs://my-cluster-m
Is it possible to get a FileSystem for gs:// paths from Scala/Java code running on a Dataproc cluster?
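For context, FileSystem.get(hadoopConfig) returns the filesystem for the default scheme (fs.defaultFS), which on the Dataproc cluster is hdfs://my-cluster-m, hence the "Wrong FS" error for a gs:// destination. Below is a minimal sketch of what I think should work instead, resolving each FileSystem from its own path (srcPath and dstPath as above; I haven't verified this on the cluster):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val hadoopConfig = new Configuration()
val srcDir = new Path(srcPath)     // e.g. an hdfs:// or gs:// directory of part files
val dstFile = new Path(dstPath)    // e.g. gs://myBucket/path/to/result.csv
// each Path resolves to the FileSystem matching its own scheme
val srcFs = srcDir.getFileSystem(hadoopConfig)
val dstFs = dstFile.getFileSystem(hadoopConfig)
FileUtil.copyMerge(srcFs, srcDir, dstFs, dstFile, true, hadoopConfig, null)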
EDIT
Found the Google Cloud Storage client library: https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-java
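If I end up using that client instead of the Hadoop FileSystem API, the approach I have in mind is a server-side compose of the part objects into one object (a sketch only; the bucket and object names below are made-up placeholders):

import com.google.cloud.storage.{BlobInfo, Storage, StorageOptions}

val storage: Storage = StorageOptions.getDefaultInstance.getService
// hypothetical object names: the part files Spark wrote under a gs:// prefix
val parts = Seq("path/to/result.csv/part-00000", "path/to/result.csv/part-00001")
val target = BlobInfo.newBuilder("myBucket", "path/to/result.csv").build()
val request = Storage.ComposeRequest.newBuilder()
  .addSource(parts: _*)
  .setTarget(target)
  .build()
// compose happens server-side in GCS, no bytes are downloaded
storage.compose(request)

GCS compose accepts at most 32 source objects per request, so a job that writes more part files would need chained compose calls.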