
I want to merge multiple files generated by a Spark job into one file. Usually I'd do something like:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
val deleteSrcFiles = true
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), deleteSrcFiles, hadoopConfig, null)

This runs fine locally with paths like /tmp/some/path/to.csv, but it results in an exception when executed on my cluster (my-cluster):

Wrong FS: gs://myBucket/path/to/result.csv, expected: hdfs://my-cluster-m

Is it possible to get a FileSystem for gs:// paths from Scala/Java code running on a Dataproc cluster?


EDIT

Found the Google Cloud Storage client library: https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-java
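For completeness, merging could also be done with that client library instead of the Hadoop FileSystem API, since GCS supports server-side compose of up to 32 objects. A minimal sketch, assuming the google-cloud-storage Java artifact is on the classpath and using hypothetical bucket and object names:

import com.google.cloud.storage.{BlobInfo, Storage, StorageOptions}

// Uses application default credentials (e.g. the Dataproc VM's service account).
val storage: Storage = StorageOptions.getDefaultInstance.getService

// Hypothetical part files and target object; compose merges them server-side.
val target = BlobInfo.newBuilder("myBucket", "path/to/result.csv").build()
val request = Storage.ComposeRequest.newBuilder()
  .addSource("path/part-00000.csv", "path/part-00001.csv")
  .setTarget(target)
  .build()

storage.compose(request)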

1 Answer


You can only use a Path belonging to a particular filesystem with that filesystem; e.g. you cannot pass a gs:// path to the HDFS FileSystem as you did above.

The following snippet works for me:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileUtil, Path}

val hadoopConfig = new Configuration()
val srcPath = new Path("hdfs:/tmp/foo")
val hdfs = srcPath.getFileSystem(hadoopConfig)   // resolves to the HDFS filesystem
val dstPath = new Path("gs://bucket/foo")
val gcs = dstPath.getFileSystem(hadoopConfig)    // resolves to the GCS connector's filesystem
val deleteSrcFiles = true
FileUtil.copyMerge(hdfs, srcPath, gcs, dstPath, deleteSrcFiles, hadoopConfig, null)
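Tying this back to the question, here is a hedged end-to-end sketch (hypothetical paths and DataFrame) of using this after a Spark write on Dataproc, where the Hadoop Configuration comes from the SparkContext:

import org.apache.hadoop.fs.{FileUtil, Path}
import org.apache.spark.sql.SparkSession

// Hypothetical flow: the Spark job writes part files to HDFS,
// then they are merged into a single object in the bucket.
val spark = SparkSession.builder().appName("merge-example").getOrCreate()
val df = spark.read.csv("gs://myBucket/path/input")   // hypothetical input
df.write.csv("hdfs:/tmp/foo")

val conf = spark.sparkContext.hadoopConfiguration
val srcPath = new Path("hdfs:/tmp/foo")
val dstPath = new Path("gs://myBucket/path/to/result.csv")
FileUtil.copyMerge(srcPath.getFileSystem(conf), srcPath,
  dstPath.getFileSystem(conf), dstPath,
  true, conf, null)   // deleteSource = true removes the HDFS part files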