When you run Hadoop on Google Compute Engine with the Google Cloud Storage connector for Hadoop configured as the "default filesystem", GCS can be treated exactly like HDFS, including for use in the DistributedCache. To access files in Google Cloud Storage, you use the same commands and paths you would use with HDFS; nothing else needs to change. For example, if you deployed your cluster with the GCS connector's CONFIGBUCKET set to foo-bucket and you have local files you want to place in the DistributedCache, you'd do:
# Copies mylib.jar into gs://foo-bucket/myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
And in your Hadoop job:
JobConf job = new JobConf();
// Adds gs://foo-bucket/myapp/mylib.jar to the DistributedCache and the task classpath.
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
If you want to access files in a bucket other than your CONFIGBUCKET, specify a full path using gs:// instead of hdfs://:
# Copies mylib.jar into gs://other-bucket/myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mylib.jar gs://other-bucket/myapp/mylib.jar
And then in Java:
JobConf job = new JobConf();
// Adds gs://other-bucket/myapp/mylib.jar to the DistributedCache and the task classpath.
DistributedCache.addFileToClassPath(new Path("gs://other-bucket/myapp/mylib.jar"), job);
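To see why the two forms land in different buckets, here is a minimal sketch of the path resolution involved. It uses only java.net.URI from the JDK and is not the connector's or Hadoop's actual code; the class name PathResolutionSketch and the helper qualify are hypothetical, and the default filesystem URI gs://foo-bucket/ is assumed from the CONFIGBUCKET example above.

```java
import java.net.URI;

class PathResolutionSketch {
    // Hypothetical helper: resolves a path string against a default
    // filesystem URI, roughly the way a scheme-less path is qualified.
    static String qualify(String defaultFs, String path) {
        return URI.create(defaultFs).resolve(path).toString();
    }

    public static void main(String[] args) {
        // Assumption: the default filesystem points at the CONFIGBUCKET.
        String defaultFs = "gs://foo-bucket/";

        // A path with no scheme inherits the default filesystem's scheme and bucket.
        System.out.println(qualify(defaultFs, "/myapp/mylib.jar"));
        // prints "gs://foo-bucket/myapp/mylib.jar"

        // A fully qualified path already names its bucket, so it is left unchanged.
        System.out.println(qualify(defaultFs, "gs://other-bucket/myapp/mylib.jar"));
        // prints "gs://other-bucket/myapp/mylib.jar"
    }
}
```

In other words, a bare path such as /myapp/mylib.jar only works because the default filesystem fills in the bucket; any other bucket has to be named explicitly.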