1 vote

I am trying to load Google Cloud Storage files into an on-premises Hadoop cluster. I developed a workaround (a program) that downloads the files to a local edge node and then runs DistCp to copy them into Hadoop, but this is a two-step workaround and not very elegant. I have gone through a few websites (link1, link2) that summarize using the Hadoop Cloud Storage connector for this, but that needs infrastructure-level configuration, which is not possible in all cases.

Is there any way to copy files directly from Cloud Storage to Hadoop programmatically, using Python or Java?

You should be able to use DistCp directly from GCS into Hadoop as long as you add GCS credentials to core-site.xml. Otherwise, yes, you can use Spark or the native Hadoop API to copy InputStreams from GCS to OutputStreams on HDFS. – OneCricketeer
@cricket_007 - I think DistCp will require infrastructure-level configuration, and I am not sure it will be permitted. Do you have a sample example using Spark? – Sandeep Singh
Not sure what you mean by "infrastructure configuration"... Any Spark example that reads from GCS will work; you then write out a DataFrame or RDD to HDFS. – OneCricketeer
Would installing the Cloud Storage connector on your Hadoop cluster be an acceptable solution for you? cloud.google.com/dataproc/docs/concepts/connectors/… – Philipp Sh

1 Answer

0 votes

To do this programmatically, you can use the Cloud Storage API client libraries directly to download files from Cloud Storage and save them to HDFS.
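For example, here is a minimal Python sketch that streams objects from a Cloud Storage bucket into HDFS over WebHDFS, using the google-cloud-storage and hdfs (HdfsCLI) packages. The bucket name, prefix, NameNode URL, user, and target path are placeholders for your environment, and authentication is assumed to come from GOOGLE_APPLICATION_CREDENTIALS:

```python
from google.cloud import storage
from hdfs import InsecureClient

# Connect to Cloud Storage (credentials picked up from the environment).
gcs_client = storage.Client()
bucket = gcs_client.bucket("my-bucket")  # placeholder bucket name

# Connect to HDFS over WebHDFS (placeholder NameNode URL and user).
hdfs_client = InsecureClient("http://namenode:9870", user="hadoop")

# Stream every object under a prefix straight into HDFS,
# without staging the files on the local edge node first.
for blob in bucket.list_blobs(prefix="incoming/"):
    if blob.name.endswith("/"):  # skip "directory" placeholder objects
        continue
    target = "/data/gcs/" + blob.name
    with hdfs_client.write(target, overwrite=True) as writer:
        blob.download_to_file(writer)  # GCS blob -> HDFS stream
```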

But it will be much simpler and easier to install the Cloud Storage connector on your on-premises Hadoop cluster and use DistCp to download files from Cloud Storage to HDFS.
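As a rough sketch of that approach (exact property names vary by connector version, and the key file path and bucket name below are placeholders), you register the connector and a service account key in core-site.xml and then run DistCp against a gs:// path:

```xml
<!-- core-site.xml: register the GCS connector and service account auth -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/to/keyfile.json</value>
</property>
```

```
hadoop distcp gs://my-bucket/incoming hdfs:///data/gcs
```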