I'm working from the Windows command line, since problems with Unix and firewalls prevent gsutil from working. I can read my Google Cloud Storage files and copy them over to other buckets (which I don't need to do). What I'm wondering is how to download them directly into HDFS (which I'm SSHing into). Has anyone done this? Ideally this is part one; part two is creating Hive tables over the Google Cloud Storage data so we can use HiveQL and Pig.
2 votes

1 Answer

1 vote
You can use the Google Cloud Storage connector, which provides an HDFS-API-compatible interface to data already in Google Cloud Storage, so you don't even need to copy it anywhere: just read from and write directly to your Google Cloud Storage buckets/objects.
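For example, assuming the connector JAR is on the Hadoop classpath and your project/credentials are configured in core-site.xml (bucket and object names below are placeholders), you can address objects directly with gs:// URIs from the cluster:

    # List and read objects in place; no copy into HDFS required
    hadoop fs -ls gs://your-bucket/path/
    hadoop fs -cat gs://your-bucket/path/part-00000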
Once you set up the connector, you can also copy data between HDFS and Google Cloud Storage with the hdfs tool, if necessary.
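As a rough sketch of what that copy could look like once the connector is configured (bucket and HDFS paths below are placeholders):

    # One-off copy of a directory into HDFS
    hdfs dfs -cp gs://your-bucket/data/* /user/yourname/data/

    # Or a parallel, MapReduce-driven copy for larger datasets
    hadoop distcp gs://your-bucket/data /user/yourname/data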
Which upload method do you plan to use: hdfs dfs -put, the WebHDFS REST interface, or some Microsoft contraption? - Samson Scharfrichter
With hdfs dfs or hadoop distcp, a temp file name is used until the upload is complete. Not so with WebHDFS: the file is created under its real name, and if it is larger than 1 block (e.g. 128 MB) then it will be visible to other HDFS clients as soon as the DataNode notifies the NameNode that block #1 is flushed. So it might be detected, and read, while incomplete (especially if your upload link has low bandwidth). - Samson Scharfrichter
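One possible way to get the temp-name behaviour back when using WebHDFS (this is only a sketch; host name, port, and file names below are placeholders, and 50070 assumes the default Hadoop 2.x NameNode HTTP port) is to upload under a temporary name and rename only after the transfer finishes:

    # Step 1: ask the NameNode for a write location; it replies with a redirect to a DataNode
    curl -i -X PUT "http://namenode:50070/webhdfs/v1/data/file.csv._COPYING_?op=CREATE&overwrite=true"

    # Step 2: send the bytes to the DataNode URL returned in the Location header
    curl -i -X PUT -T file.csv "<Location-header-URL>"

    # Step 3: rename to the final name only once the upload is complete
    curl -i -X PUT "http://namenode:50070/webhdfs/v1/data/file.csv._COPYING_?op=RENAME&destination=/data/file.csv"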