2 votes

I'm working on the Windows command line, since problems with Unix and firewalls prevent gsutil from working there. I can read my Google Cloud Storage files and copy them over to other buckets (which I don't need to do). What I'm wondering is how to download them directly into HDFS (which I'm SSH-ing into). Has anyone done this? Ideally this is part one; part two is creating Hive tables over the Google Cloud Storage data so we can use HiveQL and Pig.

1
Forgot to mention, I can download them straight into C:\Users\Me ... and then upload to HDFS, but I'd rather tighten the process by going straight to HDFS. Thanks - sadiemac
On a Linux box you could pipe the GS download into the HDFS upload on a single command line (see the sketch after these comments). On Windows, though... - Samson Scharfrichter
BTW, what do you mean exactly about "ssh-ing into HDFS"?!? Are you using a command line such as hdfs dfs -put, the WebHDFS REST interface, or some Microsoft contraption? - Samson Scharfrichter
Yep, using Unix to access Hadoop, then hdfs dfs commands. Completely unable to connect through gsutil on Unix though, so the middle integration bit has me stumped. Do you know an angry engineer who might have a solution? - sadiemac
Ah, I forgot a caveat: when you upload a file with hdfs dfs or hadoop distcp, a temp file name is used until the upload is complete. Not so with WebHDFS: the file is created under its real name, and if it is larger than one block (e.g. 128 MB) then it will be visible to other HDFS clients as soon as the DataNode notifies the NameNode that block #1 is flushed. So it might be detected, and read, while incomplete (especially if your upload link has low bandwidth). - Samson Scharfrichter
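A minimal sketch of that pipe, with placeholder bucket, host, and path names. gsutil can stream an object to stdout when the destination is "-", and hdfs dfs -put reads stdin when the source is "-"; the WebHDFS variant applies the temp-name-then-rename pattern by hand, since WebHDFS won't do it for you:

    # Stream a GCS object straight into HDFS without touching local disk.
    gsutil cp gs://my-bucket/data/events.csv - | hdfs dfs -put - /user/me/events.csv

    # Via WebHDFS instead: upload under a dot-prefixed temporary name, then
    # rename once complete, so readers never see a half-written file.
    # Port 50070 on Hadoop 2.x (9870 on 3.x); -L follows the 307 redirect
    # to the DataNode; user.name only works on non-secured clusters.
    gsutil cp gs://my-bucket/data/events.csv /tmp/events.csv
    curl -L -X PUT -T /tmp/events.csv \
      "http://namenode:50070/webhdfs/v1/user/me/.events.csv.tmp?op=CREATE&user.name=me"
    curl -X PUT \
      "http://namenode:50070/webhdfs/v1/user/me/.events.csv.tmp?op=RENAME&destination=/user/me/events.csv&user.name=me"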

1 Answer

1 vote

You can use the Google Cloud Storage connector, which provides an HDFS-compatible interface to data already in Google Cloud Storage, so you don't even need to copy it anywhere: just read from and write directly to your Google Cloud Storage buckets/objects.
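As a rough sketch of what that looks like once the connector is set up (the connector jar on the Hadoop classpath, and core-site.xml mapping fs.gs.impl to com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem plus your project and credentials, per the connector's own docs; the bucket below is a placeholder), gs:// paths behave like any other Hadoop filesystem:

    # Browse and read GCS objects through the ordinary Hadoop CLI.
    hadoop fs -ls gs://my-bucket/
    hadoop fs -cat gs://my-bucket/data/events.csv | head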

Once you set up the connector, you can also copy data between HDFS and Google Cloud Storage with the standard hdfs/hadoop command-line tools, if necessary.
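For bulk copies, hadoop distcp runs across the two filesystems; a sketch with placeholder paths:

    # Bulk-copy a GCS directory into HDFS (swap the arguments to go back).
    hadoop distcp gs://my-bucket/data hdfs:///user/me/data

And for the question's part two, the same connector lets Hive point an external table straight at the bucket, so HiveQL (and Pig) can query the data in place. A sketch, assuming a hypothetical comma-delimited layout:

    -- Hypothetical schema; adjust columns and delimiter to your files.
    CREATE EXTERNAL TABLE events (
      id BIGINT,
      ts STRING,
      payload STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'gs://my-bucket/data/';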