2 votes

I have data in Azure Blob Storage in Parquet format. I need to transfer all of those files to HDFS. Is there any way I can do that?

I couldn't find any helpful method to do it.

Thanks.


2 Answers

1 vote

Using @jay's solution, I was able to transfer the data with the following command.

command:

hadoop distcp -D fs.azure.account.key.<account name>.blob.core.windows.net=<Key> wasb://<container>@<account>.blob.core.windows.net/<path to wasb file> hdfs://<hdfs path>

distcp copies the directory structure recursively; for more info, read this link.
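
For example, with a hypothetical storage account mystorage, container data, and key MYKEY (placeholders, not values from the original post), copying a folder of Parquet files might look like:

hadoop distcp -D fs.azure.account.key.mystorage.blob.core.windows.net=MYKEY wasb://data@mystorage.blob.core.windows.net/parquet/ hdfs://namenode:8020/user/me/parquet/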

0 votes

Based on the statements in this link, in Hadoop an entire file system hierarchy is actually stored in a single container.

You could configure your account key as below (typically in core-site.xml):

<property>
  <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
  <value>YOUR ACCESS KEY</value>
</property>


Then all you need to do is copy the files into the configured container with AzCopy.
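
As a rough sketch, assuming the AzCopy v10 syntax and hypothetical account, container, and SAS token placeholders, a server-side copy from the source container into the configured container could look like:

azcopy copy "https://youraccount.blob.core.windows.net/sourcecontainer/parquet?<SAS>" "https://youraccount.blob.core.windows.net/configuredcontainer/parquet?<SAS>" --recursive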

For more details, please refer to this document.


Updated answer:

Here is a solution for you:

1. Install blobfuse on your VM to provide a virtual filesystem backed by your Azure Blob Storage container.

2. Then copy the files from the mounted container directly into HDFS (see the sketch below).
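
A minimal sketch of these two steps, assuming blobfuse is already installed and a connection config file has been created; the mount point, file names, and HDFS paths are hypothetical, and hdfs dfs -put is used here (rather than plain cp) to push the mounted files into HDFS, since the destination is not a local path:

# fuse_connection.cfg (hypothetical) contains:
#   accountName youraccount
#   accountKey YOUR ACCESS KEY
#   containerName yourcontainer
mkdir -p /mnt/blobcontainer /mnt/blobfusetmp
blobfuse /mnt/blobcontainer --tmp-path=/mnt/blobfusetmp --config-file=fuse_connection.cfg

# The mounted container now looks like a local directory; push the Parquet files into HDFS.
hdfs dfs -mkdir -p /data/parquet
hdfs dfs -put /mnt/blobcontainer/*.parquet /data/parquet/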

In addition, you could write a snippet of Java code to grab data from Azure Blob Storage and dump it into HDFS.
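
A minimal sketch of such a snippet with the Hadoop FileSystem API, assuming the hadoop-azure (wasb) module is on the classpath; the account, key, container, and paths are hypothetical placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class BlobToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same account key you would pass to distcp with -D
        conf.set("fs.azure.account.key.youraccount.blob.core.windows.net", "YOUR ACCESS KEY");

        Path src = new Path("wasb://yourcontainer@youraccount.blob.core.windows.net/data/file.parquet");
        Path dst = new Path("hdfs://namenode:8020/data/file.parquet");

        FileSystem srcFs = src.getFileSystem(conf);
        FileSystem dstFs = dst.getFileSystem(conf);

        // Copy the blob into HDFS (deleteSource = false, overwrite = true)
        FileUtil.copy(srcFs, src, dstFs, dst, false, true, conf);
    }
}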

To summarize, please use this command:

hadoop distcp -D fs.azure.account.key.<account name>.blob.core.windows.net=<Key> wasb://<container>@<account>.blob.core.windows.net/<path to wasb file> hdfs://<hdfs path>

distcp copies the directory structure recursively; for more info, read this link.