2
votes

I am having trouble adding a folder's contents to Hive's distributed cache. I can successfully add multiple files to the distributed cache in Hive using:

ADD FILE /folder/file1.ext;
ADD FILE /folder/file2.ext;
ADD FILE /folder/file3.ext;
etc.

I also see that there is an ADD FILES (plural) option, which to my mind means you could specify a directory, like ADD FILES /folder/; and have everything in the folder included (this works with the Hadoop Streaming -files option). But this does not work with Hive. Right now I have to add each file explicitly.

Am I doing this wrong? Is there a way to add a whole folder's contents to the distributed cache?

P.S. I tried wildcards (ADD FILE /folder/* and ADD FILES /folder/*), but those fail too.

Edit:

As of Hive 0.11 this is now supported, so:

ADD FILE /folder;

now works.

What I do now is pass the folder location to the Hive script as a parameter:

$ hive -f my-query.hql -hiveconf folder=/folder

and in the my-query.hql file:

ADD FILE ${hiveconf:folder};

Nice and tidy now!

2 Answers

3
votes

ADD doesn't support directories, but as a workaround you can zip the files and then add the zip to the distributed cache as an archive (ADD ARCHIVE my.zip). When the job runs, the contents of the archive are unpacked into the local job directory on the slave nodes (see the mapred.job.classpath.archives property).
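As a rough sketch of the zip step, something like the following could build the archive and print the matching statement. The function name zip_for_hive and its arguments are made up for illustration, and it assumes the zip command-line tool is installed and that the archive path is absolute:

```shell
#!/bin/bash
# Hypothetical helper (not part of Hive): zip a directory's contents and
# print the ADD ARCHIVE statement for the resulting file.
# Assumes the `zip` tool is installed and that $2 is an absolute path.
zip_for_hive() {
  local dir="$1" archive="$2"
  # Zip from inside the directory so entries are stored with relative paths.
  (cd "$dir" && zip -qr "$archive" .) || return 1
  printf 'ADD ARCHIVE %s;\n' "$archive"
}
```

For example, zip_for_hive /folder /tmp/my.zip would print ADD ARCHIVE /tmp/my.zip; which can then be pasted into the Hive session or a script.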

If the number of files you want to pass is relatively small and you don't want to deal with archives, you can also write a small script that generates the ADD FILE command for every file in a given directory, e.g.:

#!/bin/bash
#list.sh

if [ -z "$1" ]
then
  echo "Directory is missing!"
  exit 1
fi

# Print one ADD FILE statement per entry in the directory (quoted so paths with spaces survive).
for f in "$1"/*; do echo "ADD FILE $f;"; done

Then invoke it from the Hive shell and execute the generated output:

!/home/user/list.sh /path/to/files
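If you would rather not copy the printed statements by hand, one option is to write them to a file and run that file. The sketch below is a variant of the script above; the function name gen_add_files and the /tmp/add_files.hql path are only illustrative, and it skips subdirectories on the assumption that only plain files are wanted:

```shell
#!/bin/bash
# Hypothetical variant of list.sh: print one ADD FILE statement per
# regular file in the given directory (subdirectories are skipped).
gen_add_files() {
  local dir="${1%/}"   # drop a trailing slash, if any
  local f
  for f in "$dir"/*; do
    [ -f "$f" ] && printf 'ADD FILE %s;\n' "$f"
  done
  return 0
}

# Example: gen_add_files /path/to/files > /tmp/add_files.hql
```

The generated file can then be run from the Hive shell with SOURCE /tmp/add_files.hql; or passed as an initialization file with hive -i /tmp/add_files.hql -f my-query.hql.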
0
votes

Well, in my case, I had to move a folder with child folders and files in it.

I used ADD ARCHIVE xxx.gz, which added the file but did not unpack (unzip) it on the slave machines.

Instead, ADD FILE <folder_name_without_trailing_slash> actually copies the whole folder recursively to the slaves.

Courtesy: the comments, which helped with debugging.

Hope this helps!