0 votes

In an Oozie Hive2 action, I am trying to load a Hive table from '.csv' files contained in a compressed '.zip' file. To read the files inside the *.zip through the Oozie Hive action workflow, the Hive action provides an 'archive' tag element. I just need to declare the zip file in the 'archive' element as below:

<archive>${ZipfilePath}#unzipFile</archive>

The reference after '#' in the 'archive' element is the name of the temporary folder from which the unzipped files can be read. The .csv files inside the .zip can then be read by referring to paths under 'unzipFile/' (e.g. 'unzipFile/file.csv').
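For context, here is a trimmed-down sketch of the hive2 action; the JDBC URL and script name are placeholders, not the actual values:

<action name="hive2-load">
    <hive2 xmlns="uri:oozie:hive2-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <jdbc-url>${jdbcURL}</jdbc-url>
        <script>load_table.hql</script>
        <!-- ${ZipfilePath} points to the .zip in HDFS; #unzipFile names the unpacked folder/symlink -->
        <archive>${ZipfilePath}#unzipFile</archive>
    </hive2>
    <ok to="end"/>
    <error to="fail"/>
</action>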

The issue is that the Hive action is unable to find the path referred to in the 'archive' element. By default, Hive looks for the unzip folder under the "hdfs://nameservice1/user/hive/" location and fails with:

"Error: Error while compiling statement: FAILED: SemanticException Line     1:17 Invalid path ''unzipFile/file.csv'': No files matching path hdfs://nameservice1/user/hive/unzipFile/file.csv (state=42000,code=40000"

However, I was able to successfully test the 'archive' tag using a Shell action and 'cat' the file:

cat unzipFile/file.csv
Oh my. The Oozie <archive> instruction works like the Hadoop command-line -archives option or the Hive ADD ARCHIVE command. It's meant to ship packaged libraries and/or configuration files, not data files. – Samson Scharfrichter

Hadoop does not support ZIP for data files, because it's primarily an archive format, with many files packaged in the same ZIP, and that shatters the whole MapReduce paradigm. So you must unzip your stuff before loading it into HDFS (note that you can gzip individual files; the .gz extension is recognized automatically). – Samson Scharfrichter

Thanks for your response. Just because the Oozie Shell action was able to read the *.csv from an archive, I was convinced it should work the same way for Hive. – SaRu

2 Answers

0 votes

Since the Oozie Hive action runs in the cluster and not on the edge node, all supporting files need to be on an HDFS path; the Hive action itself will run on whichever node Oozie selects at runtime. Upload the file to an HDFS path so that it is accessible from any node in the cluster.
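For example (the local and HDFS paths below are placeholders):

hdfs dfs -mkdir -p /user/myuser/oozie/app
hdfs dfs -put /local/path/file.zip /user/myuser/oozie/app/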

0 votes

The Shell action copies the file locally into the container that runs the script, which is why it works there.

Hive2:

LOAD DATA [LOCAL] INPATH

Once the file has been moved into the container, you have to use LOCAL.
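A minimal illustration of what this answer suggests (the table name my_table is a placeholder):

-- 'unzipFile' is the folder/symlink created by the #unzipFile fragment of the archive element
LOAD DATA LOCAL INPATH 'unzipFile/file.csv' INTO TABLE my_table;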