
I wanted to create an external table and load data into it through a Pig script. I followed the approach below:


OK. Create an external Hive table with a schema layout somewhere in an HDFS directory. Let's say:

create external table emp_records(id int,
                              name String,
                              city String)
                              row format delimited 
                              fields terminated by '|'
                              location '/user/cloudera/outputfiles/usecase1';

Just create the table as above; there is no need to load any file into that directory.
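
If you want to confirm where the table points before running the Pig job, you can describe it from the Hive shell (an optional check; emp_records is the table created above):

DESCRIBE FORMATTED emp_records;

The Location field in the output should show /user/cloudera/outputfiles/usecase1.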

Now write a Pig script that reads data from some input directory; then, when you store the output of that Pig script, use something like the following:

A = LOAD 'inputfile.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
B = FILTER A BY id >= 678933;
C = FOREACH B GENERATE id, name, city;
STORE C INTO '/user/cloudera/outputfiles/usecase1' USING PigStorage('|');

Ensure that the destination location, the delimiter, and the schema layout of the final FOREACH statement in your Pig script match the Hive DDL schema.


My problem is that when I first created the table, Hive created a directory in HDFS, and when I then tried to store a file there using the script, it threw an error saying "folder already exists". It looks like Pig's STORE always needs to create the output directory itself at that specific path.

Is there any way to avoid this issue?

And are there any other attributes we can use with the STORE command in Pig to write to a specific directory/file every time?
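
For reference, one workaround I considered is deleting the output directory from inside the Pig script before the STORE, using Pig's fs command (a sketch, reusing the same paths as my script above), but I am not sure it is the intended approach:

-- delete the existing output directory so STORE can recreate it
fs -rm -r -f /user/cloudera/outputfiles/usecase1

A = LOAD 'inputfile.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
B = FILTER A BY id >= 678933;
C = FOREACH B GENERATE id, name, city;
STORE C INTO '/user/cloudera/outputfiles/usecase1' USING PigStorage('|');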

Thanks, Ram


1 Answer


Yes, you can use HCatalog to achieve this.

Remember that you have to run your Pig script like this:

pig -useHCatalog your_pig_script.pig

or, if you are using the Grunt shell, simply use:

pig -useHCatalog

Next is your STORE command. To store your relation directly into the Hive table, use:

STORE C INTO 'HIVE_DATABASE.EXTERNAL_TABLE_NAME' USING org.apache.hive.hcatalog.pig.HCatStorer();
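
Putting it together, a minimal sketch of the full script (the database name default is an assumption here; substitute your own, and the relation's column names and types must match the Hive table's columns):

-- run with: pig -useHCatalog your_pig_script.pig
A = LOAD 'inputfile.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
B = FILTER A BY id >= 678933;
C = FOREACH B GENERATE id, name, city;
-- HCatStorer writes through the table's own delimiter/SerDe, so no PigStorage('|') is needed here
STORE C INTO 'default.emp_records' USING org.apache.hive.hcatalog.pig.HCatStorer();

Because HCatStorer writes into the table's existing location, you do not hit the "folder already exists" error. (Depending on your Hive/HCatalog version, appending to a non-partitioned table that already contains data may be rejected, so check that before re-running the job.)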