I wanted to create an external table and load data into it through a Pig script. I followed the approach below:
Ok. Create an external Hive table with a schema layout somewhere in an HDFS directory. Let's say
create external table emp_records(id int,
name String,
city String)
row format delimited
fields terminated by '|'
location '/user/cloudera/outputfiles/usecase1';
Just create the table as above; there is no need to load any file into that directory.
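Because the table is external, Hive simply reads whatever files end up in that location, so a quick check right after creation should come back empty (a minimal sanity check, assuming the table name from the DDL above):

select count(*) from emp_records;
-- returns 0 until files land in /user/cloudera/outputfiles/usecase1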
Now write a Pig script that reads data from some input directory, and when you store the output of that Pig script, do it as below:
-- load the comma-delimited input with an explicit schema
A = LOAD 'inputfile.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
-- keep only the rows of interest
B = FILTER A BY id >= 678933;
C = FOREACH B GENERATE id, name, city;
-- write pipe-delimited output into the external table's location
STORE C INTO '/user/cloudera/outputfiles/usecase1' USING PigStorage('|');
Ensure that the destination location, the delimiter, and the schema layout of the final FOREACH statement in your Pig script match the Hive DDL schema.
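As a quick verification once the Pig job finishes, querying the table from Hive should show the rows Pig just stored (a minimal sketch; the column names are the ones from the DDL above):

select id, name, city from emp_records limit 10;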
My problem is that creating the table also creates the directory in HDFS, so when I then try to store a file using the script, it throws an error saying "folder already exists". It looks like Pig's STORE always needs to write to a directory that does not exist yet?
Is there any way to avoid this issue?
And are there any other options we can use with the STORE command in Pig to write to a specific directory/file every time?
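For context, the only workaround I can think of is to delete the output directory from inside the Pig script right before the STORE, something like the sketch below (assuming a Pig/Hadoop version where fs -rm -r is available; older versions use fs -rmr). I'm not sure this is the recommended way, though:

-- clear the previous run's output so STORE can recreate the directory
-- (-f suppresses the error if the directory does not exist yet)
fs -rm -r -f /user/cloudera/outputfiles/usecase1;
STORE C INTO '/user/cloudera/outputfiles/usecase1' USING PigStorage('|');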
Thanks, Ram