I am very new to Pig and am running into issues with some very basic processing. I need to:

1- Load a tab-separated file using Pig.

2- Filter records based on date. Each line has two columns, col_1 and col_2 (assume both are chararray), and I need only the records where col_1 and col_2 are exactly 1 day apart.

3- Finally, store the filtered records in a Hive table.

Input file (tab-separated):

2016-01-01T16:31:40.000+01:00   2016-01-02T16:31:40.000+01:00
2017-01-01T16:31:40.000+01:00   2017-01-02T16:31:40.000+01:00

When I try

 A = LOAD '/user/inp.txt' USING  PigStorage('\t') as (col_1:chararray,col_2:chararray);

DUMP A; gives the following result:

(,2016-01-03T19:28:58.000+01:00,2016-01-02T16:31:40.000+01:00)
(,2017-01-03T19:28:58.000+01:00,2017-01-02T16:31:40.000+01:00)

I am not sure why. Can someone please help me parse the tab-separated file, convert the chararray columns to dates, and filter on the difference in days?

Thanks

Most likely you have a space in the schema part of the load statement. - VK_217
Thanks, I resolved the issue. There was actually one more field at the beginning, which I had defined as int; I changed it to long and it worked. - Suman Banerjee

1 Answer

Convert the columns to datetime objects using ToDate and pass them to DaysBetween, which returns the difference in days. Filter on a difference of 1, then store the result in Hive.

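-- Load the tab-separated file; both columns hold ISO 8601 timestamp strings.
A = LOAD '/user/inp.txt' USING PigStorage('\t') AS (col_1:chararray, col_2:chararray);
-- Single-argument ToDate parses ISO 8601; DaysBetween returns first argument minus second, so the later date goes first.
B = FOREACH A GENERATE col_1, col_2, DaysBetween(ToDate(col_2), ToDate(col_1)) AS day_diff;
C = FILTER B BY day_diff == 1L;
-- The STORE target is the Hive table name (placeholder); the table must already exist with a matching schema.
STORE C INTO 'your_hive_table' USING org.apache.hive.hcatalog.pig.HCatStorer();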
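
As the comments note, the empty leading field in the DUMP output came from an extra first column in the real file. Pig loads a value it cannot cast to the declared type as null, and DUMP prints null as an empty field. A minimal sketch with that column declared, assuming it is numeric and that the timestamps are ISO 8601 as in the sample input; the name `id` is hypothetical and not taken from the original file:

```pig
-- 'id' is a hypothetical name for the extra leading column.
-- The asker reports that declaring it as long instead of int fixed the empty field.
A = LOAD '/user/inp.txt' USING PigStorage('\t')
    AS (id:long, col_1:chararray, col_2:chararray);

-- Print the schema Pig will apply, to confirm it matches the file layout.
DESCRIBE A;

-- Keep rows where col_2 is exactly one day after col_1.
-- Single-argument ToDate parses ISO 8601 strings; DaysBetween returns
-- first argument minus second, in whole days.
B = FILTER A BY DaysBetween(ToDate(col_2), ToDate(col_1)) == 1L;

DUMP B;
```

Filtering directly on A keeps every original column in the output, so the relation can be checked with DUMP before it is stored in Hive.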