0
votes

I want to filter the records of data set A whose flight_delay_time is less than some specific values(x).

But I will get the value of x from another pig query which is a tuple in the sense x is a tuple.

But using the following statement is throwing an error:

B = FILTER A by flight_delay_time < x;
dump B;

The data in file A is in the following way;

ravi,savings,avinash,2,char,33,F,22,44,12,13,33,44,22,11,10,22,26 avinash,current,sandeep,3,char,44,M,33,11,10,12,33,22,39,12,23,19,35 supreeth,savings,prabhash,4,char,55,F,22,12,23,12,44,56,7,88,34,23,68 lavi,current,nirmesh,5,char,33,M,11,10,33,34,56,78,54,23,445,66,77 Venkat,savings,bunny,6,char,11,F,99,12,34,55,33,23,45,66,23,23,28

the value of x = (40) which is stored as a tuple.

the last column in the above data denotes the flight_delay_time.

I am extracting the value of X in the following way.

following is the data stored in C_CONTROL_BATCH.txt

25 35 40 15

I used following code to extract the value of X.

control_batch = LOAD 'C_CONTROL_BATCH.txt' AS (start:int);
variable = ORDER control_batch BY start DESC;
X = LIMIT starttime 1;
1
Is x going to have 1 value? Have you tried FLATTEN(x); - nobody
Yes x is going have only value and i have tried FLATTEN(x) but it did not work any other way to filter those records other than using filter? - ravi theja
Can you post sample data? - nobody
yeah.....i have updated it in the question now. - ravi theja
Do check if the relation denoted by x, contains the value i.e, 40, in the relation B. It may happen that the relation B is not able to get the value of x - hello_abhishek

1 Answers

1
votes

Here is the solution:

INPUT

We have two input files:

  1. airlinesdata.txt - Having the rawdata ravi,savings,avinash,2,char,33,F,22,44,12,13,33,44,22,11,10,22,26 avinash,current,sandeep,3,char,44,M,33,11,10,12,33,22,39,12,23,19,35 supreeth,savings,prabhash,4,char,55,F,22,12,23,12,44,56,7,88,34,23,68 lavi,current,nirmesh,5,char,33,M,11,10,33,34,56,78,54,23,445,66,77 Venkat,savings,bunny,6,char,11,F,99,12,34,55,33,23,45,66,23,23,28

  2. x.txt - Having data from where we get the values of x - 20 30 35 38 37 40 29

flight_delay_time column is last column in below relation and of type int.

Note - If you don't declare it here the program with thrown an exception that it cant cast from byterarray to int when you filter in the end.

rawdata = LOAD 'airlinesdata.txt' USING PigStorage(',') AS (field1:chararray,field2:chararray,field3:chararray,field4:chararray,field5:chararray,field6:chararray,field7:chararray,field8:chararray,field9:chararray,field10:chararray,field11:chararray,field12:chararray,field13:chararray,field14:chararray,field15:chararray,field16:chararray,field17:chararray,flight_delay_time:int);

x_data = LOAD 'x.txt' USING PigStorage() AS (x_val:int);

order_x_data = ORDER x_data BY x_val desc;

max_value = LIMIT order_x_data 1;

Here we are again casting the value to int for the filter condition to work.

max_value_casted = FOREACH max_value GENERATE $0 as (maxval:int);

Finally we can issue the filter query to get the results. Note how the maxval is accessed below by using the . operator from the max_value_casted relation.

output_data = FILTER rawdata BY flight_delay_time < max_value_casted.maxval;

DUMP output_data;

OUTOUT - Values smaller than max value of X (40)

(ravi,savings,avinash,2,char,33,F,22,44,12,13,33,44,22,11,10,22,26) (avinash,current,sandeep,3,char,44,M,33,11,10,12,33,22,39,12,23,19,35) (Venkat,savings,bunny,6,char,11,F,99,12,34,55,33,23,45,66,23,23,28)

Hope it helps :)