I have just started developing with Apache Pig. I have a file stored on HDFS, Measurements.csv, structured as follows:

    1;0x3333333333331091;21.2;67.5;2.1;2.0;12.2;15/04/2014 15:50    
    2;0x3333333333331091;21.2;67.4;2.1;12.0;8.5;15/04/2014 14:22
    3;0x3333333333331091;21.2;67.4;2.1;18.0;7.2;15/04/2014 14:22
    4;0x3333333333331091;21.2;69.5;2.1;19.0;3.2;15/04/2014 14:22
    5;0x3333333333331091;21.2;67.5;2.1;21.0;13.5;15/04/2014 14:22
    6;0x3333333333331091;21.3;69.4;2.1;14.0;15.1;15/04/2014 14:22
    7;0x3333333333331091;21.3;70.4;2.1;19.0;16.7;15/04/2014 14:22
    8;0x3333333333331091;21.2;68.3;2.1;8.0;22.1;15/04/2014 14:22
    9;0x3333333333331091;21.3;67.3;2.1;2.0;11.8;15/04/2014 14:23
    10;0x3333333333331091;21.3;67.4;2.0;32.0;19.1;15/04/2014 14:23

I load it with the command:

Mesure = LOAD 'dataTest/measurements.csv' USING PigStorage(';') as (idSensor:int, address:chararray, temperature:float, humidity:float, voltage:float, locX:float, locY:float, time:chararray );

What I need to do is:

  • Calculate the average temperature of every pair of lines that includes the temperature of the first line.
  • Calculate the average temperature of every group of 3 temperatures that includes the temperature of the first line.
  • Calculate the average temperature of every group of 9 temperatures that includes the temperature of the first line.
  • And finally, calculate the one average composed of all the temperatures in the file.

Can someone show me how to do this with Apache Pig?


After trying hard to answer this myself, I finally obtained the following results.

I was able to generate permutations of temperatures, but what I want is to eliminate the redundancy by listing combinations instead of permutations.

I proceeded as follows: first, I selected only the temperature column, and I did it twice so that I could apply the CROSS operator to the two lists.

TEMP1 = FOREACH Mesure GENERATE temperature as temp1;
TEMP2 = FOREACH Mesure GENERATE temperature as temp2;

After that, I cross them:

Result = CROSS TEMP1,TEMP2;

And it displays something like this:

(21.3,21.3)
(21.3,21.3)
(21.3,21.2)
(21.3,21.3)
(21.3,21.3)
(21.3,21.2)
(21.3,21.2)
(21.3,21.2)
(21.3,21.2)
(21.3,21.2)
(21.3,21.3)
....

This is what I was able to do. Now I am asking: is there a way to remove the duplicated pairs? And once that is done, how do I calculate the average of each pair?

Regarding the 1st point, what do you mean by the average temperature of all pairs? Do you want to group 2 consecutive records together and then find the average, like the average of the 1st and 2nd (21.2), then the 3rd and 4th, and so on? Or do you want the average of all records? Kindly specify. - Suraj Nayak
What I need to do is something like this: average(21.2, 21.2) for the first and second lines, average(21.2, 21.2) for the first and third lines, and so on, until average(21.2, 21.3) for the first and last lines. When I finish calculating the averages of pairs, I start calculating the averages of three values, always including the temperature of the first line, and I keep going until I calculate the average of all the temperature values. - SAAD NADIR
I'm really interested in this question. What do you expect the output to look like? For instance, a file for the 2s, a file for the 3s, etc., or one file? And how do you want the averages laid out: in columns or rows? - gobrewers14

1 Answer

Here is what you can do to get the average for 2 lines:

-- Only idSensor and temperature are needed, so the other fields are left untyped.
Mesure = LOAD 'filename' USING PigStorage(';') AS (idSensor:int, address, temperature:float, h, v, locx, locy, t);

TEMP1 = FOREACH Mesure GENERATE idSensor AS num, temperature AS t;
TEMP2 = FOREACH Mesure GENERATE idSensor AS num, temperature AS t;
B2 = CROSS TEMP1, TEMP2;
-- After CROSS, same-named fields are disambiguated with '::'; the average is the sum divided by 2.
C2 = FOREACH B2 GENERATE TEMP1::num AS firstnum, TEMP2::num AS secondnum, (TEMP1::t + TEMP2::t) / 2 AS avgtemp;
-- Keep each unordered pair once (use < instead of <= to also drop a line paired with itself).
Res2 = FILTER C2 BY firstnum <= secondnum;

Here Res2 holds the average for each pair of lines. For optimization, the filter should be moved up so that it runs before the FOREACH that computes the averages. You can handle more than two lines in the same way.
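
As a rough illustration of the "more than two lines" case, the sketch below always includes the first line, as described in the comments on the question. It assumes that the first field (idSensor) numbers the lines 1..N, as in the sample data, so that the first line is the row with idSensor == 1. The relation names (FirstRow, OtherRows, F, A, B and so on) are made up for this example, and the file path is the one from the question. Treat it as a starting point under those assumptions rather than a tested solution.

```pig
-- Sketch only: assumes idSensor numbers the lines 1..N, so line 1 has idSensor == 1.
Mesure = LOAD 'dataTest/measurements.csv' USING PigStorage(';')
         AS (idSensor:int, address, temperature:float, h, v, locx, locy, t);

-- The first line, plus two independent copies of every other line.
FirstRow  = FILTER Mesure BY idSensor == 1;
OtherRows = FILTER Mesure BY idSensor > 1;
F = FOREACH FirstRow  GENERATE idSensor AS num, temperature AS t;
A = FOREACH OtherRows GENERATE idSensor AS num, temperature AS t;
B = FOREACH OtherRows GENERATE idSensor AS num, temperature AS t;

-- Pairs: the first line with each other line (no duplicates are possible here).
P2   = CROSS F, A;
Avg2 = FOREACH P2 GENERATE F::num AS n1, A::num AS n2,
                           (F::t + A::t) / 2 AS avgtemp;

-- Triples: the first line with two distinct other lines.
-- A::num < B::num keeps each combination once instead of every permutation.
P3   = CROSS F, A, B;
U3   = FILTER P3 BY A::num < B::num;
Avg3 = FOREACH U3 GENERATE F::num AS n1, A::num AS n2, B::num AS n3,
                           (F::t + A::t + B::t) / 3 AS avgtemp;

-- The single average over every temperature in the file.
G      = GROUP Mesure ALL;
AvgAll = FOREACH G GENERATE AVG(Mesure.temperature) AS avgtemp;

DUMP Avg2;
DUMP Avg3;
DUMP AvgAll;
```

The same pattern could be extended to larger groups by adding more copies to the CROSS and chaining the `<` conditions (A::num < B::num AND B::num < C::num, and so on). Keep in mind that CROSS is expensive, and the number of combinations grows quickly with the group size and the number of lines.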

Hope this helps.