I have just started developing with Apache Pig. I have a file, Measurements.csv, stored on HDFS and structured as follows:
1;0x3333333333331091;21.2;67.5;2.1;2.0;12.2;15/04/2014 15:50
2;0x3333333333331091;21.2;67.4;2.1;12.0;8.5;15/04/2014 14:22
3;0x3333333333331091;21.2;67.4;2.1;18.0;7.2;15/04/2014 14:22
4;0x3333333333331091;21.2;69.5;2.1;19.0;3.2;15/04/2014 14:22
5;0x3333333333331091;21.2;67.5;2.1;21.0;13.5;15/04/2014 14:22
6;0x3333333333331091;21.3;69.4;2.1;14.0;15.1;15/04/2014 14:22
7;0x3333333333331091;21.3;70.4;2.1;19.0;16.7;15/04/2014 14:22
8;0x3333333333331091;21.2;68.3;2.1;8.0;22.1;15/04/2014 14:22
9;0x3333333333331091;21.3;67.3;2.1;2.0;11.8;15/04/2014 14:23
10;0x3333333333331091;21.3;67.4;2.0;32.0;19.1;15/04/2014 14:23
I load it with the command:
Mesure = LOAD 'dataTest/measurements.csv' USING PigStorage(';') as (idSensor:int, address:chararray, temperature:float, humidity:float, voltage:float, locX:float, locY:float, time:chararray );
What I need to do is:
- Calculate the average temperature of every pair that includes the temperature of the first line.
- Calculate the average temperature of every group of 3 temperatures that includes the temperature of the first line.
- Calculate the average temperature of every group of 9 temperatures that includes the temperature of the first line.
- And finally, calculate the single average over all the temperatures in the file.
Can someone show me how to do this with Apache Pig?
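For the last item, a minimal sketch of what seems to be the usual pattern is to group the whole relation with GROUP ... ALL and apply the built-in AVG. The alias names AllRows and GlobalAvg are only illustrative, and the path is the one from my LOAD statement:

```pig
-- Same LOAD as above, repeated so the snippet stands alone
Mesure = LOAD 'dataTest/measurements.csv' USING PigStorage(';')
         AS (idSensor:int, address:chararray, temperature:float, humidity:float,
             voltage:float, locX:float, locY:float, time:chararray);

-- GROUP ... ALL puts every row into a single group
AllRows = GROUP Mesure ALL;

-- AVG over the bag of temperatures gives one value for the whole file
GlobalAvg = FOREACH AllRows GENERATE AVG(Mesure.temperature) AS avgTemp;

DUMP GlobalAvg;
```

This only covers the overall average; the pairs and the groups of 3 and 9 are the part I am stuck on.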
After trying hard to answer this myself, I finally obtained the following results.
I was able to generate the permutations of temperatures, but what I want is to eliminate the redundancy by listing combinations instead of permutations.
I proceeded as follows: first I selected only the list of temperatures, and I did it twice so that I could apply the CROSS operator to the two lists.
TEMP1 = FOREACH Mesure GENERATE temperature as temp1;
TEMP2 = FOREACH Mesure GENERATE temperature as temp2;
After that I cross them:
Result = CROSS TEMP1,TEMP2;
And it displays something like this:
(21.3,21.3)
(21.3,21.3)
(21.3,21.2)
(21.3,21.3)
(21.3,21.3)
(21.3,21.2)
(21.3,21.2)
(21.3,21.2)
(21.3,21.2)
(21.3,21.2)
(21.3,21.3)
....
This is what I was able to do. Now I am asking: is there something that can remove the duplicated pairs? And once that is done, how do I calculate the average of each pair?
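One direction that might work, sketched below, is to carry the row id along with each temperature and keep only the pairs whose first id is smaller than the second. This assumes that the first column (idSensor) is unique per line, which holds for the sample above but may not hold in general. The aliases T1, T2, Pairs, Combos, FirstLine and PairAvg are names I made up for the illustration:

```pig
-- Same LOAD as above, repeated so the snippet stands alone
Mesure = LOAD 'dataTest/measurements.csv' USING PigStorage(';')
         AS (idSensor:int, address:chararray, temperature:float, humidity:float,
             voltage:float, locX:float, locY:float, time:chararray);

-- Keep the row id next to each temperature so rows with equal
-- temperatures can still be told apart (assumes idSensor is unique per line)
T1 = FOREACH Mesure GENERATE idSensor AS id1, temperature AS temp1;
T2 = FOREACH Mesure GENERATE idSensor AS id2, temperature AS temp2;

-- All ordered pairs, including (a,b), (b,a) and (a,a)
Pairs = CROSS T1, T2;

-- id1 < id2 keeps each unordered pair exactly once and drops self-pairs
Combos = FILTER Pairs BY id1 < id2;

-- Variant: only the pairs that contain the first line (id 1)
-- FirstLine = FILTER Pairs BY id1 == 1 AND id2 > 1;

-- Average of the two temperatures in each remaining pair
PairAvg = FOREACH Combos GENERATE id1, id2, (temp1 + temp2) / 2 AS avgTemp;

DUMP PairAvg;
```

The same idea would presumably extend to groups of 3 by crossing a third copy and filtering on id1 < id2 AND id2 < id3, but CROSS grows very quickly, so I doubt it is practical for groups of 9.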