I am trying to use some Pig functions on the Titanic data set. At one point I narrow it down to passenger class (Pclass) and fare (ticket price).

Here's the code:

sh echo "1. create FarePclass with two fields"
FarePclass   =  FOREACH train GENERATE Pclass,Fare ;
DUMP FarePclass;

sh echo "2. create FareByClass grouping by Pclass"
FareByPclass = GROUP FarePclass BY Pclass ;
--FareByPclass = GROUP FarePclass ALL;
--DUMP FareByPclass;

DESCRIBE FareByPclass;

sh echo "3. get average"
AvgFareByPclass = FOREACH FareByPclass GENERATE (float) SUM(FarePclass.Fare);

Here are some sample rows from the DUMP statement in step #1, followed by the rest of the output:

(2,10.5)
(3,7.05)
(3,29.125)
(2,13)
(1,30)
(3,23.45)
(1,30)
(3,7.75)
2. create FareByClass grouping by Pclass
FareByPclass: {group: chararray,FarePclass: {(Pclass: chararray,Fare: chararray)}}
3. get average
2014-08-28 20:56:23,288 ERROR org.apache.pig.tools.grunt.Grunt: ERROR 1045: 
<file titanic_dypler_datafu.pig, line 36, column 56> Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.

I have this script and I'm trying to get the final line to run:

AvgFareByPclass = FOREACH FareByPclass GENERATE (float) SUM(FarePclass.Fare);

I get this error when trying to run it: Cannot cast bag with schema :bag{:tuple(Fare:chararray)} to float.

Can you suggest how to cast FarePclass.Fare? Am I missing something conceptually about how to go about this?

1 Answer

By the time you call SUM it is too late to convert the chararray Fare values into floating-point numbers: the cast in your last line applies to the result of SUM, not to its input, and the values need to be numeric before they can be summed. The most sensible place to do the conversion is probably the first projection, where FarePclass is created:

FarePclass   =  FOREACH train GENERATE Pclass,(float)Fare ;
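With that change in place, the rest of the script might look roughly like the sketch below. It assumes `train` is already loaded with Pclass and Fare as chararray fields, as the DESCRIBE output suggests. The `AS Fare`, `AS Pclass`, and `AS AvgFare` aliases are my own additions for readability, not something from your script. Since step 3 is labelled "get average", the sketch uses AVG; SUM works the same way if a total is what you want.

```pig
-- Assumption: `train` has been loaded earlier with Pclass and Fare as chararray.
-- Cast Fare to float up front so the grouped bag holds numeric values.
FarePclass = FOREACH train GENERATE Pclass, (float) Fare AS Fare;

-- Group the (Pclass, Fare) tuples by passenger class.
FareByPclass = GROUP FarePclass BY Pclass;

-- Fare is now numeric, so AVG (or SUM) can resolve without a further cast.
AvgFareByPclass = FOREACH FareByPclass GENERATE group AS Pclass, AVG(FarePclass.Fare) AS AvgFare;

DUMP AvgFareByPclass;
```

Running DESCRIBE FareByPclass after the change should show Fare as float rather than chararray inside the bag, which is a quick way to confirm the cast took effect before aggregating.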