2
votes

Here is my Code :

    claims = LOAD 'Darshan/automobile_insurance_claims.csv' USING  PigStorage(',') AS (claim_id:chararray, policy_master_id:chararray, registration_no:chararray, engine_no:chararray, chassis_no:chararray, customer_id:int, Col6:int,first_name:chararray, last_name:chararray,street:chararray,address:chararray,    city:chararray, zip:long,gender:chararray, claim_date:chararray, garage_city:chararray, bill_no:long, claim_amount:double, garage_name:chararray,claim_status:chararray);  

    grp_all = group claims all; 

    avg = foreach  grp_all generate AVG(claims.Col6); 
    grp = group claims by claim_id;  

    m = foreach grp generate group, ((Col6 IS NULL) ? avg : Col6);   

Results: dump avg; #33.45

------------------------------------------------------------------------------------------------------------------------------------Showing following error while replacing NULL values in Col6(i.e. Age): Caused by:
Invalid scalar projection: avg : A column needs to be projected from a relation for it to be used as a s calar
at org.apache.pig.parser.LogicalPlanGenerator.var_expr(LogicalPlanGenerator.java:10947)
at org.apache.pig.parser.LogicalPlanGenerator.expr(LogicalPlanGenerator.java:10164)
at org.apache.pig.parser.LogicalPlanGenerator.bin_expr(LogicalPlanGenerator.java:11992)
at org.apache.pig.parser.LogicalPlanGenerator.projectable_expr(LogicalPlanGenerator.java:11104)
at org.apache.pig.parser.LogicalPlanGenerator.var_expr(LogicalPlanGenerator.java:10815)
at org.apache.pig.parser.LogicalPlanGenerator.expr(LogicalPlanGenerator.java:10164)
at org.apache.pig.parser.LogicalPlanGenerator.flatten_generated_item(LogicalPlanGenerator.java:7493)
at org.apache.pig.parser.LogicalPlanGenerator.generate_clause(LogicalPlanGenerator.java:17595)
at org.apache.pig.parser.LogicalPlanGenerator.foreach_plan(LogicalPlanGenerator.java:15987)
at org.apache.pig.parser.LogicalPlanGenerator.foreach_clause(LogicalPlanGenerator.java:15854)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1933)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
... 17 more
2016-08-08 05:51:07,297 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
Invalid scalar projection: avg : A column needs to be projected from a relation for it to be used as a s calar .

Line 11 is: m = foreach grp generate group, ((Col6 IS NULL) ? avg : Col6);

3
m = foreach grp generate group, ((claims.Col6 IS NULL) ? avg.$0 : claims.Col6);VK_217
please provide sample data set..Ankur Alankar Biswal
@ankur automobile_insurance_claims.csv : github.com/pradeep-pasupuleti/pig-design-patterns/blob/master/…Darshan

3 Answers

1
votes

Darshan, this doesn't look like logic issue which you are exactly trying. You can replace NULLs by AVG but here the issue is projection of a column.

To solve this please revisit your code once again and you may find that AVG is in different relation and you are accessing it from different relation.

In your code "avg" is a relation not a column, If I'm getting it correctly, After your first group statement where you are generating AVG, generate other columns as well and that's how you will have avg and col6 in same relation.

  1. Load your data
  2. Group your data as per your need
  3. Calculate AVG and generate other columns

If you want you can apply this replace logic in same FOREACH.

Please let me know if you still face any issue.

1
votes

You are getting the error because avg is a relation and you need to use a column in the relation avg.Correct your last PIG statement to refer to the first column in the relation avg,like this

m = foreach grp generate group, ((claims.Col6 IS NULL) ? (double)avg.$0 : claims.Col6);

Alternatively you can name the column and refer to it as well,like this

avg = foreach  grp_all generate AVG(claims.Col6) AS AVG_Col6; 
grp = group claims by claim_id;  
m = foreach grp generate group, ((claims.Col6 IS NULL) ? (double)avg.AVG_Col6 : claims.Col6);  
0
votes

Here is the Final Code for my query:

claims = LOAD 'Darshan/automobile_insurance_claims.csv' USING  PigStorage(',') AS 
         (claim_id:chararray, policy_master_id:chararray, registration_no:chararray, 
         engine_no:chararray, chassis_no:chararray, customer_id:int, Col6:int,
         first_name:chararray, last_name:chararray,street:chararray,address:chararray,
         city:chararray, zip:long,gender:chararray, claim_date:chararray,
         garage_city:chararray, bill_no:long, claim_amount:double,
         garage_name:chararray,claim_status:chararray);  

grp_all = group claims all; 
avg = foreach  grp_all generate AVG(claims.Col6); 
grp = group claims by claim_id; 

result = foreach grp { 
             val = foreach claims generate ((Col6 IS NULL) ? avg.$0 : Col6);
             generate group, val; 
         };

Here is the link to dataset automobile_insurance_claims.csv