I'm trying to determine, in Pig Latin, whether an int value from a tuple in one relation appears in a column of another relation. I'm new to Pig Latin and finding it difficult to wrap my mind around the framework.
At the moment I have two tables: one maps ids to tags drawn from a small domain of values, and the other contains tuples of an id and a tag id referring to the first table.
Here's tags.csv:
id, tag
1597, x
999, y
787, a
812, x
And orders.csv:
id, tag_id
11, 55
99, 812
22, 787
I need a way to work out, for each tuple in the orders table, whether its tag_id is one of the ids in the tags table whose tag is x. The desired output is:
id, has_x
11, 0
99, 1
22, 0
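To pin down the semantics I'm after, here is a rough plain-Python sketch of the same check on the sample rows. It is only an illustration, not Pig code: the function name flag_orders is made up for this example, and the row layouts follow the schemas in my load statements below, i.e. tags are (id, tag) and orders are (id, tag_id).

```python
# Sample rows from above, laid out as in the load statements.
tags = [(1597, "x"), (999, "y"), (787, "a"), (812, "x")]  # (id, tag)
orders = [(11, 55), (99, 812), (22, 787)]                 # (id, tag_id)


def flag_orders(orders, tags, wanted_tag):
    """Return (order_id, flag) pairs.

    flag is 1 when the order's tag_id is the id of a tag whose value
    equals wanted_tag, and 0 otherwise (including unknown tag ids).
    """
    # Collect the ids of all tags carrying the wanted value.
    wanted_ids = {tag_id for tag_id, tag in tags if tag == wanted_tag}
    # Flag each order by set membership of its tag_id.
    return [(order_id, 1 if tag_id in wanted_ids else 0)
            for order_id, tag_id in orders]


result = flag_orders(orders, tags, "x")
for order_id, has_x in result:
    print(order_id, has_x)
```

In other words, each order is flagged by a membership test against the set of x-tagged ids, which is what I'm trying to express in Pig.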
This is what I have so far:
register 's3://bucket/jython_task.py' using jython as tasks;
tags = load 's3://bucket/tags.csv' USING PigStorage(',') AS (id: long, tag: chararray);
orders = load 's3://bucket/orders.csv' USING PigStorage(',') AS (id: long, tag_id: long);
tags = filter tags by tag == 'x';
x_cases = foreach tags generate id;
tagged_orders = foreach orders generate id, tag_id, tasks.check_membership(tag_id, x_cases.id) as is_x:int;
and the udf:
def check_membership(instance, value_list):
    if instance != None:
        for value in value_list:
            if instance == value[0]:
                return 1
    return 0
I get the error:
2012-09-20 23:53:45,377 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (7995), 2nd :(8028)
What am I doing wrong? Is there a better way to do what I'm trying to do?