Please consider the following Pig script and its data:

search_values = FOREACH raw_search GENERATE
                search_id,
                user_id,
                param_name,
                param_value;

describe search_values;
search_values: {search_id: int,user_id: int,param_name: chararray,param_value: chararray}

dump search_values;
(1, 1, location, San Francisco)
(1, 1, type, Commercial)

There could be multiple records for each search_id/user_id combination; thus, I'm grouping the records later in the code. However, I'm only interested in two specific param_names, 'location' and 'type':

filtered = FILTER search_values by (param_name == 'type' or param_name == 'location');

In theory, there is always a row with 'location' and a row with 'type'; in practice, the 'type' row is sometimes missing, so I need to substitute 'All' for it (later).

I know that the easiest way to do that is to split the data by param_name and then (OUTER) join by the search_id; however, I would like to utilize the power of bags in Pig.

I've tried various approaches using bags, including converting the rows to maps, to no avail:

maps = FOREACH filtered GENERATE search_id, user_id, TOMAP(param_name, param_value) as tomap_values;
group_map = group maps by (search_id, user_id);
grouped = FOREACH group_map GENERATE 
                group.$0 as search_id,
                group.$1 as user_id,
                maps.tomap_values as map_bag;

The problem here is that map_bag is a bag of tuples, each holding a single-entry map, so I am unable to extract values from it using map_bag#'type' or map_bag#'location'.

describe grouped;
grouped: {search_id: int,user_id: int,map_bag: {(tomap_values: map[])}}

If I try something like the following, I'm getting an error message:

mapped = FOREACH grouped
                    GENERATE
                    search_id,
                    user_id,
                    map_bag.tomap_values#'type',
                    map_bag.tomap_values#'location';
ERROR 1052: Cannot cast bag with schema :bag{:tuple(tomap_values:map)} to map with schema :map

The desired outcome should be

(search_id, user_id, type, location)
(1, 1, Commercial, San Francisco)

Any help resolving this would be greatly appreciated!

1 Answer

Try using a FLATTEN to take the map out of the bag.
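
Note that the # operator only works on a map, while map_bag.tomap_values is a bag projection, which is why ERROR 1052 is raised. FLATTEN on its own would also give one row per map rather than one row per search, so the rows would still need to be combined afterwards. As an alternative that stays with bags, here is a minimal sketch using a nested FOREACH that skips the maps entirely. It assumes the search_values relation from the question, and that each search has at most one 'type' row and one 'location' row (MAX is used here only to pull a single chararray out of a bag); the aliases type_rows, loc_rows and result are made up for illustration.

```pig
-- Keep only the two parameters of interest (same filter as in the question).
filtered = FILTER search_values BY (param_name == 'type' OR param_name == 'location');

-- One group per search; the bag named "filtered" holds that search's rows.
by_search = GROUP filtered BY (search_id, user_id);

result = FOREACH by_search {
    -- Split the bag by parameter name inside the nested block.
    type_rows = FILTER filtered BY param_name == 'type';
    loc_rows  = FILTER filtered BY param_name == 'location';
    GENERATE
        group.search_id AS search_id,
        group.user_id   AS user_id,
        -- Fall back to 'All' when the search has no 'type' row.
        (IsEmpty(type_rows) ? 'All' : MAX(type_rows.param_value)) AS type,
        -- MAX over an empty bag yields null, so a missing location stays null.
        MAX(loc_rows.param_value) AS location;
};

DUMP result;
```

If a search could carry several 'type' or 'location' rows, MAX would silently keep only the lexicographically largest value, so a different rule for picking one would be needed in that case.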