Please consider the following Pig relation:
search_values = FOREACH raw_search GENERATE
    search_id,
    user_id,
    param_name,
    param_value;
describe search_values;
search_values: {search_id: int,user_id: int,param_name: chararray,param_value: chararray}
dump search_values;
(1, 1, location, San Francisco)
(1, 1, type, Commercial)
There can be multiple records for each search_id/user_id combination, so I group the records later in the code. However, I'm only interested in two specific param_name values, 'location' and 'type':
filtered = FILTER search_values by (param_name == 'type' or param_name == 'location');
In theory, every search has both a 'location' row and a 'type' row. In practice, the 'type' row is sometimes missing, and in that case I need to substitute 'All' for it (later).
I know that the easiest way to do that is to split the data by param_name and then (OUTER) join on search_id; however, I would like to use the power of bags in Pig.
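For reference, the split-and-join approach I would like to avoid might look roughly like the sketch below. The aliases locations, types, joined, and result are illustrative names that are not in my script, and joining on both search_id and user_id is an assumption made to match the grouping key used later.

```pig
-- Split the relation by param_name (illustrative aliases).
locations = FILTER search_values BY param_name == 'location';
types     = FILTER search_values BY param_name == 'type';

-- Keep every 'location' row, even when no matching 'type' row exists.
joined = JOIN locations BY (search_id, user_id) LEFT OUTER,
              types     BY (search_id, user_id);

-- Substitute 'All' when the 'type' side of the join is null.
result = FOREACH joined GENERATE
    locations::search_id AS search_id,
    locations::user_id   AS user_id,
    (types::param_value IS NULL ? 'All' : types::param_value) AS type,
    locations::param_value AS location;
```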
I've tried various approaches using bags, including converting the name/value pairs to maps, to no avail:
maps = FOREACH filtered GENERATE search_id, user_id, TOMAP(param_name, param_value) as tomap_values;
group_map = group maps by (search_id, user_id);
grouped = FOREACH group_map GENERATE
    group.$0 as search_id,
    group.$1 as user_id,
    maps.tomap_values as map_bag;
The problem here is that map_bag is a bag containing maps rather than a single map, so I am unable to extract values from it using map_bag#'type' or map_bag#'location'.
describe grouped;
{search_id: int,user_id: int,map_bag: {(tomap_values: map[])}}
If I try something like the following, I'm getting an error message:
mapped = FOREACH grouped GENERATE
    search_id,
    user_id,
    map_bag.tomap_values#'type',
    map_bag.tomap_values#'location';
ERROR 1052: Cannot cast bag with schema :bag{:tuple(tomap_values:map)} to map with schema :map
The desired outcome is:
(search_id, user_id, type, location)
(1, 1, Commercial, San Francisco)
Any help resolving this would be greatly appreciated!