I am new to Pig and still exploring efficient ways to do simple things. For example, I have a bag of events:

{"events":[{"event": ev1}, {"event": ev2}, {"event":ev3}, ....]}

I want to collapse that into just a tuple, something like:

{"events":[ev1, ev2, ev3, ....]}

Is there a way to achieve this in Pig? I have been struggling with this for a while, but without much success :(.

Thanks in advance

2
Do you know the number of fields that you need to convert? You should only use a tuple when you know exactly how many fields/columns you will have. - mr2ert
No. The set is going to be dynamic. I just don't want that repeated "event" key anymore, as it serves no particular purpose here. - Nikhil J Joshi
Then you can just project out the event field. - mr2ert
Thanks for the reference. Though I don't think that solution answers my question, as I already have all the events grouped. I just need a mechanism to walk through the bag and strip off the key string from each tuple ("event": Val). - Nikhil J Joshi

2 Answers

0
votes

Looking at your input it seems that your schema is something like:

A: {name:chararray, vals:{(inner_name:chararray, inner_value:chararray)}}

As I mentioned in a comment to your question, actually turning this into an array of nothing but inner_values will be extremely difficult since you don't know how many fields you could potentially have. When you don't know the number of fields you should always try to use a bag in Pig.

Luckily, if you can in fact use a bag for this it is trivial:

-- Project out only inner_value from the bag vals
B = FOREACH A GENERATE name, vals.inner_value ;
0
votes

Thanks, all, for the informative comments. They helped me.

However, I found I had been missing an important feature of a schema: every field has a key and a value (a map). I now get what I wanted with a UDF that converts the bag to a comma-separated string of values:

 package BagCondenser;

 import java.io.IOException;

 import org.apache.pig.EvalFunc;
 import org.apache.pig.data.DataBag;
 import org.apache.pig.data.Tuple;

 /** Joins the first field of every tuple in a bag into one comma-separated string. */
 public class BagToStringOfCommaSeparatedSegments
     extends EvalFunc<String> {

     @Override
     public String exec(Tuple input) throws IOException {

         // Return null for missing input instead of failing with a NullPointerException
         if (input == null || input.size() == 0 || input.get(0) == null)
             return null;

         Object inputObject = input.get(0);

         // Throw an error if the input is not a bag
         if (!(inputObject instanceof DataBag))
             throw new IOException("Expected input to be a bag, but got: "
                  + inputObject.getClass());

         // The input bag
         DataBag bag = (DataBag) inputObject;

         // Comma-separated string to be returned
         StringBuilder listOfSegmentIds = new StringBuilder();

         // Append the first field of each tuple to the output string
         for (Tuple tuple : bag) {
             // If the output already has values, append a ','
             if (listOfSegmentIds.length() > 0)
                 listOfSegmentIds.append(",");

             listOfSegmentIds.append(tuple.get(0).toString());
         }

         return listOfSegmentIds.toString();
     }
 }
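
The joining logic of this UDF can be checked locally without Pig on the classpath. What follows is only a minimal sketch under stated assumptions: plain Java lists stand in for Pig's DataBag and Tuple, and the names `BagJoinSketch` and `joinFirstFields` are hypothetical, not part of Pig or of the UDF above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for the UDF's loop: a "bag" is a List of "tuples",
// and each "tuple" is a List of fields. No Pig classes are used here.
final class BagJoinSketch {

    // Joins the first field of every tuple with commas.
    static String joinFirstFields(List<List<Object>> bag) {
        return bag.stream()
                  .map(tuple -> tuple.get(0).toString())
                  .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        List<List<Object>> bag = Arrays.asList(
            Arrays.<Object>asList("ev1"),
            Arrays.<Object>asList("ev2"),
            Arrays.<Object>asList("ev3"));
        System.out.println(joinFirstFields(bag)); // prints ev1,ev2,ev3
    }
}
```

An empty bag yields an empty string here, which matches what the loop in the UDF produces when the bag has no tuples.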