1
votes

File content (test.txt):

Some    specific    column      value: x192.168.1.2     blah       blah
Some    specific    row        value: y192.168.1.3      blah       blah
Some    specific    field      value: z192.168.1.4     blah      blah

PIG QUERY:

A = LOAD 'test.txt' USING PigStorage('\t') AS (data1: chararray , data2: chararray , data3: chararray, data4: chararray , data5: chararray , data6: chararray);

B = foreach A generate data3, data4;

C = filter B by data3 matches 'row';

D = foreach C generate data4;

E = foreach D generate TOKENIZE(data4);

Output :

((value:), (y192.168.1.3))

Now I want to extract a specific tuple from this output bag, say the second tuple (y192.168.1.3), and then extract the IP address from it. I am trying to do this with UDFs but got stuck.

3
Pig allows regex matching. Have you tried that? - Ray Toal
Is regex in Pig the same as in Java? - pradeep
I tried this: E = foreach D generate REGEX_EXTRACT(message,'Internet:*') As result; but it throws an error: ERROR 1045: Could not infer the matching function for org.apache.pig.builtin.REGEX_EXTRACT as multiple or none of them fit. Please use an explicit cast. The IP in my message is written as Internet:192.x.x.x - pradeep

3 Answers

3
votes

Here is what I would do.

PIG Script

A = LOAD 'test.txt' USING PigStorage('\t') AS (data1: chararray , data2: chararray , data3: chararray, data4: chararray , data5: chararray , data6: chararray);
B = foreach A generate data3, data4;
C = filter B by data3 matches 'row';
D = foreach C generate data4;
E = foreach D generate REGEX_EXTRACT($0,'value: .([0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+).*', 1);

Output

(192.168.1.3)

If needed, you can use a more elaborate regexp to extract the IP address; see "Extract ip addresses from Strings using regex".
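Pig regular expressions use the Java regex format, so the pattern can be tried out in plain Java before it goes into the script. The following is only a minimal sketch: the class name IpExtractDemo and the method extractIp are made up for illustration, and the input string is assumed to be the content of data4 for the 'row' line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IpExtractDemo {
    // Same pattern as in the REGEX_EXTRACT call: "value: ", one arbitrary
    // character, then a dotted quad captured as group 1.
    private static final Pattern IP =
        Pattern.compile("value: .([0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+).*");

    // Hypothetical helper: returns capture group 1, or null when the
    // field is null or the pattern is not found.
    public static String extractIp(String field) {
        if (field == null) {
            return null;
        }
        Matcher m = IP.matcher(field);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractIp("value: y192.168.1.3")); // prints 192.168.1.3
        System.out.println(extractIp("no address here"));     // prints null
    }
}
```

The third argument of REGEX_EXTRACT plays the same role as the index passed to group() here.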

3
votes

You could use the FLATTEN operator to flatten the bag and then use a filter to keep only the token that contains the IP address.

E = foreach C generate flatten(TOKENIZE(data4));
F = filter E by $0 matches '.\\d+\\.\\d+\\.\\d+\\.\\d+';

Hope this helps
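To see what this filter keeps, here is a rough Java sketch of the same idea. It assumes the field is split on whitespace only (TOKENIZE also splits on a few other delimiters), and the names TokenFilterDemo and ipLikeTokens are invented for the example. Like Pig's matches, String.matches must match the whole token, so the token survives with its leading character still attached.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenFilterDemo {
    // Hypothetical helper: split the field into tokens and keep those that
    // consist of one arbitrary character followed by a dotted quad.
    public static List<String> ipLikeTokens(String field) {
        List<String> kept = new ArrayList<>();
        for (String token : field.trim().split("\\s+")) {
            // matches() requires the whole token to fit the pattern
            if (token.matches(".\\d+\\.\\d+\\.\\d+\\.\\d+")) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> kept = ipLikeTokens("value: y192.168.1.3");
        System.out.println(kept);                       // prints [y192.168.1.3]
        // Dropping the first character leaves the bare address.
        System.out.println(kept.get(0).substring(1));   // prints 192.168.1.3
    }
}
```

In the Pig script, the leading character would still have to be removed in a later step, for example with REGEX_EXTRACT or SUBSTRING.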

1
votes
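import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class someClass extends EvalFunc<String>
{
   private static final Pattern IP = Pattern.compile("\\d+\\.\\d+\\.\\d+\\.\\d+");

   public String exec(Tuple input) throws IOException {
     DataBag bag = (DataBag)input.get(0);
     Iterator<Tuple> it = bag.iterator();
     Tuple tup = null;
     // advance to the second tuple in the bag
     for(int i = 0; i < 2; i++)
     {
       tup = it.next();
     }
     String ipString = (String)tup.get(0);
     // take the first dotted-quad substring found in the token
     Matcher m = IP.matcher(ipString);
     String ip = m.find() ? m.group() : null;
     return ip;
   }
 }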

Of course, you should add some input checks (null input, a bag with only one tuple, etc.) and harden the code.
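One possible shape for those checks, sketched in plain Java so it runs without Pig on the classpath. The class SafeNth and the method nthOrNull are hypothetical names; since a Pig DataBag can be iterated as tuples, the same loop should carry over to the UDF, but that wiring is left out here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class SafeNth {
    // Returns the element at zero-based position index, or null if the
    // input is null, the index is negative, or there are too few elements.
    public static <T> T nthOrNull(Iterable<T> items, int index) {
        if (items == null || index < 0) {
            return null;
        }
        Iterator<T> it = items.iterator();
        T current = null;
        for (int i = 0; i <= index; i++) {
            if (!it.hasNext()) {
                return null; // ran out of elements before reaching index
            }
            current = it.next();
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("value:", "y192.168.1.3");
        System.out.println(nthOrNull(tokens, 1));                               // prints y192.168.1.3
        System.out.println(nthOrNull(Collections.singletonList("value:"), 1)); // prints null
    }
}
```

Returning null instead of throwing lets the UDF emit a null field for malformed rows rather than failing the whole job; whether that is the right trade-off depends on the data.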