0
votes

I'm trying to parse a raw log file using Pig.

Here's a sample of the data:

Thu Jul 13 06:02:36.157 2014 INFO:  pid 018 8: 2:81924:=[]| A=[100]| B=[]| C=[0] | D=[32]| E=[1]| F=[~1~0~1~0]| G= | H=[14]| I=[~0~0~0~1~0~0]| J=[1]| K=[0]| L=[0]

Thu Jul 13 16:42:36.213 2014 INFO:  pid 08 8: 2:81931: Dispatcher:UID=1F4, A=32, B=0, F=2, H=2, J=0, L=414

Thu Jul 03 16:42:36.646 2014 WARNING:  pid 028 8: 2:81939: no data found

Expected Output:

(date, A, H, L)

(Thu Jul 13 06:02:36.157 2014, 100, 14, 0)
(Thu Jul 13 16:42:36.213 2014, 32, 2, 414)

Sample Script:

The fields are not cleanly delimited (by tabs or commas, for example), so I load each line as a single string using TextLoader(). I only need to extract information from the INFO entries, so I filter the lines by log level (INFO and WARNING):

raw_logs = LOAD 'test1.log' USING TextLoader() as (line:chararray);

info = FILTER raw_logs BY (line matches '.*INFO.*');

warning = FILTER raw_logs BY (line matches '.*WARNING.*');

data = FOREACH info GENERATE
    SUBSTRING(line, 0, 28) AS date,
    REGEX_EXTRACT(line, 'A=([0-9]*)', 1) AS A,
    REGEX_EXTRACT(line, 'H=([0-9]*)', 1) AS H,
    REGEX_EXTRACT(line, 'L=([0-9]*)', 1) AS L;

Output:

(Thu Jul 13 06:02:36.157 2014 ,,,)
(Thu Jul 13 16:42:36.213 2014 ,32,2,414)

My regex cannot extract the values enclosed in [ ]. I also tried the FLATTEN and TOKENIZE functions, but could not get the desired output. Can anyone suggest the right approach to solve this problem?

One more thing: is it possible to write a single regular expression for the REGEX_EXTRACT_ALL function that works in all three cases and extracts every field, such as pid, UID, and the others?

2
This question deals with a similar issue, and there are two different approaches described. - Eyal

2 Answers

0
votes

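I think you need to escape the square brackets. [ and ] are regex metacharacters, and in your original pattern the digit group matches zero characters in front of the literal [, so you get an empty string back: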

data = foreach info generate ..., REGEX_EXTRACT(line, 'A=\\[([0-9]*+)\\]', 1)
                                , REGEX_EXTRACT(line, 'H=\\[([0-9]*+)\\]', 1)
                                , REGEX_EXTRACT(line, 'L=\\[([0-9]*+)\\]', 1)
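To see what the escaping changes, here is a minimal sketch in plain Java. It assumes REGEX_EXTRACT searches the line the way java.util.regex's Matcher.find() does, which is consistent with the output shown in the question. The extract helper and the shortened log lines are hypothetical stand-ins for illustration, not Pig's actual implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketEscapeDemo {
    // Hypothetical stand-in for REGEX_EXTRACT: search the line for the
    // pattern and return the requested group, or null when nothing matches.
    static String extract(String line, String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        // Shortened versions of the two INFO lines from the question.
        String bracketed = "pid 018 8: 2:81924:=[]| A=[100]| B=[]| H=[14]| L=[0]";
        String bare = "pid 08 8: 2:81931: Dispatcher:UID=1F4, A=32, H=2, L=414";

        // Original pattern: "[" is not a digit, so the group captures nothing.
        System.out.println("[" + extract(bracketed, "A=([0-9]*)", 1) + "]"); // prints []
        // Escaped brackets: the bracketed value is captured.
        System.out.println(extract(bracketed, "A=\\[([0-9]*+)\\]", 1));      // prints 100
        // The same pattern requires brackets, so the bare form does not match.
        System.out.println(extract(bare, "A=\\[([0-9]*+)\\]", 1));           // prints null
    }
}
```

Note the last case: with mandatory brackets, lines written as A=32 no longer match, so the second sample line would lose its values under this pattern.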
0
votes

Try this:

result = FOREACH info GENERATE
    SUBSTRING(line, 0, 28) AS date,
    REGEX_EXTRACT(line, 'A=\\[*([0-9]*)\\]*', 1) AS A,
    REGEX_EXTRACT(line, 'H=\\[*([0-9]*)\\]*', 1) AS H,
    REGEX_EXTRACT(line, 'L=\\[*([0-9]*)\\]*', 1) AS L;

It worked for me, so I hope it works for you too.
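The optional-bracket patterns can be checked outside Pig with a small Java sketch. It assumes REGEX_EXTRACT behaves like java.util.regex's Matcher.find(), which matches the output in the question. The extract and fields helpers and the shortened log lines are hypothetical, and the stricter \b variant at the end is only a suggestion, not part of the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalBracketDemo {
    // Hypothetical stand-in for REGEX_EXTRACT: search the line for the
    // pattern and return the requested group, or null when nothing matches.
    static String extract(String line, String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group(group) : null;
    }

    // Apply the optional-bracket pattern to the A, H and L fields of one line.
    static String fields(String line) {
        return "(" + extract(line, "A=\\[*([0-9]*)\\]*", 1)
                + ", " + extract(line, "H=\\[*([0-9]*)\\]*", 1)
                + ", " + extract(line, "L=\\[*([0-9]*)\\]*", 1) + ")";
    }

    public static void main(String[] args) {
        // Shortened versions of the two INFO lines from the question.
        String bracketed = "pid 018 8: 2:81924:=[]| A=[100]| B=[]| H=[14]| L=[0]";
        String bare = "pid 08 8: 2:81931: Dispatcher:UID=1F4, A=32, H=2, L=414";

        System.out.println(fields(bracketed)); // prints (100, 14, 0)
        System.out.println(fields(bare));      // prints (32, 2, 414)

        // An empty field such as B=[] yields an empty string with "*" ...
        System.out.println("[" + extract(bracketed, "B=\\[*([0-9]*)\\]*", 1) + "]"); // prints []
        // ... while a stricter variant (word boundary, at least one digit)
        // reports it as no match at all.
        System.out.println(extract(bracketed, "\\bB=\\[?([0-9]+)\\]?", 1));         // prints null
    }
}
```

Whether an empty string or a null is preferable for empty fields depends on how the result is used downstream, so treat the stricter variant as one option.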