I am trying to read a log file whose contents look like this:

2013-03-28T12:19:03.639648-05:00 host1 rpcbind: rpcbind terminating on signal. Restart with "rpcbind -w"
2013-03-28T12:20:33.158823-05:00 host2 rpcbind: rpcbind terminating on signal. Restart with "rpcbind -w"

I have tried using the PigStorage space delimiter like so:

cmessages = LOAD 'data.txt' USING PigStorage(' ') AS (date:chararray, host:chararray, message:chararray);

But that kills the message in the third field, which I think might be useful later.

dump cmessages;

<snip>
(2013-03-28T12:19:03.639648-05:00,host1,rpcbind:)
(2013-03-28T12:20:33.158823-05:00,host2,rpcbind:)
</snip>

Is there a better way to read in this log file that doesn't require costly regular expressions or a UDF loader? There should be something in Pig that maybe says stop after the second space? Maybe not.

UPDATE: Just to revise what I want: Instead of

(2013-03-28T12:19:03.639648-05:00,host1,rpcbind:)

I'd like:

(2013-03-28T12:19:03.639648-05:00, host1, rpcbind: rpcbind terminating on signal. Restart with "rpcbind -w")

Essentially, I want the full log message in the last field of the tuple. I hope that is clearer.

1 Answer

There is no perfect solution without knowing the exact rules that govern your log format, but if you assume that the date and host have fixed lengths, you could use the following:

A = load 'mydata' as (log:chararray);
-- SUBSTRING's stop index is exclusive: the date is 32 characters long, the host 5
B = foreach A generate SUBSTRING(log, 0, 32) AS date, 
                       SUBSTRING(log, 33, 38) AS host, 
                       SUBSTRING(log, 39, 255) AS message;

If the fields are only known to be delimited by the first two spaces, you could use the following:

A = load 'mydata' as (log:chararray);
-- index: position of the first space; index2: position of the second space
B = foreach A generate log, INDEXOF(log, ' ', 0) as index;
C = foreach B generate log, index, INDEXOF(log, ' ', index + 1) AS index2;
D = foreach C generate SUBSTRING(log, 0, index) AS date, 
                       SUBSTRING(log, index + 1, index2) as host, 
                       SUBSTRING(log, index2 + 1, 255) as message;

You have to know the "rules" governing the logs and then choose the appropriate method. Both snippets also assume that no log line is longer than 255 characters; anything past the stop index of 255 is cut off.
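
The fixed stop index of 255 can probably be avoided by passing the line's own length to SUBSTRING instead. This is a minimal, untested sketch: it assumes SIZE returns the character count of a chararray as a long (hence the cast to int) and that every line contains at least two spaces. TextLoader, the aliases sp1 and sp2, and the reuse of 'data.txt' from the question are my own choices for illustration, not part of the snippets above.

```pig
-- TextLoader reads each line into a single chararray field
A = LOAD 'data.txt' USING TextLoader() AS (log:chararray);
-- sp1 / sp2: positions of the first and second space (hypothetical alias names)
B = FOREACH A GENERATE log, INDEXOF(log, ' ', 0) AS sp1;
C = FOREACH B GENERATE log, sp1, INDEXOF(log, ' ', sp1 + 1) AS sp2;
-- Use the line's own length as the stop index, so long messages are not truncated
D = FOREACH C GENERATE SUBSTRING(log, 0, sp1) AS date,
                       SUBSTRING(log, sp1 + 1, sp2) AS host,
                       SUBSTRING(log, sp2 + 1, (int)SIZE(log)) AS message;
```

Lines with fewer than two spaces are not handled here; INDEXOF returns -1 for them, so you may want to filter those out first.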