0
votes

Background

I want to develop a program to extract fields from unstructured log data. I am using grok to identify the regular expressions that match the input string. While I have achieved the part where I identify the regexes, I want to merge the identified regexes into one, so as to match the entire string

Example -

Consider a CISCO PIX log line-

Mar 29 2004 09:54:18: %PIX-6-302005: Built UDP connection for faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53

For the logline above, I am identifying the following regular expressions -

CISCOTIMESTAMP - \b(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|ä)?r(?:ch|z)?|Apr(?:il)?|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?)\b +(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])(?: (?>\d\d){1,2})? (?!<[0-9])(?:2[0123]|[01]?[0-9]):(?:[0-5][0-9])(?::(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))(?![0-9])

CISCOTAG - [A-Z0-9]+-(?:[+-]?(?:[0-9]+))-(?:[A-Z0-9_]+)

CISCOACTION - Built|Teardown|Deny|Denied|denied|requested|permitted|denied by ACL|discarded|est-allowed|Dropping|created|deleted

IPV4 - (?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])

URIPATH - (?:/[A-Za-z0-9$.+!*'(){},~:;=@#%_\-]*)+(?:\?[A-Za-z0-9$.+!*'|(){},~@#%&/=:;_?\-\[\]<>]*)?


Problem

Now, I want to merge these regular expressions together, but I want to also include the fillers in between. Example -

Built|Teardown|Deny|Denied|denied|requested|permitted|denied by ACL|discarded|est-allowed|Dropping|created|deleted

This regular expression matches the Built word in the logline, and -

(?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])

this identifies the first 198.207.223.240 (IP Address).

However, when I merge them together in regex101.com like this -

(Built|Teardown|Deny|Denied|denied|requested|permitted|denied by ACL|discarded|est-allowed|Dropping|created|deleted) ((?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9]))

They don't gel well together, obviously, because there are words in between - UDP connection for faddr - which I call 'fillers'

I want to combine the captured regular expressions while considering the arbitrary 'fillers' in between.

Is there a way to do this?


My Approach

I have tried using (.*) and (.*?) but they are too powerful, i.e., the supersede the other patterns and match the entire remaining line.


Can someone please help me with achieving my desired result?

An ideal result would be -

CISCOTIMESTAMP + [FILLER REGEX] + CISCOTAG + [FILLER REGEX] + CISCOACTION + [FILLER REGEX] + IPv4 + URIPATH + so on so forth.

1
your patterns seem to work fine with .* and .*?thatNLPguy

1 Answers

0
votes

URIPATH seems off on regex101. You haven't escaped '/'
Once that's done, this works.

URIPATH: ((?:\/[A-Za-z0-9$.+!*'(){},~:;=@#%_\-]*)+(?:\?[A-Za-z0-9$.+!*'|(){},~@#%&\/=:;_?\-\[\]<>]*)?)

And with that the rest works, with .* as the filler regex.

CISCOTIMESTAMP + [FILLER REGEX] + CISCOTAG + [FILLER REGEX] + CISCOACTION + [FILLER REGEX] + IPv4 + URIPATH

The entire regex below

(\b(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|ä)?r(?:ch|z)?|Apr(?:il)?|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?)\b +(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])(?: (?>\d\d){1,2})? (?!<[0-9])(?:2[0123]|[01]?[0-9]):(?:[0-5][0-9])(?::(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))(?![0-9])).*([A-Z0-9]+-(?:[+-]?(?:[0-9]+))-(?:[A-Z0-9_]+)).*(Built|Teardown|Deny|Denied|denied|requested|permitted|denied by ACL|discarded|est-allowed|Dropping|created|deleted).*((?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9]))((?:\/[A-Za-z0-9$.+!*'(){},~:;=@#%_\-]*)+(?:\?[A-Za-z0-9$.+!*'|(){},~@#%&\/=:;_?\-\[\]<>]*)?)

Here's a link to a demo