Background
I want to develop a program to extract fields from unstructured log data. I am using grok to identify the regular expressions that match the input string. I have already managed to identify the individual regexes; now I want to merge them into one regex that matches the entire string.
Example -
Consider a CISCO PIX log line-
Mar 29 2004 09:54:18: %PIX-6-302005: Built UDP connection for faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53
For the logline above, I am identifying the following regular expressions -
CISCOTIMESTAMP - \b(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|ä)?r(?:ch|z)?|Apr(?:il)?|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?)\b +(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])(?: (?>\d\d){1,2})? (?!<[0-9])(?:2[0123]|[01]?[0-9]):(?:[0-5][0-9])(?::(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))(?![0-9])
CISCOTAG - [A-Z0-9]+-(?:[+-]?(?:[0-9]+))-(?:[A-Z0-9_]+)
CISCOACTION - Built|Teardown|Deny|Denied|denied|requested|permitted|denied by ACL|discarded|est-allowed|Dropping|created|deleted
IPV4 - (?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])
URIPATH - (?:/[A-Za-z0-9$.+!*'(){},~:;=@#%_\-]*)+(?:\?[A-Za-z0-9$.+!*'|(){},~@#%&/=:;_?\-\[\]<>]*)?
Problem
Now I want to merge these regular expressions together, but I also want to include the fillers in between. Example -
Built|Teardown|Deny|Denied|denied|requested|permitted|denied by ACL|discarded|est-allowed|Dropping|created|deleted
This regular expression matches the word Built in the log line, and -
(?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])
this identifies the first IP address, 198.207.223.240.
However, when I merge them together in regex101.com like this -
(Built|Teardown|Deny|Denied|denied|requested|permitted|denied by ACL|discarded|est-allowed|Dropping|created|deleted) ((?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9]))
They obviously don't gel well together, because there are words in between (UDP connection for faddr), which I call 'fillers'.
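To make the notion of a filler concrete, here is a minimal sketch in Python. The choice of Python's re module is an assumption (the question itself only mentions grok and regex101.com), and OCTET is a helper name introduced purely for readability. It matches the two patterns separately, extracts the text between the two matches, and shows that joining the patterns with a single space finds nothing.

```python
import re

# Patterns copied from the question; OCTET is a helper name introduced here
# only to keep the IPV4 pattern readable.
CISCOACTION = (
    r"Built|Teardown|Deny|Denied|denied|requested|permitted|"
    r"denied by ACL|discarded|est-allowed|Dropping|created|deleted"
)
OCTET = r"(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])"
IPV4 = rf"(?<![0-9])(?:{OCTET}[.]{OCTET}[.]{OCTET}[.]{OCTET})(?![0-9])"

line = (
    "Mar 29 2004 09:54:18: %PIX-6-302005: Built UDP connection for faddr "
    "198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53"
)

# Match each component pattern on its own.
action = re.search(CISCOACTION, line)
ip = re.search(IPV4, line)

# The filler is whatever sits between the end of one match and the start of the next.
filler = line[action.end():ip.start()]
print(repr(filler))

# Joining the two patterns with a single literal space cannot bridge that filler.
merged = re.search(rf"({CISCOACTION}) ({IPV4})", line)
print(merged)
```

Here filler is ' UDP connection for faddr ' and merged is None, which is the behaviour described above.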
I want to combine the captured regular expressions while considering the arbitrary 'fillers' in between.
Is there a way to do this?
My Approach
I have tried using (.*) and (.*?) but they are too powerful, i.e., they supersede the other patterns and match the entire remaining line.
Can someone please help me with achieving my desired result?
An ideal result would be -
CISCOTIMESTAMP + [FILLER REGEX] + CISCOTAG + [FILLER REGEX] + CISCOACTION + [FILLER REGEX] + IPV4 + URIPATH + and so on.
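One possible shape for that result, sketched in Python under stated assumptions: each component is wrapped in its own named group so its internal alternation stays contained, and the groups are joined with a lazy, non-capturing filler. The names combine, FILLER and OCTET are hypothetical and not part of grok. CISCOTIMESTAMP is left out because it contains an atomic group, (?>...), which Python's re module only accepts from version 3.11. This is an illustration of the idea, not a definitive implementation.

```python
import re

# Component patterns copied from the question (OCTET is an illustrative helper).
CISCOTAG = r"[A-Z0-9]+-(?:[+-]?(?:[0-9]+))-(?:[A-Z0-9_]+)"
CISCOACTION = (
    r"Built|Teardown|Deny|Denied|denied|requested|permitted|"
    r"denied by ACL|discarded|est-allowed|Dropping|created|deleted"
)
OCTET = r"(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])"
IPV4 = rf"(?<![0-9])(?:{OCTET}[.]{OCTET}[.]{OCTET}[.]{OCTET})(?![0-9])"

# Lazy filler: between two components it grows only until the next one can match.
FILLER = r".*?"


def combine(named_patterns):
    """Wrap each pattern in a named group and join the groups with the filler."""
    parts = [f"(?P<{name}>{pattern})" for name, pattern in named_patterns]
    return re.compile(FILLER.join(parts))


line = (
    "Mar 29 2004 09:54:18: %PIX-6-302005: Built UDP connection for faddr "
    "198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53"
)

combined = combine([("tag", CISCOTAG), ("action", CISCOACTION), ("ip", IPV4)])
match = combined.search(line)
fields = match.groupdict() if match else {}
print(fields)
```

Two details matter in this sketch: the filler is only ever placed between two components (a trailing filler followed by an end-of-line anchor would swallow the rest of the line), and every component sits inside its own group, so the | alternatives in CISCOACTION cannot split the combined pattern.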