0
votes

I am trying to extract the job name and region from a Splunk source using regex.

Below is the format of my sample source:

/home/app/abc/logs/20200817/job_DAILY_HR_REPORT_44414_USA_log

With the regex below, I am able to extract the job name:

(?<logdir>\/[\W\w]+\/[\W\w]+\/)(?<date>[^\/]+)\/job_(?<jobname>.+)_\d+

Here is the match so far:

Full match  0-53    /home/app/abc/logs/20200817/job_DAILY_HR_REPORT_44414
Group `logdir`  0-19    /home/app/abc/logs/
Group `date`    19-27   20200817
Group `jobname` 32-47   DAILY_HR_REPORT

I also need USA (the region) from the source. Can you please suggest how to get it? The region always appears after the number field (44414), which can vary in its number of digits, e.g. 123, 1234, 56789.

Thank you in advance.

1
Your regex seems quite appropriate for what you achieved. Why can't you develop the last part the same way? Is there a special obstacle which got you stuck? What did you try? How did it fail? - Yunnosch

1 Answer

2
votes

You could make the pattern a bit more specific about what you allow to match, as [\W\w]+ and .+ cause more backtracking to fit the rest of the pattern.

Then, for the region, you can add a named group at the end, (?<region>[^\W_]+), which matches one or more word characters other than an underscore.
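As a quick check of the smallest possible change, here is a sketch that appends only `_(?<region>[^\W_]+)` to the pattern from the question. It uses Python, which is my assumption and not something the question specifies; Python's `re` module spells named groups `(?P<name>...)` instead of `(?<name>...)`, so the group syntax is adjusted accordingly.

```python
import re

# The question's original pattern with a region group appended.
# Assumption: Python's re module, which uses (?P<name>...) for named groups.
pattern = re.compile(
    r"(?P<logdir>/[\W\w]+/[\W\w]+/)(?P<date>[^/]+)/job_(?P<jobname>.+)"
    r"_\d+_(?P<region>[^\W_]+)"
)

source = "/home/app/abc/logs/20200817/job_DAILY_HR_REPORT_44414_USA_log"
match = pattern.search(source)

if match:
    # groupdict() maps each named group to the text it captured.
    for name, value in match.groupdict().items():
        print(f"{name}: {value}")
```

The region group stops at the next underscore, so it captures USA without needing `_log` in the pattern.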

In parts

(?<logdir>\/(?:[^\/]+\/)*)(?<date>(?:19|20)\d{2}(?:0?[1-9]|1[012])(?:0[1-9]|[12]\d|3[01]))\/job_(?<jobname>\w+)_\d+_(?<region>[^\W_]+)_log
  • (?<logdir> Group logdir
    • \/(?:[^\/]+\/)* Match /, then optionally repeat 1+ chars other than / followed by a /
  • ) Close group
  • (?<date> Group date
    • (?:19|20)\d{2} Match a year starting with 19 or 20
    • (?:0?[1-9]|1[012]) Match a month
    • (?:0[1-9]|[12]\d|3[01]) Match a day
  • ) Close group
  • \/job_ Match /job_
  • (?<jobname>\w+) Group jobname, match 1+ word chars
  • _\d+_ Match 1+ digits between underscores
  • (?<region>[^\W_]+) Group region, match 1+ word chars except _
  • _log Match literally
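
To try the full pattern outside a regex tester, the following sketch wraps it in a small helper. Python and the function name `parse_source` are my own choices for illustration, and the second path is a made-up example to show a number field with a different digit count. Named groups are written `(?P<name>...)` because that is the syntax Python's `re` module uses.

```python
import re

# The pattern from this answer, split per group for readability.
# Assumption: Python's re module, hence (?P<name>...) named groups.
SOURCE_RE = re.compile(
    r"(?P<logdir>/(?:[^/]+/)*)"
    r"(?P<date>(?:19|20)\d{2}(?:0?[1-9]|1[012])(?:0[1-9]|[12]\d|3[01]))"
    r"/job_(?P<jobname>\w+)_\d+_(?P<region>[^\W_]+)_log"
)


def parse_source(source):
    """Return the named groups as a dict, or None if the source does not match.

    parse_source is a hypothetical helper name, not part of Splunk.
    """
    match = SOURCE_RE.search(source)
    return match.groupdict() if match else None


samples = [
    "/home/app/abc/logs/20200817/job_DAILY_HR_REPORT_44414_USA_log",
    # Hypothetical path with a 3-digit number field.
    "/var/log/20211231/job_NIGHTLY_SYNC_123_EU_log",
]

for sample in samples:
    print(parse_source(sample))
```

Because `\w+` in jobname also matches underscores and digits, the engine still backtracks a little to find the `_\d+_` part, but far less than with `.+` and `[\W\w]+`.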
