1
votes

My log file record looks like this:

107.344.154.200 - - [23/Aug/2005:00:03:14 -0400] "GET /images/theimage.gif HTTP/1.0" 200 11401

I have this this syntax to parse a log file

CREATE TABLE logfile (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING, size STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "([^ ]) ([^ ]) ([^ ]) (-|\[[^\]]\]) ([^ \"]|\"[^\"]\") (-|[0-9]) (-|[0-9])",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s" ) STORED AS TEXTFILE;

What regex syntax could i use to parse time where it will split [23/Aug/2005:00:03:14 -0400] by day month year minutes second?

1
Just a friendly tip, you may want to read over this page: The How-To-Ask Guide so you can always be sure that your questions are easily answerable and as clear as possible. Be sure to include any efforts you've made to fix the problem you're having, and what happened when you attempted those fixes. Also don't forget to your show code and any error messages! - Matt C

1 Answers

1
votes

Description

This regex will do the following:

  • Parse the log entry and look for the date and time
  • Capture the various date parts, like day, month, year, hour, minute, second, UTC offset

The Regex

\[(\d{2})/([a-zA-Z]{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})\s(-\d{4})]

Note, depending on the language you may have have to escape the / by replacing them with \/. but ever language is different.

Explanation

Regular expression visualization

NODE                     EXPLANATION
----------------------------------------------------------------------
  \[                       '['
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \d{2}                    digits (0-9) (2 times)
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  /                        '/'
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [a-zA-Z]{3}              any character of: 'a' to 'z', 'A' to 'Z'
                             (3 times)
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  /                        '/'
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    \d{4}                    digits (0-9) (4 times)
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  :                        ':'
----------------------------------------------------------------------
  (                        group and capture to \4:
----------------------------------------------------------------------
    \d{2}                    digits (0-9) (2 times)
----------------------------------------------------------------------
  )                        end of \4
----------------------------------------------------------------------
  :                        ':'
----------------------------------------------------------------------
  (                        group and capture to \5:
----------------------------------------------------------------------
    \d{2}                    digits (0-9) (2 times)
----------------------------------------------------------------------
  )                        end of \5
----------------------------------------------------------------------
  :                        ':'
----------------------------------------------------------------------
  (                        group and capture to \6:
----------------------------------------------------------------------
    \d{2}                    digits (0-9) (2 times)
----------------------------------------------------------------------
  )                        end of \6
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (                        group and capture to \7:
----------------------------------------------------------------------
    -                        '-'
----------------------------------------------------------------------
    \d{4}                    digits (0-9) (4 times)
----------------------------------------------------------------------
  )                        end of \7
----------------------------------------------------------------------
  ]                        ']'
----------------------------------------------------------------------

Sample text

107.344.154.200 - - [23/Aug/2005:00:03:14 -0400] "GET /images/theimage.gif HTTP/1.0" 200 11401

Live Demo

https://regex101.com/r/hF4fP8/1

Sample Matches

[0][0] = [23/Aug/2005:00:03:14 -0400]
[0][1] = 23
[0][2] = Aug
[0][3] = 2005
[0][4] = 00
[0][5] = 03
[0][6] = 14
[0][7] = -0400