
I am doing some practice on weblog parsing and here is a question on regex:

The log file is in the format of:

in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839

I need to get the timestamp, here is what I have now:

regexp_extract('value', r'((\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4}))', 1).alias('timestamp'),

This returns me:

01/Aug/1995:00:00:01 -0400

My question is: what does -0400 mean? Is it a time zone? How do I remove it?

Do you have any understanding of how regular expressions work? Because it should be very obvious to you which part of the regular expression matches -0400 in that string. And yes, it's a time zone. - miken32
Regex is honestly so confusing to me, but I am willing to learn because it is really powerful. All I want to know is how to get rid of the -0400, and I think all I need to do is remove -\d{4} - mdivk
That is exactly right. \d matches a digit and {4} means there are exactly four of them. - miken32

1 Answer


Yes, that's a time zone: -0400 is the offset from UTC, meaning the logged local time is 4 hours behind UTC.
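To see what the offset means in practice, here is a minimal sketch using Python's standard datetime module rather than the PySpark call from the question. The variable names are illustrative only, and it assumes English month abbreviations (the default C locale) so that %b can parse "Aug".

```python
from datetime import datetime, timedelta, timezone

# Sample timestamp copied from the log line in the question.
raw = "01/Aug/1995:00:00:01 -0400"

# %z reads the trailing "-0400" as a UTC offset of minus four hours.
# Assumption: %b sees English month abbreviations (default C locale).
parsed = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")

offset = parsed.utcoffset()                # offset of the local time from UTC
as_utc = parsed.astimezone(timezone.utc)   # the same instant expressed in UTC

print(offset.total_seconds() / 3600)       # -4.0
print(as_utc.isoformat())                  # 1995-08-01T04:00:01+00:00
```

So dropping -0400 from the extracted string also drops the information needed to convert the timestamp to UTC, which may or may not matter for your analysis.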

You can remove it by dropping the " -\d{4}" part (a space, a dash, and four digits) from the end of the pattern. So this is what you're looking for:

regexp_extract('value', r'(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})', 1).alias('timestamp'),


As an explanation of the part that was removed:

  • " -" matches a space followed by a dash, both literally
  • \d matches a single digit
  • {4} repeats the preceding \d exactly 4 times
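The difference between the two patterns can be checked outside Spark with a short sketch using Python's built-in re module. This is only an illustration: Spark's regexp_extract uses Java regular expressions, but \d, \w and {n} should behave the same way for this input. The names with_tz and without_tz are made up for the example.

```python
import re

# Sample log line from the question.
line = ('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
        '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')

# Original pattern: the trailing " -\d{4}" also captures the UTC offset.
with_tz = re.compile(r'(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})')

# Same pattern with " -\d{4}" removed: the match stops after the seconds.
without_tz = re.compile(r'(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})')

ts_with_tz = with_tz.search(line).group(1)
ts_without_tz = without_tz.search(line).group(1)

print(ts_with_tz)      # 01/Aug/1995:00:00:01 -0400
print(ts_without_tz)   # 01/Aug/1995:00:00:01
```

Note that " -\d{4}" only matches negative offsets; a log with offsets such as +0100 would need something like " [+-]\d{4}" instead.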