How to extract the timestamp and remove the trailing portion from a weblog using regex in PySpark?

Question

I am doing some practice on weblog parsing and here is a question on regex:

The log file is in the format of:

in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839

I need to get the timestamp, here is what I have now:

regexp_extract('value', r'((\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4}))', 1).alias('timestamp'),

This returns:

01/Aug/1995:00:00:01 -0400

My question is: what does -0400 mean? Is it the time zone? How do I remove it?

Do you have any understanding of how regular expressions work? Because it should be very obvious to you which part of the regular expression matches -0400 in that string. And yes, it's a time zone. — miken32
Regex honestly confuses me, but I am willing to learn because it is really powerful. All I want to know is how to get rid of the -0400, and it looks like all I need to do is remove -\d{4}. — mdivk
That is exactly right. \d is a number and {4} means there are four of them. — miken32

Shafizadeh · Accepted Answer · 2016-07-28T03:11:11

Yes, that's the time zone: -0400 is an offset of four hours behind UTC.

You can remove it by dropping the -\d{4} part of the pattern, along with the space in front of it. So this is what you're looking for:

regexp_extract('value', r'((\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}))', 1).alias('timestamp'),
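To see the difference between the two patterns without starting a Spark session, here is a minimal sketch using Python's built-in re module on the sample line from the question. It assumes the pattern behaves the same way under Python's regex engine as under the Java regex engine that Spark's regexp_extract uses, which should hold for a pattern this simple. The variable names are only for illustration, and the redundant second pair of parentheses is dropped, since one capture group is enough.

```python
import re

# Sample log line copied from the question.
line = ('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
        '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')

# Pattern from the question: date, time, then a space and the UTC offset.
with_offset = r'(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} -\d{4})'

# Same pattern with the trailing " -\d{4}" (space, minus sign, four digits) removed.
without_offset = r'(\d\d/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})'

# group(1) returns the text captured by the first pair of parentheses,
# which is what the group index 1 selects in regexp_extract as well.
ts_with_offset = re.search(with_offset, line).group(1)
ts_without_offset = re.search(without_offset, line).group(1)

print(ts_with_offset)     # 01/Aug/1995:00:00:01 -0400
print(ts_without_offset)  # 01/Aug/1995:00:00:01
```

Note that the shorter pattern stops matching right after the seconds, so the offset is simply left out of the captured group rather than being deleted from the log line.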

Also, as an explanation: