0
votes

I am querying a table in Hive with json payloads and am extracting the timestamp from these payloads. the problem is that timestamps are present in different timezone formats and I'm trying to extract them all in my timezone.

I am currently using the following:

select
from_unixtime(unix_timestamp(get_json_object (table.payload, '$.timestamp'), "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"))
FROM table

This is returning the correct values if the timestamp is in this format: 2018-08-16T08:54:05.543Z --> 2018-08-16 18:54:05 (changed format and converted into my timezone)

However the query above returns 'null' if the payload contains the timestamp in this format: 2018-09-13T01:35:08.460+0000 2018-09-13T11:35:09+10:00

How can I adjust my query to work for all types of timestamps all converting to proper timezone (+10 is my timezone!) and all in the same format?

Thanks in advance!

2
You will need a very complex CASE WHEN logic with regular expression matching/replacing, to handle most formats (and don't expect to handle all - you would need information about the locale to match DEC/DEZ or 01/02 as Jan 2nd vs Feb 1st). Create a TEMPORARY MACRO if you can't develop a Java UDF. - Samson Scharfrichter

2 Answers

-1
votes

Without regexp use Z for +1000 of XXX for +10:00 :

select unix_timestamp('2016-07-30T10:29:33.000+03:00', "yyyy-MM-dd'T'HH:mm:ss.SSSXXX") as t1
select unix_timestamp('2016-07-30T10:29:33.000+0300', "yyyy-MM-dd'T'HH:mm:ss.SSSZ") as t2

Full docs about time formats: https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html

0
votes

How about the following macro:

create temporary macro extract_ts(ts string) 
  from_unixtime(unix_timestamp(regexp_extract(ts, '(.*)\\+(.*)', 1), "yyyy-MM-dd'T'HH:mm:ss") + 3600*cast(regexp_extract(ts, '(.*)\\+(.*)\\:(.*)', 2) as int));

e.g.,

hive> select extract_ts('2018-09-13T11:35:09+10:00');
OK
2018-09-13 21:35:09