0
votes

My table structure is like this:

hive> describe user_data2;
OK
received_at             string                                      
message_id              string                                      
type                    string                                      
version                 string                                      
timestamp_user          string                                      
user_id                 string                                      
sent_at                 string                                      
channel                 string                                      
time_log                string 

And I am targetting this fields,

hive> select received_at, time_log, user_id from user_data2 limit 5;
OK
2016-01-08T12:27:05.565Z    1452256025  836871
2016-01-08T12:27:12.634Z    1452256033  800798
2016-01-08T12:27:12.632Z    1452256033  795799
2016-01-08T12:27:13.694Z    1452256033  820359
2016-01-08T12:27:15.821Z    1452256036  294141

On this I want to make time based query. like

  1. Avg Hours Active; per month; period: last 12 months
  2. % of users active between 0-1h / day
  3. % of users active between 1-2h / day
  4. % of users active between 2-4h / day
  5. % of users active between 4-8h / day
  6. % of users active between 8-12h / day
  7. % of users active between 12-16h / day
  8. % of users active between 16-24h / day

I got some clue of using Datetime UDF - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions

But I am not aware how to use this function.

I tried:

select unix_timestamp(received_at) from user_data2 limit 5;
OK
NULL
NULL
NULL
NULL
NULL

Which gives none.

I appreciate if someone give example of using time UDF and getting records between two hours or some other time frame.

1
RTFM - the documentation for unix_timestamp() is explicit, it expects an ODBC format without time zone while your data uses ISO8601 format with TZ >>> you must first reformat the STRING with some regexp_replace(), then convert to a "local TZ" TIMESTAMP with from_utc_timestamp(), and only then you will be able to play with unix_timestamp() as a LONG. - Samson Scharfrichter
By the way, a "UNIX timestamp" will not be very helpful, since you want to group records based on wall clock time... - Samson Scharfrichter
@SamsonScharfrichter: Thanks, but in time_log I have timestamp, using that can I play what I want? - Feedly news

1 Answers

1
votes

Assuming your local TZ is Rome...

select
  from_utc_timestamp(regexp_replace(regexp_replace(RECEIVED_AT, 'T',' '), '\\..*$',''), 'Europe/Rome') as TS_RECEIVED,
  cast(from_unixtime(cast(TIME_LOG as int)) as timestamp) as TS_LOGGED
from WTF ;

+------------------------+------------------------+--+
|      ts_received       |      ts_logged         |
+------------------------+------------------------+--+
| 2016-01-08 13:27:05.0  | 2016-01-08 13:27:05.0  |
+------------------------+------------------------+--+