0
votes

enter image description hereI have a hive table with more than millions records. The input is of the following type: Input:

rowid     |starttime              |endtime                  |line   |status
--- 1     2007-07-19 00:05:00    2007-07-19 00:23:00        l1     s1 
--- 2     2007-07-20 00:00:10    2007-07-20 00:22:00        l1     s2
--- 3     2007-07-19 00:00:00    2007-07-19 00:11:00        l2     s2

What I want to do is when 1st order the table by starttime group by line. Then find the difference between two consecutive rows endtime and starttime. If the difference is more than 5mins then in a new table add a new row in between with status misstime.

In input row 1 & 2 the time time difference is 1 hour 10 mins so 1st I will create row for 19th Date and complete that days with missing time and then add one more row for 20th as below.

output:

rowid |starttime               |endtime              |line   |status
--- 1 |2007-07-19 00:05:00     |2007-07-19 00:23:00  |l1     |s1 
--- 2 |2007-07-19 00:23:01     |2007-07-19 00:00:00  |l1     |misstime
--- 3 |2007-07-20 00:00:01     |2007-07-20 00:00:09  |l1     |misstime
--- 4 |2007-07-20 00:00:10     |2007-07-20 00:22:00  |l1     |s2
--- 3 |2007-07-19 00:00:00     |2007-07-19 00:11:00  |l2     |s2 

Can anyone help me achieve this directly in hue - hive ? Unix script will also do.

Thanks in advance.

1
Can you please share sample date and corresponding result set in proper readable formate - sandeep rawat
Sorry about the formatting. Let me try to submit this in picture file format. thanks for replying though. - DrSD
Please click on the link to check the image view of the problem with formatting. Sorry about the inconvenience caused. - DrSD

1 Answers

0
votes

The solution template is:

  1. Use LAG() function to get previous line starttime or endtime.
  2. For each line calculate the different between current and previous time
  3. Filter rows with difference more than 5 minutes.
  4. Transform the dataset into required output.

Example:

    insert into yourtable
    select 
    s.rowid, 
    s.starttime ,
    s.endtime,
    --calculate your status here, etc, etc
from    
    (
    select rowid starttime endtime, 
           lag(endtime) over(partition by rowid order by starttime) prev_endtime
    from yourtable ) s
    where (unix_timestamp(endtime) - unix_timestamp(prev_endtime))/60 > 5 --latency>5 min