1
votes
create table db.temp
location '/user/temp' as
SELECT t1.mobile_no
FROM db.temp t1
WHERE NOT EXISTS ( SELECT NULL
                   FROM db.temp t2
                   WHERE t1.mobile_no = t2.mobile_no
                     AND t1.cell != t2.cell
                     AND t2.access_time BETWEEN t1.access_time
                                            AND t1.access_time_5);

I need to get all the users who used the same cell for 5 hours of the time interval(access_time_5) from access time. This code perfectly fine with impala. But not works in Hive.

Gives an error

"Error while compiling statement: FAILED: SemanticException [Error 10249]: line 23:25 Unsupported SubQuery Expression"

I looked at a similar question related to this error. Can't figure out the solution. Any help would be highly appreciated!

2
Hi @mck. I tried this method. But it not worked! - S.Abeyrathne
correlated BETWEEN is not supported in Hive. - leftjoin
Hi @leftjoin! Any solution for getting rid of this issue? - S.Abeyrathne
Could you post some data example? - leftjoin

2 Answers

0
votes

Correlated BETWEEN is not supported in Hive as well as non-equi joins. Try to rewrite using LEFT JOIN, count rows with your condition and filter:

select mobile_no from
(
SELECT t1.mobile_no, 
       sum(case when t1.cell != t2.cell
                 and t2.access_time between t1.access_time and t1.access_time_5
               then 1 else 0
           end) as cnt_exclude
  FROM db.temp t1
       LEFT JOIN db.temp t2 on t1.mobile_no = t2.mobile_no
 GROUP BY t1.mobile_no
)s
where cnt_exclude=0

The problem with such solution is that LEFT JOIN may produce huge duplication and it will affect performance, though it may work if the data is not too big.

0
votes

It seems to me that window functions would be better for both databases. Let me assume that access_time is a Unix time (i.e. measured in seconds). You can easily convert the value to such a time:

SELECT t1.mobile_no
FROM (SELECT t1.*,
             MIN(t1.cell) OVER (PARTITION BY mobile_no
                                ORDER BY access_time
                                RANGE BETWEEN 17999 preceding AND CURRENT ROW
                               ) as min_cell,
             MAX(t1.cell) OVER (PARTITION BY mobile_no
                                ORDER BY access_time
                                RANGE BETWEEN 17999 preceding AND CURRENT ROW
                               ) as max_cell
      FROM db.temp t1
     ) t1
WHERE min_cell = max_cell;