13
votes

How can I generate row numbers for an existing table while running a select query?
For example:

select row_number(), * from emp;

I am using hive 0.13. I can't access external jars or udfs in my environment. The underlying files are in parquet format.

Thanks in advance!

2
If you try to run this kind of analytic functions on large data sets (i.e. over 50 million rows) then be careful to test your data consistency. I have seen subtle data corruption occur in a deterministic way with V0.13 and V0.14 -- the row numbers were in sequence but some thousands of rows had been dropped and replaced by a copy of other rows. But that may be specific to Hive-on-TEZ.Samson Scharfrichter

2 Answers

32
votes

ROW_NUMBER() is a windowing function so it needs to be used in conjunction with an OVER clause. Just don't specify any PARTITION.

SELECT *, ROW_NUMBER() OVER () AS row_num
FROM emp
--- other stuff
14
votes

row_number() can be used to find for example, recent visit of a user on your site.

SELECT user_id,user_name,timestamp
FROM (
SELECT user_id,user_name,timestamp,row_number() over (partition by userid order by timestamp desc) as visit_number 
from user) user_table
    WHERE visit_number = 1