0
votes

We are trying to create a HIVE table which will include a sequence that will act as a unique identifier. This table will have data appended to it daily and we would like to keep the sequence running.

For the initial load we used 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence' this will obviously not be enough for future keys as it always starts from 0.

We thought about storing the max key from the table before every run and then add it to the sequence on every load operation but can't seem to make it work in HIVE. The problem we are facing is we cannot save the result of a query into a variable

Ideally it would do the following (pseudo code):

  1. create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
  2. store the current table max sequence into a variable

    currentMaxSequence = "Select max(sequenceKey) as max from justAnyTable"

  3. Append the max sequence+ current sequence to the new rows

    INSERT INTO TABLE anotherTable SELECT currentMaxSequence + row_sequence() as sk, name from (SELECT name from test limit 10) n;

any ideas how this can be done, specifically how can we store a result set from a query (one value) and use it in another one, Hive doesn't seem to like this nested queries.

Thanks, Roy

2

2 Answers

2
votes

You can store the result of max in a variable. Try using following command:

hive -e 'select concat('set maxValue=',t.maxvalue) from (select max(columnName) from tableName) t' >> /path/to/directory/storeMaxValue.sql

Then for the next run, run this command first:

hive -f /path/to/directory/storeMaxValue.sql;

Your variable is now available in the session. But this thing will not help you. Instead of this you try to store your max result in a table then use the result of the table as a join in the successive runs.

1
votes

Use a join.

INSERT INTO TABLE anotherTable
SELECT n1.max + row_sequence() as sk, name 
from (SELECT name from test limit 10) n
join
(Select max(sequenceKey) as max from justAnyTable) n1