Incremental updates in HIVE using SQOOP appends data into middle of the table

Question

I am trying to append the new data from SQLServer to Hive using the following command

sqoop import --connect 'jdbc:sqlserver://10.1.1.12;database=testdb' --username uname --password passwd --table testable --where "ID > 11854" --hive-import -hive-table hivedb.hivetesttable --fields-terminated-by ',' -m 1

This command appends the data.

But when I run

select * from hivetesttable;

it doesnot show the new data at the end.

This is because the sqoop import statement for appending the new data result the mapper output as part-m-00000-copy

So my data in the hive table directory looks like

part-m-00000

part-m-00000-copy

part-m-00001

part-m-00002

Is there any way to append the data at end by changing the name of mapper?

Jarek Jarcec Cecho Jarek Jarcec Cecho · Accepted Answer · 2016-01-11T16:16:16

Hive, similarly to any other relational database, doesn't guarantee any order unless you explicitly use ORDER BY clause.

You're correct in your analysis - the reason why the data appears in the "middle" is that Hive will read one file after another based on lexicographical sort and Sqoop simply names the files that they will get appended somewhere in the middle of that list.

However this operation is fully valid - Sqoop appended data to Hive table and because your query doesn't have any explicit ORDER BY statement the result have no guarantees with regards to order. In fact Hive itself can change this behavior and read files based on time of creation without breaking any compatibility.

I'm also interested to see how this is affecting your use case? I'm assuming that the query to list all rows is just a test one. Do you have any issues with actual production queries?

Incremental updates in HIVE using SQOOP appends data into middle of the table

1 Answers