
I am a newbie to Hive transactions, which have been supported since Hive 0.14.

I created a transactional table in Hive as follows:

create table test_transactional(id int,name string) clustered by (id) into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');

I also created a staging table with some sample data that has the same id and name columns.

create table stage(id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Then I inserted the data from the stage table into the test_transactional table as follows:

insert into table test_transactional select * from stage;

I also ran an update statement to see if it's working. So far so good.
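For reference, the kind of update I ran looked something like this (the id value and new name here are just illustrative):

```sql
-- Hypothetical example: update a single row to verify ACID support
UPDATE test_transactional SET name = 'updated_name' WHERE id = 1;
```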

Questions

  1. When I do an ls on the HDFS folder for the test_transactional table, I don't see a base folder, only delta folders. As per the documentation, my understanding is that the base folder should be present. Or is the base folder created only when a major compaction is run?

  2. The table gets updated when update/insert/delete statements are run explicitly, but in a typical ETL, when I have data in the staging table, how do I determine whether a row is an update, insert, or delete? Do I need to set some properties so that Hive can do this automatically? If not, how can this be handled?


1 Answer

  1. Yes, you are right: the base folder is created only when a major compaction runs. You can trigger one manually with the following command:

ALTER TABLE <table name> [PARTITION (<partition spec>)] COMPACT 'major';
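After issuing the compaction you can watch its progress; once the major compaction finishes, a base_N directory should appear in the table's HDFS location. For example:

```sql
-- Queue a major compaction on the table from the question (it has no partitions)
ALTER TABLE test_transactional COMPACT 'major';

-- Check the compaction queue/history; the request shows up with type MAJOR
SHOW COMPACTIONS;
```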

  2. We should have three columns (createdon, modifiedon, deleteflag) in the base system from which we pull data into the staging table, to track inserts, updates, and deletes respectively.

Limitation: the base system should not physically delete any records; instead it should set deleteflag = 1.
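If you are on Hive 2.2 or later, which added MERGE for ACID tables, the three tracking columns above can drive all three operations in one statement. A sketch, assuming the stage table has been extended with a deleteflag column (not present in the question's original DDL):

```sql
-- Sketch only: assumes stage carries (id, name, deleteflag)
MERGE INTO test_transactional t
USING stage s
ON t.id = s.id
WHEN MATCHED AND s.deleteflag = 1 THEN DELETE
WHEN MATCHED AND s.deleteflag = 0 THEN UPDATE SET name = s.name
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name);
```

On older Hive versions you would instead run separate INSERT, UPDATE, and DELETE statements joined against the staging table.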