I am trying to update records into hive table. I am creating hive table from another existing hive table using following syntax (This is how we need to create table into automated script):
CREATE TABLE employee_master STORED as ORC as SELECT * FROM EMPLOYEE_SAMPLE;
ALTER TABLE employee_master clustered by (employeeid) into 2 buckets;
ALTER TABLE employee_master SET TBLPROPERTIES ('transactional'='true');`
After running above commands, employee_master table get created with ACID supports as following: I loaded 100 records from another similar table. 10 sample records from the table are as following:
- "employeeid","name","jobtitle","department","annualsalary"
- "100","ADAMCZYK LAWRENCE M","FIREFIGHTER","FIRE","79926"
- "52","ACCIARI NICHOLAS B","MOTOR TRUCK DRIVER","STREETS & SAN","71469"
- "26","ABRAHAM DALINA D","COMMUNITY SERVICES","FAMILY & SUPPORT","69684"
- "54","ACEVEDO AARON F","POLICE OFFICER","POLICE","60918"
- "14","ABDOLLAHZADEH ALI","PARAMEDIC","FIRE","64374"
- "56","ACEVEDO BIENVENIDO","POLICE OFFICER","POLICE","77238"
- "28","ABRAHAM GODWIN K","TECHNICIAN V","BUSINESS AFFAIRS","84888"
- "58","ACEVEDO DAILHA","DISEASE CONTROL INVESTIGATOR II","HEALTH","48108"
- "8","ABBOTT BETTY L","FOSTER GRANDPARENT","FAMILY & SUPPORT","2756"
- "60","ACEVEDO EDWIN C","FIREFIGHTER","FIRE","83148"
When I ran select * on the table, I can see all 100 records and select count (*) command also showing count as 100, i.e. table contains 100 records.
Directory structure for the employee_master table is shown as following:
Now I want to update 5 records from above sample records one by one using following queries:
Update employee_master SET department="FIRE DEPARTMENT" where employeeid=100;
Update employee_master SET department="STREETS & SAN DEPARTMENT" where employeeid=52;
Update employee_master SET department="FAMILY & SUPPORT SERVICES" where employeeid=26;
Update employee_master SET department="POLICE DEPARTMENT" where employeeid=54;
Update employee_master SET department="FIRE DEPARTMENT" where employeeid=14;
While updating employee_master table as per the HIVE transaction documents, one delta file is created corresponding to one row update. After updating all above records, file structure for the employee_master table is shown as following:
While running, count () and statistics command, I still see 100 records as output. After Major Compaction ran (manfully or automatically) on the table, I can see only 11 records in the table as an output of select () command. Means, when I run select () from employee_master, I can see only 11 records as following while which is not correct. Output should show all 100 records as before updating the table; While running count() and statistics command as below; is showing count as 100 records as following:
I have tried this exercise numerous times in the same way on different- different tables/servers but all showing same unexpected output. Not sure if I am missing something or this is a bug.
NOTE: I tried once creating ORC transactional table using create table syntax as following:
CREATE IF NOT EXISTS employee_master (
employeeid INT,
name string,
jobtitle string,
department string,
annualsalary DECIMAL(10,2))
COMMENT "Employee Master - Table"
CLUSTERED BY (employeeid) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional"="true")
Then I did not face above issue but this is not fitting into our current requirement.
I am using following software versions:
- HDP-2.4.0.0-169
- Hive 1.2.1000
- Centos 6.8
- Running Hive on TEZ client