1
votes

I am trying to update records into hive table. I am creating hive table from another existing hive table using following syntax (This is how we need to create table into automated script):

CREATE TABLE employee_master STORED as ORC as SELECT * FROM EMPLOYEE_SAMPLE;
ALTER TABLE employee_master clustered by (employeeid) into 2 buckets;
ALTER TABLE employee_master SET TBLPROPERTIES ('transactional'='true');`

After running above commands, employee_master table get created with ACID supports as following: enter image description here I loaded 100 records from another similar table. 10 sample records from the table are as following:

  • "employeeid","name","jobtitle","department","annualsalary"
  • "100","ADAMCZYK LAWRENCE M","FIREFIGHTER","FIRE","79926"
  • "52","ACCIARI NICHOLAS B","MOTOR TRUCK DRIVER","STREETS & SAN","71469"
  • "26","ABRAHAM DALINA D","COMMUNITY SERVICES","FAMILY & SUPPORT","69684"
  • "54","ACEVEDO AARON F","POLICE OFFICER","POLICE","60918"
  • "14","ABDOLLAHZADEH ALI","PARAMEDIC","FIRE","64374"
  • "56","ACEVEDO BIENVENIDO","POLICE OFFICER","POLICE","77238"
  • "28","ABRAHAM GODWIN K","TECHNICIAN V","BUSINESS AFFAIRS","84888"
  • "58","ACEVEDO DAILHA","DISEASE CONTROL INVESTIGATOR II","HEALTH","48108"
  • "8","ABBOTT BETTY L","FOSTER GRANDPARENT","FAMILY & SUPPORT","2756"
  • "60","ACEVEDO EDWIN C","FIREFIGHTER","FIRE","83148"

When I ran select * on the table, I can see all 100 records and select count (*) command also showing count as 100, i.e. table contains 100 records.

enter image description here

Directory structure for the employee_master table is shown as following: enter image description here

Now I want to update 5 records from above sample records one by one using following queries:

Update employee_master SET department="FIRE DEPARTMENT" where employeeid=100;
Update employee_master SET department="STREETS & SAN DEPARTMENT" where employeeid=52;
Update employee_master SET department="FAMILY & SUPPORT SERVICES" where employeeid=26;
Update employee_master SET department="POLICE DEPARTMENT" where employeeid=54;
Update employee_master SET department="FIRE DEPARTMENT" where employeeid=14;

While updating employee_master table as per the HIVE transaction documents, one delta file is created corresponding to one row update. After updating all above records, file structure for the employee_master table is shown as following:enter image description here

While running, count () and statistics command, I still see 100 records as output. After Major Compaction ran (manfully or automatically) on the table, I can see only 11 records in the table as an output of select () command. Means, when I run select () from employee_master, I can see only 11 records as following while which is not correct. Output should show all 100 records as before updating the table; enter image description here While running count() and statistics command as below; is showing count as 100 records as following:

enter image description here

I have tried this exercise numerous times in the same way on different- different tables/servers but all showing same unexpected output. Not sure if I am missing something or this is a bug.

NOTE: I tried once creating ORC transactional table using create table syntax as following:

CREATE IF NOT EXISTS employee_master (
    employeeid INT,
    name string,
    jobtitle string,
    department string,
    annualsalary DECIMAL(10,2))
COMMENT "Employee Master - Table"
CLUSTERED BY (employeeid) INTO 2 BUCKETS
STORED AS ORC 
TBLPROPERTIES ("transactional"="true")

Then I did not face above issue but this is not fitting into our current requirement.

I am using following software versions:

  • HDP-2.4.0.0-169
  • Hive 1.2.1000
  • Centos 6.8
  • Running Hive on TEZ client
2

2 Answers

1
votes

Would updating Hive to >= 1.3 be an option for you? According to the documentation:

Schema changes using ALTER TABLE is NOT supported for ACID tables. HIVE-11421 is tracking it. Fixed in 1.3.0/2.0.0.

Since you are using Hive 1.2.100 and you are trying to execute the ALTER TABLE commands, I think the issues you're having are related to that.

1
votes

You cannot do

ALTER TABLE employee_master clustered by (employeeid) into 2 buckets; 

after the table is created has data in it, at least not transactional=true table. This will not 're-bucket' the table that was not loaded properly in the first place.

Why not do

CREATE IF NOT EXISTS employee_master (
    employeeid INT,
    name string,
    jobtitle string,
    department string,
    annualsalary DECIMAL(10,2))
COMMENT "Employee Master - Table"
CLUSTERED BY (employeeid) INTO 2 BUCKETS
STORED AS ORC 
TBLPROPERTIES ("transactional"="true")

and then

INSERT INTO employee_master SELECT * FROM EMPLOYEE_SAMPLE;

this should work just fine. It will properly populate the table and you'll be able to run your Update commands on it.