De-duplication from two hive tables

Question

We are stuck with a problem where-in we are trying to do a near real time sync between a RDBMS(Source) and hive (Target). Basically the source is pushing the changes (inserts, updates and deletes) into HDFS as avro files. These are loaded into external tables (with avro schema), into the Hive. There is also a base table in ORC, which has all the records that came in before the Source pushed in the new set of records.

Once the data is received, we have to do a de-duplication (since there could be updates on existing rows) and remove all deleted records (since there could be deletes from the Source).

We are now performing a de-dupe using rank() over partitioned keys on the union of external table and base table. And then the result is then pushed into a new table, swap the names. This is taking a lot of time.

We tried using merges, acid transactions, but rank over partition and then filtering out all the rows has given us the best possible time at this moment.

Is there a better way of doing this? Any suggestions on improving the process altogether? We are having quite a few tables, so we do not have any partitions or buckets at this moment.

notNull notNull · Accepted Answer · 2018-03-16T20:10:27

You can try with storing all the transactional data into Hbase table.

Storing data into Hbase table using Primary key of RDBMS table as Row Key:-

Once you pull all the data from RDBMS with NiFi processors(executesql,Querydatabasetable..etc) we are going to have output from the processors in Avro format.

You can use ConvertAvroToJson processor and then use SplitJson Processor to split each record from array of json records.

Store all the records in Hbase table having Rowkey as the Primary key in the RDBMS table.

As when we get incremental load based on Last Modified Date field we are going to have updated records and newly added records from the RDBMS table.

If we got update for the existing rowkey then Hbase will overwrite the existing data for that record, for newly added records Hbase will add them as a new record in the table.

Then by using Hive-Hbase integration you can get the Hbase table data exposed using Hive.

https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration

By using this method we are going to have Hbase table that will take care of all the upsert operations and we cannot expect same performance from hive-hbase table vs native hive table will perform faster,as hbase tables are not meant for sql kind of queries, hbase table is most efficient if you are accessing data based on Rowkey,

if we are going to have millions of records then we need to do some tuning to the hive queries

Tuning Hive Queries That Uses Underlying HBase Table

De-duplication from two hive tables

1 Answers