1
votes

I have two views in Hive:

+------------+
| Table_1    |
+------------+
| hash       |
| campaignId |
+------------+

+-----------------+
| Table_2         |
+-----------------+
| campaignId      |
| accountId       |
| parentAccountID |
+-----------------+

Now I have to fetch 'Table_1' data filtered by accountId & parentAccountID, for which I have written the following query:

SELECT /*+ MAPJOIN(T2) */ T1.hash, COUNT(T1.campaignId) num_campaigns
FROM Table_1 T1
JOIN Table_2 T2 ON T1.campaignId = T2.campaignId
WHERE (T2.accountId IN ('aid1', 'aid2') OR T2.parentAccountID IN ('aid1', 'aid2'))
GROUP BY T1.hash

This query works but is slow. Is there a better alternative to this JOIN?

I am reading Table_1 from Kafka through Spark.
Slide duration is 5 seconds.
Window duration is 2 minutes.

Table_2, meanwhile, is in an RDBMS that Spark reads through JDBC, and it has 4,500 records.

Every 5 seconds, Kafka pumps in approximately 2K records in CSV format.
I need the data to be processed within 5 seconds, but currently it takes between 8 and 16 seconds.
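For reference, the setup looks roughly like this (simplified; the broker, topic, and JDBC details below are just placeholders, and `sc`/`sqlContext` come from the existing application):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(5))                  // slide / batch interval: 5 sec

val kafkaParams = Map("metadata.broker.list" -> "broker:9092")  // placeholder broker
val lines = KafkaUtils
  .createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("campaign_topic"))                    // ~2K CSV records every 5 sec
  .map(_._2)
  .window(Minutes(2), Seconds(5))                               // window 2 min, slide 5 sec

val table2 = sqlContext.read
  .jdbc("jdbc:...", "Table_2", new java.util.Properties)        // ~4,500 rows from the RDBMS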

As per suggestions:

  1. I have repartitioned Table_1 by the columns campaignId and hash.
  2. I have repartitioned Table_2 by the columns accountId and parentAccountID.
  3. I have implemented MAPJOIN.

But still no improvement.

NOTE: If I remove the window duration, the processing does finish in time, maybe because there is less data to process. But that is not the requirement.

5
(1) How would you define "slow"? (2) What volumes are we talking about? - David דודו Markovitz
I am using this in Spark Streaming, which processes data every 5 seconds, but it's taking more than 10 seconds to process each batch. (Note: the slide duration is 5 sec but the window duration is 10 sec.) Each batch has approx. 24K records. - Sandeep Kumar Roy
Do you need the 'hash' column? - David דודו Markovitz
Yes. I am grouping by it. - Sandeep Kumar Roy
Are you doing this query for every RDD (of the DStream) in every batch? Because I feel that creating a HiveContext and running the Hive query for every batch can be expensive. - KiranM

5 Answers

0
votes

With the right indexes, the following can be faster:

SELECT T1.*
FROM Table_1 T1 JOIN
     Table_2 T2
     ON T1.campaignId = T2.campaignId
WHERE T2.accountId IN ('aid1', 'aid2') 
UNION ALL
SELECT T1.*
FROM Table_1 T1 JOIN
     Table_2 T2
     ON T1.campaignId = T2.campaignId
WHERE T2.parentAccountID IN ('aid1', 'aid2') AND
      T2.accountId NOT IN ('aid1', 'aid2') ;

The first can take advantage of an index on Table_2(accountId, campaignId) and the second of an index on Table_2(parentAccountID, accountId, campaignId).

0
votes

Since this is Hive we're talking about, you need to look at more than just traditional DBMS tuning.

  • Reduce I/O. Use a compressed columnar format for your data: ORC or Parquet, not RC. Do this first; convert your tables to ORC (see the sketch after this list). Nothing else will make much of a dent unless the data is compressed and columnar.
  • Choose the proper JOIN strategy for Hive. This old 2011 paper is still relevant.
  • Bucket your tables.
  • Use a modern execution engine: Tez or Spark.
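A minimal sketch of the ORC conversion, assuming a Hive-enabled SparkSession or HiveContext named `spark`; the table names are placeholders:

// Sketch only: rewrite the lookup table as compressed, columnar ORC.
spark.sql("""
  CREATE TABLE table_2_orc
  STORED AS ORC
  AS SELECT campaignId, accountId, parentAccountID FROM table_2
""")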
0
votes

If the filtered T2 is small enough to fit in memory, try rewriting the query to move the filter into a subquery and see if the join gets executed on the mapper. Also, since you do not need any columns from T2, a LEFT SEMI JOIN can be used instead of an inner join:

set hive.cbo.enable=true; 
set hive.auto.convert.join=true;

SELECT T1.* 
FROM Table_1 T1 
LEFT SEMI JOIN 
     (select campaignId  from Table_2 T2 
        where T2.accountId IN ('aid1', 'aid2') 
           OR T2.parentAccountID IN ('aid1', 'aid2')
     ) T2 ON T1.campaignId = T2.campaignId 
;
0
votes

I would recommend using native Spark transformations rather than Hive SQL:

1. Read the data from Table_2 (RDBMS) into an RDD, keyed by campaignId, and cache it. For example:

// assuming the rows read from Table_2 (JDBC) are in rddTbl2Raw as a case class
// like Table2Row(campaignId, accountId, parentAccountID); key them by campaignId
val rddTbl2 = rddTbl2Raw.map(row => (row.campaignId, (row.accountId, row.parentAccountID))) // filter first if possible
rddTbl2.cache()

2. Now read the Table_1 stream from Kafka:

// keep only the campaigns whose accountId or parentAccountID is relevant
val rddTbl2_1 = rddTbl2.filter(x =>
  x._2._1.equals("aid1") || x._2._1.equals("aid2") ||
  x._2._2.equals("aid1") || x._2._2.equals("aid2"))

dstream.foreachRDD { rddTbl1 =>
  rddTbl1.map(x => x._2.split(",")).       // Kafka value is a CSV line
          map(x => (x(1), x(2))).          // (campaignId, hash) -- field positions assumed
          join(rddTbl2_1).                 // (campaignId, (hash, (accountId, parentAccountId)))
          map(x => (x._2._1, 1)).          // (hash, 1)
          reduceByKey(_ + _).              // campaign count per hash
          foreach(println)                 // save it if needed
}
0
votes

Ok..

Here is what I finally did.

I created a hash map of Table_2.
Then, by using a broadcast variable, I passed that data to every node.

This saved me the hassle of doing a join.
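Roughly, the idea looks like this (a simplified sketch; `table2Df`, the CSV field positions, and the account ids are illustrative, not my exact code):

// Build a hash map of Table_2 (campaignId -> (accountId, parentAccountID)) and broadcast it,
// so each batch can filter with a local lookup instead of a shuffle join.
val table2Map: Map[String, (String, String)] = table2Df.rdd
  .map(r => (r.getString(0), (r.getString(1), r.getString(2))))
  .collectAsMap()
  .toMap

val table2Bc = ssc.sparkContext.broadcast(table2Map)
val accounts = Set("aid1", "aid2")

dstream.foreachRDD { rdd =>
  rdd.map(_._2.split(","))                      // CSV line from Kafka
     .map(f => (f(1), f(2)))                    // (campaignId, hash) -- field positions assumed
     .filter { case (campaignId, _) =>
       table2Bc.value.get(campaignId)
         .exists { case (aid, pid) => accounts(aid) || accounts(pid) } }
     .map { case (_, hash) => (hash, 1) }
     .reduceByKey(_ + _)                        // campaign count per hash
     .foreach(println)
}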

Thank you all for your time and help. Happy coding :)