I have two views in my hive
+------------+
| Table_1 |
+------------+
| hash |
| campaignId |
+------------+
+-----------------+
| Table_2 |
+-----------------+
| campaignId |
| accountId |
| parentAccountID |
+-----------------+
Now I have to fetch 'Table_1' data filtered by accountId & parentAccountID, for which I have written the following query:
SELECT /*+ MAPJOIN(T2) */ T1.hash, COUNT(T1.campaignId) num_campaigns
FROM Table_1 T1
JOIN Table_2 T2 ON T1.campaignId = T2.campaignId
WHERE (T2.accountId IN ('aid1', 'aid2') OR T2.parentAccountID IN ('aid1', 'aid2')
GROUP BY T1.hash
This query is working but slow. Is there any better alternative to this (JOIN)?
I am reading Table_1 from kafka through spark.
Slide Duration is 5 sec
Window Duration is 2 minutes
While Table_2 is in RDBMS which spark is reading through jdbc and this has 4500 records.
Every 5 sec, kafka pumps in approximately 2K records in CSV format.
I need the data to be processed within 5 seconds but currently its taking between 8 to 16 seconds.
As per suggestions:
- I have repartitioned Table_1 by columns campaignId & hash respectively.
- I have repartitioned Table_2 by columns accountId & parentAccountID respectively.
- I have implemented MAPJOIN.
But still no improvement.
NOTE: If I remove the window duration, then the process does get executed within time. May be because of less data to process. But that is not the requirement.