I have two tables. Both are external tables in Hive, stored in Parquet format.
The first table, table_1, has about 250 million rows per day going back to 2015. It is partitioned on create_date, so each create_date partition holds roughly 250M rows.
The second table, table_2, is a daily delta table with an average of about 1.5 million rows.
Both tables share a common column, "lookup_id". I need to fetch all columns from table_1 for the delta data in table_2, using DataFrames.
I thought of doing something like this:
table_1 = spark.table("table_1")
table_2 = spark.table("table_2")
# inner join on lookup_id, then drop the duplicate join column from table_2
result_df = table_1.join(table_2, table_1.lookup_id == table_2.lookup_id, "inner") \
                   .drop(table_2.lookup_id)
But I doubt whether this is really efficient and whether PySpark can handle it without memory errors.
Question 1: How can I parallelize the table_1 scan based on its create_date partitions?
Question 2: Is there any other way to optimize the table_1 scan based on the lookup_ids from table_2 and/or on the partitions? (A rough sketch of what I have in mind follows.)
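To make Question 2 more concrete, here is a rough sketch of the kind of approach I am considering: prune table_1 to only the partitions referenced by the delta, and broadcast the small table_2. This assumes table_2 also carries a create_date column that matches table_1's partition column, which may or may not hold for my data.

from pyspark.sql.functions import broadcast, col

# Assumption: table_2 also has a create_date column that identifies which
# table_1 partitions its rows belong to.
delta_dates = [row["create_date"] for row in
               table_2.select("create_date").distinct().collect()]

# Restrict the scan of table_1 to only those partitions (partition pruning),
# instead of reading every create_date since 2015.
pruned_table_1 = table_1.where(col("create_date").isin(delta_dates))

# table_2 (~1.5M rows) should be small enough to broadcast, so each executor
# joins while it scans its slice of table_1, avoiding a full shuffle of table_1.
result_df = pruned_table_1.join(broadcast(table_2), "lookup_id", "inner")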
Additional info to give more clarity on what I am looking for:
I am trying to understand: when we join the tables using DataFrames, does Spark read all the data into memory and then join it, or does it join while reading? If the latter is true, which join strategies does that apply to? Also, is there any need to use a loop to avoid memory errors?
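To see for myself which strategy Spark actually picks, I have been looking at the physical plan (this is just how I am inspecting it; the comments describe what I believe the relevant plan nodes are):

# Print the physical plan for the join without executing it.
result_df.explain()

# What I am looking for in the output:
#   - "BroadcastHashJoin": the small side is shipped to every executor and the
#     join happens while table_1 is being scanned (no shuffle of table_1)
#   - "SortMergeJoin": both sides are shuffled and sorted before joining
#   - "PartitionFilters" on the table_1 scan node, which would confirm that
#     only the relevant create_date partitions are actually being read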
Is there a relationship between table1.create_date and table2.create_date? For example, is it true that if row1.lookup_id == row2.lookup_id then row1.create_date == row2.create_date, for row1 ∈ table1 and row2 ∈ table2? – gudok