
I need to do a join between two Oracle tables and then process the data through Spark (in Java). What is the best option to do so?

  - Leverage the native Oracle join capability and load a single dataset into Spark through a query like "select * from table1,table2 where table1.fk = table2.pk", OR
  - Leverage Spark's join functionality by loading two different datasets (one for each Oracle table) and then performing the join through Dataset.join?

thank you!


1 Answer


Before doing the join in Spark, you can gather some performance metrics for the join query in both Oracle and Spark, and decide which one to choose based on those observations. A few pointers for that analysis:

  1. If the data set is reasonably small and the join will not create performance issues in the Oracle DB, do the join at the source (Oracle) itself; the joined result can even be loaded into Spark in one read, as shown in the sketch after this list.
  2. But when the data set is big (GBs or TBs) and the query performs poorly on Oracle, taking hours to complete, then you should definitely consider Spark, since its query latency for such workloads is lower than the RDBMS (Oracle).
  3. If the Oracle DB is a transactional database running lots of transaction queries for production-critical applications, analyse whether the join operation is choking the database. If it is, definitely consider offloading this data set to Spark and performing the join there.
  4. Do a performance evaluation in both Oracle and Spark and check which one does the join faster. Oracle may be faster than Spark when the data size is small: even though Spark works in memory, its query latency is in the range of seconds to minutes, not sub-second.
  5. If the data in the joined tables keeps growing and this is part of a recurring batch job, you may not want to run the join at the source every time and overburden it. If Spark is available in your organization, such operations can be offloaded to it.
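
For pointer 1, if Oracle does the join, the already-joined result can be loaded into Spark in a single read by passing the join query as a parenthesized subquery in the JDBC "dbtable" option. A minimal sketch, assuming an existing SparkSession named spark and hypothetical connection details, table names, and join columns:

Dataset<Row> joinedInOracle = spark.read()
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")    // placeholder connection URL
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  // Oracle performs the join; only the joined rows are transferred to Spark
  .option("dbtable", "(select t1.*, t2.some_col from schema.table1 t1 "
      + "join schema.table2 t2 on t1.fk = t2.pk) q")
  .option("user", "username")
  .option("password", "password")
  .load();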

Hope these pointers are helpful in deciding whether to go with Spark or not.

Below I keep my answer on how to choose between the Spark options.

You can do it either way: create a DataFrame for each table, register them as temp views in Spark memory, and run SQL queries on top of them much like in Oracle; there is no harm in doing that. The other way is to create a Dataset for each table and perform the join using the join function. Both approaches produce the same result, but from a performance point of view the Dataset API is more optimized, as it takes additional advantage of the underlying RDD machinery compared to working only with plain DataFrame SQL.
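
For completeness, here is a minimal sketch of the temp-view variant, assuming jdbcDF1 and jdbcDF2 are the Datasets loaded via JDBC as in the snippet further below; the view names and the join column "id" are illustrative:

jdbcDF1.createOrReplaceTempView("table1");
jdbcDF2.createOrReplaceTempView("table2");

// Spark SQL join over the registered views; it goes through the same Catalyst optimizer
Dataset<Row> joinedViaSql = spark.sql(
    "select t1.*, t2.* from table1 t1 join table2 t2 on t1.id = t2.id");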

Below is sample code for the Dataset operation:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// assumes a SparkSession; adjust appName/master as needed
SparkSession spark = SparkSession.builder().appName("OracleJoin").getOrCreate();

Dataset<Row> jdbcDF1 = spark.read()
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   // placeholder JDBC connection URL
  .option("driver", "oracle.jdbc.driver.OracleDriver")     // Oracle JDBC driver class
  .option("dbtable", "schema.table1")
  .option("user", "username")
  .option("password", "password")
  .load();

Dataset<Row> jdbcDF2 = spark.read()
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("dbtable", "schema.table2")
  .option("user", "username")
  .option("password", "password")
  .load();

// join the two Datasets on their key column
Dataset<Row> joined = jdbcDF1.join(jdbcDF2, jdbcDF1.col("id").equalTo(jdbcDF2.col("id")));
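
You can then keep processing the joined Dataset with further Spark transformations and actions, for example joined.show() to inspect a few rows.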