I have two tables in Hive/Impala. I want to fetch the data from these tables into Spark as RDDs and perform, say, a join operation.
I do not want to pass the join query directly to my HiveContext; that is just an example. I have more use cases that are not possible with standard HiveQL. How do I fetch all the rows, access the columns, and perform transformations?
Suppose I have two RDDs:
val table1 = hiveContext.hql("select * from tem1")
val table2 = hiveContext.hql("select * from tem2")
I want to join these RDDs on a column called "account_id".
Ideally, using the Spark shell, I want to do something equivalent to this:
select * from tem1 join tem2 on tem1.account_id=tem2.account_id;
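The RDD-level version of this query is a keyBy-then-join: pair each row with its account_id, then join the two pair collections on that key. As a hedged sketch (the table and column names come from the question; the row payloads here are made-up stand-ins, and plain Scala collections are used in place of actual pair RDDs so the snippet is self-contained, since Spark's pair-RDD `keyBy`/`join` have the same semantics):

```scala
// Sketch of the keyBy + join pattern. On real Spark 1.x RDDs it would look like:
//   val t1 = hiveContext.hql("select * from tem1").keyBy(r => r(0))
//   val t2 = hiveContext.hql("select * from tem2").keyBy(r => r(0))
//   val joined = t1.join(t2)   // RDD[(key, (row1, row2))]
// The plain-Scala helpers below mirror that behaviour for illustration.
object RddJoinSketch {
  // Like rdd.keyBy(f): pair each row with the key extracted from it
  def keyBy[K, V](rows: Seq[V])(f: V => K): Seq[(K, V)] =
    rows.map(r => (f(r), r))

  // Like pairRdd1.join(pairRdd2): inner join on matching keys
  def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
    for {
      (k, a)  <- left
      (k2, b) <- right
      if k == k2
    } yield (k, (a, b))

  def main(args: Array[String]): Unit = {
    // Hypothetical rows: (account_id, payload) tuples standing in for Hive rows
    val tem1 = Seq((1, "alice"), (2, "bob"))
    val tem2 = Seq((1, 100.0), (3, 50.0))

    val joined = join(keyBy(tem1)(_._1), keyBy(tem2)(_._1))
    println(joined) // only account_id 1 appears in both tables
  }
}
```

On a real cluster the payload would be a `Row`, so the key extractor would read the column by position or name (e.g. `row.getString(0)`); after the join, the columns of both tables are reachable through the `(row1, row2)` pair for whatever transformation comes next.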