0 votes

Working with Spark Structured Streaming.

I am working on code where I need to do a lot of lookups on the data. The lookups are quite complex and just don't translate well to joins.

For example: look up field A in table B and fetch a value; if found, look up that value in another table; if not found, look up some other value C in table D; and so on.

I managed to write these lookups using HBase, and functionally it works fine. I wrote UDFs for each of these lookups; a very simple one might be:

import org.apache.spark.sql.functions.udf
import org.apache.hadoop.hbase.util.Bytes

val someColFunc = udf { (code: String) =>
  // look up 'code' in the lookup table and return the stored value, or null if absent
  val value = HbaseObject.table.getRow("lookupTable", code, "cf", "value1")
  if (value != null)
    Bytes.toString(value)
  else
    null
}
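A chained lookup like the one I described above follows the same shape. Roughly like this (the table, column family and qualifier names here are only placeholders):

// Sketch of the chained lookup described above; table/column names are placeholders
val chainedLookupFunc = udf { (a: String, c: String) =>
  val fromB = HbaseObject.table.getRow("tableB", a, "cf", "value1")
  if (fromB != null) {
    // found in table B: use that value as the key for the next lookup
    val next = HbaseObject.table.getRow("anotherTable", Bytes.toString(fromB), "cf", "value1")
    if (next != null) Bytes.toString(next) else null
  } else {
    // not found: fall back to looking up value C in table D
    val fromD = HbaseObject.table.getRow("tableD", c, "cf", "value1")
    if (fromD != null) Bytes.toString(fromD) else null
  }
}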

Since the Java HBase client is not serializable, I am passing the HBase object around like this:

object HbaseObject {
 val table = new HbaseUtilities(zkUrl)
}

HbaseUtilities is a class I wrote to simplify lookups. It just creates a Java HBase client and provides a wrapper for the kind of get commands I need.
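For reference, a minimal sketch of such a wrapper (not my exact class, but the getRow used above has this shape):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}

// Simplified sketch of an HbaseUtilities-style wrapper around the Java HBase client
class HbaseUtilities(zkUrl: String) {
  private val config: Configuration = HBaseConfiguration.create()
  config.set("hbase.zookeeper.quorum", zkUrl)
  private val connection: Connection = ConnectionFactory.createConnection(config)

  // returns the raw cell bytes, or null if the row/cell does not exist
  def getRow(tableName: String, rowKey: String, cf: String, qualifier: String): Array[Byte] = {
    val table = connection.getTable(TableName.valueOf(tableName))
    try {
      val result = table.get(new Get(Bytes.toBytes(rowKey)))
      result.getValue(Bytes.toBytes(cf), Bytes.toBytes(qualifier))
    } finally {
      table.close()
    }
  }
}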

This is making my code quite slow, which in itself is acceptable. What's puzzling me is that increasing or decreasing the number of executors or cores has no effect on the speed of my code: whether it's 1 executor or 30, it runs at the exact same rate. That makes me believe there is a lack of parallelism, and that all my workers must be sharing the same HBase object. Is there a way I can instantiate one such object on each worker before they start executing? I have already tried using lazy val; it has no effect.
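(What I tried with lazy val was simply this, hoping each executor JVM would build its own client on first access:)

object HbaseObject {
  // lazy so the client is only built when first used on each JVM
  lazy val table = new HbaseUtilities(zkUrl)
}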

I have even tried creating a sharedSingleton as shown here https://www.nicolaferraro.me/2016/02/22/using-non-serializable-objects-in-apache-spark/, which solved some problems for me but not the loss of parallelism.

I know there might be better ways to solve the problem, and all suggestions are very welcome, but right now I'm working under a few constraints and a tight timeline.


2 Answers

1 vote

You need to create all non-serializable objects in the executor. You can use foreachPartition or mapPartitions to create a connection in each executor.

Something similar to this (I'm using HBase client 2.0.0):

 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Get, Put, Result}
 import org.apache.hadoop.hbase.util.Bytes
 import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}


df.foreachPartition { partition =>
  // for each partition (i.e. on the executor), create the connection and the table
  val config: Configuration = HBaseConfiguration.create()
  config.set("hbase.zookeeper.quorum", "zk url")
  val connection: Connection = ConnectionFactory.createConnection(config)
  val table = connection.getTable(TableName.valueOf("tableName"))

  partition.foreach { record =>
    val byteKey = Bytes.toBytes(record.getString(0))
    val get = new Get(byteKey)
    val result = table.get(get)
    // DO YOUR LOGIC HERE FOR EACH RECORD
  }

  table.close()
  connection.close()
}

df is the DataFrame containing the records you want to look up.

You can create as many tables as you need for each executor from the same connection.

As you create all the objects in the executors, you don't need to deal with serialization problems. You can keep the logic in a class like your HbaseUtilities and use it there, but you need to create a new instance of it only inside foreachPartition/mapPartitions.
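If you need the looked-up values back as a Dataset instead of only side effects, a mapPartitions version of the same idea looks roughly like this (a sketch: the table, column family and qualifier names are placeholders, and spark is assumed to be your SparkSession):

import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import spark.implicits._ // String encoder; 'spark' is your SparkSession

val lookedUp = df.mapPartitions { partition =>
  // one connection per partition, created on the executor
  val config = HBaseConfiguration.create()
  config.set("hbase.zookeeper.quorum", "zk url")
  val connection = ConnectionFactory.createConnection(config)
  val table = connection.getTable(TableName.valueOf("tableName"))

  // materialize the results before closing the connection, since iterators are lazy
  val results = partition.map { record =>
    val result = table.get(new Get(Bytes.toBytes(record.getString(0))))
    Option(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("value1")))
      .map(Bytes.toString)
      .orNull
  }.toList

  table.close()
  connection.close()
  results.iterator
}

The toList before closing the connection matters: the iterator you return is consumed after your function exits, so the gets have to be forced while the connection is still open.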

0 votes

You can accomplish what you are trying to do by using the HBase-Spark Connector from the main branch of the HBase project. For some reason the connector doesn't seem to be included in any official HBase builds, but you can build it yourself and it works fine. Just build the jar and include it in your pom.xml.

Once built, the connector will allow you to pass the HBase Connection object inside the Worker class, so you don't have to worry about serializing the connection or building singletons/etc.

For example:

JavaSparkContext jSPContext ...; //Create Java Spark Context
HBaseConfiguration hbConf = HBaseConfiguration.create();
hbConf.set("hbase.zookeeper.quorum", zkQuorum);
hbConf.set("hbase.zookeeper.property.clientPort", PORT_NUM);
// this is your key link to HBase from Spark -- use it every time you need to access HBase inside the Spark parallelism:
JavaHBaseContext hBaseContext = new JavaHBaseContext(jSPContext, hbConf);   

// Create an RDD and parallelize it with HBase access:
JavaRDD<String> myRDD = ... //create your RDD
hBaseContext.foreachPartition(myRDD,  new SparkHBaseWorker());
// You can also do other usual Spark tasks, like mapPartitions, forEach, etc.

// The Spark worker class for foreachPartition use-case on RDD of type String would look something like this:
class SparkHBaseWorker implements VoidFunction<Tuple2<Iterator<String>, Connection>>
{
    private static final long serialVersionUID = 1L;
    
    public SparkHBaseWorker()
    {
    }
    
// Put all your HBase logic into this function:
    @Override
    public void call(Tuple2<Iterator<String>, Connection> t) throws Exception
    {           
        // This is your HBase connection object:
        Connection conn = t._2();
        // Now you can do direct access to HBase from this Spark worker node:
        Table hbTable = conn.getTable(TableName.valueOf(MY_TABLE));
        // now do something with the table/etc.
    }
}