2
votes

I am experiencing high latency between Spark nodes and HBase nodes. The resources I currently have require me to run HBase and Spark on different servers.

The HFiles are compressed with the Snappy algorithm, which reduces the data size of each region from 50 GB to 10 GB.

Nevertheless, the data transferred over the wire is always decompressed, so reading takes a lot of time: approximately 20 MB per second, which is about 45 minutes for each 50 GB region.

What can I do to read the data faster? (Or is the current throughput considered high for HBase?)

I was thinking of copying the HBase HFiles locally to the Spark machines, instead of continuously requesting data from HBase. Is that possible?

What is the best practice for solving such an issue?

Thanks

1
Is it the reading of files from disk that takes time, or the transfer of data over the network? Please mention the hardware/network configuration, cluster config, and the way you are reading the HBase data from Spark. - Sumit
The data transfer over the network takes time; the data is not read from disk. 4x 16-core, 32 GB RAM servers, 10GBps network connection, and each server hosts 16 Spark workers. The cluster is Spark Standalone. Reading from HBase using the standard TableInputFormat. - imriqwe

1 Answer

1
vote

You are thinking in the right direction. You can copy the HFiles to the HDFS cluster (or machines) where Spark is running. That would save the decompression step and reduce the data transferred over the wire. You would need to read the Snappy-compressed HFiles and write a parser to read them.
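If you go that route, one way to read HFiles from HDFS without hand-writing a parser (a different technique than a custom parser, so take it as an alternative) is to take an HBase snapshot of the table and read it with TableSnapshotInputFormat, which scans the HFiles directly and bypasses the RegionServers. A minimal sketch, assuming a snapshot named "my_snapshot" already exists and /tmp/snapshot_restore is a writable HDFS path (both placeholders):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.{SparkConf, SparkContext}

    object SnapshotRead {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hbase-snapshot-read"))

        val hbaseConf = HBaseConfiguration.create()
        // Job is only used as a container for the InputFormat configuration.
        val job = Job.getInstance(hbaseConf)

        // Snapshot name and restore directory are placeholders for your own values.
        TableSnapshotInputFormat.setInput(job, "my_snapshot", new Path("/tmp/snapshot_restore"))

        // Each record is (row key, Result), read straight from the HFiles on HDFS,
        // so no RegionServer ships decompressed data over the wire.
        val rdd = sc.newAPIHadoopRDD(
          job.getConfiguration,
          classOf[TableSnapshotInputFormat],
          classOf[ImmutableBytesWritable],
          classOf[Result])

        println(s"rows: ${rdd.count()}")
        sc.stop()
      }
    }

Note that the data still gets decompressed when it is read, but the decompression happens on the Spark side and nothing crosses the network uncompressed.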

Alternatively, you can apply column and column-family filters if you don't need all the data from HBase.
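For the filtering approach, here is a sketch of how the Scan used by TableInputFormat can be narrowed before handing it to Spark; the table name "my_table", family "cf", and qualifier "q" are placeholders for your own schema:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Result, Scan}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.{SparkConf, SparkContext}

    object FilteredRead {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hbase-filtered-read"))

        // Only ask the RegionServers for the cells that are actually needed.
        val scan = new Scan()
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"))
        scan.setCaching(500) // more rows per RPC, fewer round trips

        val hbaseConf = HBaseConfiguration.create()
        hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")
        hbaseConf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

        val rdd = sc.newAPIHadoopRDD(
          hbaseConf,
          classOf[TableInputFormat],
          classOf[ImmutableBytesWritable],
          classOf[Result])

        println(s"rows: ${rdd.count()}")
        sc.stop()
      }
    }

This keeps the read path you already use; it only shrinks how much of each row the RegionServers send over the wire.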