I am experiencing high latency between my Spark nodes and HBase nodes. With my current resources I have to run HBase and Spark on different servers.
The HFiles are compressed with the Snappy algorithm, which reduces each region from 50 GB to 10 GB on disk.
Nevertheless, the data is decompressed before it goes over the wire, so reading takes a lot of time: roughly 20 MB/s, which works out to about 45 minutes per 50 GB region.
What can I do to make data reading faster? (Or, is the current throughput considered high for HBase?)
I was thinking of cloning the HBase HFiles locally onto the Spark machines, instead of continuously requesting the data from HBase over the network. Is that possible?
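To make the idea concrete, here is a rough sketch of what I have in mind, using HBase's `TableSnapshotInputFormat` to read snapshot HFiles directly from HDFS, bypassing the RegionServers. The snapshot name (`my_snapshot`) and restore path are placeholders for illustration, and I have not verified this against my cluster:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("snapshot-read"))

val hbaseConf = HBaseConfiguration.create()
val job = Job.getInstance(hbaseConf)

// Point the input format at an existing snapshot; the restore dir is a
// scratch location on HDFS where snapshot references get materialized.
TableSnapshotInputFormat.setInput(job, "my_snapshot", new Path("/tmp/snapshot_restore"))

// Read the snapshot's HFiles directly, without going through RegionServers.
val rdd = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[TableSnapshotInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(s"rows read: ${rdd.count()}")
```

Is something along these lines the right direction, or is there a better-supported approach?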
What is the best practice for solving such an issue?
Thanks