Spark cluster does not scale to small data

Question

i am currently evaluating Spark 2.1.0 on a small cluster (3 Nodes with 32 CPUs and 128 GB Ram) with a benchmark in linear regression (Spark ML). I only measured the time for the parameter calculation (not including start, data loading, …) and recognized the following behavior. For small datatsets 0.1 Mio – 3 Mio datapoints the measured time is not really increasing and stays at about 40 seconds. Only with larger datasets like 300 Mio datapoints the processing time went up to 200 seconds. So it seems, the cluster does not scale at all to small datasets.

I also compared the small dataset on my local pc with the cluster using only 10 worker and 16GB ram. The processing time of the cluster is larger by a factor of 3. So is this considered normal behavior of SPARK and explainable by communication overhead or am I doing something wrong (or is linear regression not really representative)?

The cluster is a standalone cluster (without Yarn or Mesos) and the benchmarks where submitted with 90 worker, each with 1 core and 4 GB ram.

Spark submit: ./spark-submit --master spark://server:7077 --class Benchmark --deploy-mode client --total-executor-cores 90 --executor-memory 4g --num-executors 90 .../Benchmark.jar pathToData

I'm unsure if you are unhappy with the performance on the smaller 0.1-0.3M datasets, or the larger 300M dataset? — ImDarrenG
Hi, i'm not unhappy with the performance. I was just wondering if it is normal for a cluster to take half a minute for the computation even though the data are already loaded and quite small. — Andreas Bartschat
I would say that your observations are reasonable. I will provide a more detailed answer once I've had some sleep - if no one else does in the meantime. — ImDarrenG

ImDarrenG ImDarrenG · Accepted Answer · 2017-04-13T08:13:32

The optimum cluster size and configuration varies based on the data and the nature of the job. In this case, I think that your intuition is correct, the job seems to take disproportionately longer to complete on smaller dataset, because of the excess overhead given the size of the cluster (cores and executors).

Notice that increasing the amount of data by two orders of magnitude increases the processing time only 5-fold. You are increasing the data toward an optimum size for your cluster setup.

Spark is a great tool for processing lots of data, but it isn't going to be competitive with running a single process on a single machine if the data will fit. However it can be much faster than other distributed processing tools that are disk-based, where the data does not fit on a single machine.

I was at a talk a couple years back and the speaker gave an analogy that Spark is like a locomotive racing a bicycle:- the bike will win if the load is light, it is quicker to accelerate and more agile, but with a heavy load the locomotive might take a while to get up to speed, but it's going to be faster in the end. (I'm afraid I forget the speakers name, but it was at a Cassandra meetup in London, and the speaker was from a company in the energy sector).

Spark cluster does not scale to small data

2 Answers