I am currently evaluating Spark 2.1.0 on a small cluster (3 nodes with 32 CPUs and 128 GB RAM) with a linear regression benchmark (Spark ML). I measured only the time for the parameter calculation (not including startup, data loading, …) and noticed the following behavior. For small datasets of 0.1 to 3 million datapoints, the measured time barely increases and stays at about 40 seconds. Only with larger datasets, such as 300 million datapoints, does the processing time go up to 200 seconds. So the cluster seems not to scale at all for small datasets.
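To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes (and this is only my assumption, not something I have verified) that the runtime is a fixed overhead plus a constant cost per datapoint, and it fits a straight line through two of the measurements above: about 40 seconds at 3 million datapoints and about 200 seconds at 300 million.

```python
# Rough two-point fit of t = overhead + cost_per_million * n.
# The linear model is an assumption; the two points are the measurements
# quoted above as (dataset size in millions of datapoints, seconds).
small = (3.0, 40.0)
large = (300.0, 200.0)


def fit_line(p1, p2):
    """Return (intercept, slope) of the straight line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return intercept, slope


overhead_s, cost_per_million_s = fit_line(small, large)
print(f"fixed overhead: {overhead_s:.1f} s")
print(f"marginal cost: {cost_per_million_s:.2f} s per million datapoints")
```

Under that assumption, roughly 38 of the 40 seconds on the small datasets would be size-independent overhead, and each additional million datapoints would add only about half a second.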
I also compared the small dataset on my local PC with the cluster using only 10 workers and 16 GB RAM. The processing time on the cluster is larger by a factor of 3. So is this considered normal behavior for Spark and explainable by communication overhead, or am I doing something wrong (or is linear regression not really representative)?
The cluster is a standalone cluster (without YARN or Mesos), and the benchmarks were submitted with 90 workers, each with 1 core and 4 GB RAM.
Spark submit: ./spark-submit --master spark://server:7077 --class Benchmark --deploy-mode client --total-executor-cores 90 --executor-memory 4g --num-executors 90 .../Benchmark.jar pathToData