Python vs Scala (for Spark jobs)

Question

I am pretty new to Spark, currently exploring it by playing with pyspark and spark-shell.

So here is the situation, I run same spark jobs with pyspark and spark-shell.

This is from pyspark:

textfile = sc.textFile('/var/log_samples/mini_log_2')
textfile.count()

And this one from spark-shell:

textfile = sc.textFile("file:///var/log_samples/mini_log_2")
textfile.count()

I tried both of them several times, first (python) one takes 30-35 seconds to complete while second one (scala) takes about 15 seconds. I am curious about what may cause this different performance results? Is it because of choice of language or spark-shell do something in background that pyspark don't?

UPDATE

So I did some tests on larger datasets, about 550 GB (zipped) in total. I am using Spark Standalone as master.

I observed that while using pyspark, tasks are equally shared among executors. However when using spark-shell, tasks are not shared equally. More powerful machines get more tasks while weaker machines gets fewer tasks.

With spark-shell, job is finished in 25 minutes and with pyspark it is around 55 minutes. How can I make Spark Standalone assign tasks with pyspark, as it assigns tasks with spark-shell?

spark-shell

Pyspark

Ophir Yoktan Ophir Yoktan · Accepted Answer · 2015-05-27T09:49:18

Using python has some overhead, but it's significance depends on what you're doing. Though recent reports indicate the overhead isn't very large (specifically for the new DataFrame API)

some of the overhead you encounter relates to constant per job overhead - which is almost irrelevant for large jobs. You should to do a sample benchmark with a larger data set, and see if the overhead is a constant addition, or if it's proportional to the data size.

Another potential bottleneck is operations that apply a python function for each element (map, etc.) - if these operations are relevant for you, you should test them too.

Python vs Scala (for Spark jobs)

1 Answers