For a Spark execution in pyspark, two components are required to work together:
- the pyspark Python package
- a Spark instance in a JVM
When launching with spark-submit or pyspark, these scripts take care of both: they set up your PYTHONPATH, PATH, etc. so that your script can find pyspark, and they also start the Spark instance, configuring it according to your parameters, e.g. --master X.
Alternatively, it is possible to bypass these scripts and run your Spark application directly in the Python interpreter, like python myscript.py. This is especially interesting when Spark scripts start to become more complex and eventually receive their own arguments.
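As a hedged illustration (the --spark-master option below is invented for this answer, it is not a Spark flag), such a script might parse its own arguments and set the Spark-related ones aside to apply later:

import argparse

# Hypothetical CLI for a script launched as "python myscript.py ..."
parser = argparse.ArgumentParser()
parser.add_argument("--input", help="application-specific argument")
parser.add_argument("--spark-master", default="local[4]",
                    help="translated into Spark's --master when building the session")
args = parser.parse_args()

# Kept aside for later, when the Spark instance is configured from the script
spark_main_opts = "--master " + args.spark_master

With the launcher scripts out of the picture, three things need to be done: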
- Ensure the pyspark package can be found by the Python interpreter. As already discussed, either add the spark/python dir to PYTHONPATH (see the sketch after this list) or install pyspark directly with pip install pyspark.
- Set the parameters of the Spark instance from your script (those that used to be passed to pyspark):
- Spark configurations that you would normally set with --conf are defined with a config object (or string configs) via SparkSession.builder.config.
- Main options (like --master or --driver-memory) can, for the moment, be set by writing to the PYSPARK_SUBMIT_ARGS environment variable. To make things cleaner and safer you can set it from within Python itself, and Spark will read it when starting.
- Start the instance, which just requires you to call getOrCreate() from the builder object.
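For the first point, here is a minimal sketch of the PYTHONPATH route, assuming a Spark distribution whose location is given by SPARK_HOME (the /opt/spark fallback below is only a placeholder); with a pip-installed pyspark this step is unnecessary:

import glob
import os
import sys

# Make the pyspark package bundled with the Spark distribution importable.
# Skip this entirely if pyspark was installed with pip.
spark_home = os.environ.get("SPARK_HOME", "/opt/spark")  # placeholder default
sys.path.insert(0, os.path.join(spark_home, "python"))
# py4j ships as a versioned zip inside the distribution
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0])

import pyspark  # should now resolve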
Your script can therefore have something like this:
import os

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Main options you would otherwise pass to pyspark
    # (hardcoded here for the example; in a real script they might come
    # from its own arguments, as sketched earlier)
    spark_main_opts = "--master local[4]"

    if spark_main_opts:
        # Set main options, e.g. "--master local[4]"
        os.environ['PYSPARK_SUBMIT_ARGS'] = spark_main_opts + " pyspark-shell"

    # Set spark config
    spark = (SparkSession.builder
             .config("spark.checkpoint.compress", True)
             .config("spark.jars.packages", "graphframes:graphframes:0.5.0-spark2.1-s_2.11")
             .getOrCreate())
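Once getOrCreate() returns, an optional sanity check (not part of the original script) confirms that the main options and configs were picked up:

    # Confirm the settings above took effect
    print(spark.sparkContext.master)                    # e.g. local[4]
    print(spark.conf.get("spark.checkpoint.compress"))  # the value set above
    print(spark.conf.get("spark.jars.packages"))        # the graphframes coordinates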