How do I get a small Python script to hook into an existing instance of Spark and do operations on existing RDDs?
I'm in the early stages of working with Spark on Windows 10, trying scripts on a "Local" instance. I'm working with the latest stable build of Spark (Spark 2.0.1 for Hadoop 2.7). I've installed and set environment variables for Hadoop 2.7.3. I'm experimenting with both the Pyspark shell and Visual Studio 2015 Community with Python.
I'm trying to build a large engine, on which I'll run individual scripts to load, massage, format, and access the data. I'm sure there's a normal way to do that; isn't that the point of Spark?
Anyway, here's my experience so far, which I gather is expected behavior: when I build a small Spark script in Python and run it from Visual Studio, the script runs, does its job, and exits. In the process of exiting, it also shuts down the SparkContext it was using.
So I had the following thought: What if I started a persistent Spark Context in Pyspark and then set my SparkConf and SparkContext in each Python script to connect to that Spark Context? So, looking up online what the defaults are for Pyspark, I tried the following:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("PySparkShell")
sc = SparkContext(conf=conf)
I started Pyspark, then in a separate script in Visual Studio I used the code above to create the SparkContext and loaded a text file into an RDD named RDDFromFilename. But once the script had run, I couldn't access that RDD from the Pyspark shell.
How do I start a persistent SparkContext, create an RDD in it from one Python script, and access that RDD from subsequent Python scripts, particularly on Windows?