I am running a local environment with Spark, PySpark, IPython and MySQL. I am struggling to launch a MySQL query via Spark. The main issue is including the proper JDBC jar so the query can be performed.
Here is what I have so far:
import pyspark
conf = (pyspark.SparkConf()
        .setMaster('local')
        .setAppName('Romain_DS')
        .set("spark.executor.memory", "1g")
        .set("spark.driver.extraLibraryPath", "mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar")
        .set("spark.driver.extraClassPath", "mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar")
       )
sc = pyspark.SparkContext(conf=conf)
This is meant to create the Spark context correctly and point it at the jar containing the JDBC driver.
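A quick sanity check at this point (a sketch, not part of the original setup) is to confirm that the relative jar path actually resolves from the notebook's working directory, and to inspect what was really set on the SparkConf:
import os
jar = "mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar"
# A relative path like this is resolved against the notebook's working directory
print("jar resolves to %s (exists: %s)" % (os.path.abspath(jar), os.path.exists(jar)))
# Lists every key/value pair that was set on the SparkConf
print(conf.getAll())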
Then I create an SQLContext:
from pyspark.sql import SQLContext
sqlsc=SQLContext(sc)
And finally the query:
MYSQL_USERNAME = "root"
MYSQL_PWD = "rootpass"
MYSQL_CONNECTION_URL = "jdbc:mysql://127.0.0.1:33060/O_Tracking?user=" + MYSQL_USERNAME + "&password=" + MYSQL_PWD
query = 'Select * from tracker_action'
dataframe_mysql = sqlsc.read.format("jdbc").options(
    url=MYSQL_CONNECTION_URL,
    dbtable="tracker_action",
    driver="com.mysql.jdbc.Driver",
    user="root",
    password="rootpass").load()
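Side note: the query variable above is never actually used; if the goal is to have MySQL execute the query instead of Spark loading the whole table, the JDBC source also accepts a subquery as dbtable (a sketch, the alias name is arbitrary):
# Hypothetical variant: wrap the query in a subselect so MySQL executes it
dataframe_mysql = sqlsc.read.format("jdbc").options(
    url=MYSQL_CONNECTION_URL,
    dbtable="(" + query + ") AS tracker_action_subq",
    driver="com.mysql.jdbc.Driver",
    user=MYSQL_USERNAME,
    password=MYSQL_PWD).load()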
If I run this in the IPython notebook I get the error:
An error occurred while calling o198.load. : java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
However, if I do everything from the shell (and not IPython), initializing the Spark context this way:
pyspark --driver-library-path './mysql-connector-java-5.1.37-bin.jar' --driver-class-path './mysql-connector-java-5.1.37-bin.jar'
It does work... I looked into the Spark UI and the configurations are the same, so I don't understand why one works and not the other. Does it have something to do with settings that must be in place before the JVM starts?
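As the solution below shows, the difference is that --driver-class-path is applied when the driver JVM is launched, whereas a SparkConf built inside an already-running notebook kernel comes too late. A minimal inline sketch of the same idea, untested and reusing the jar path from the conf file below, would be to set PYSPARK_SUBMIT_ARGS before creating the context:
import os
# Must run before the SparkContext, and therefore the driver JVM, is created
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-class-path /Users/romainbui/mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar "
    "--jars /Users/romainbui/mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar "
    "pyspark-shell"
)
import pyspark
sc = pyspark.SparkContext(conf=pyspark.SparkConf().setMaster('local').setAppName('Romain_DS'))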
If I cannot find a proper solution, we could potentially think about running the sc in the shell and then using it from IPython, but I have no idea how to do that.
If someone can help me with that, it would be great.
---- Hardware / Software
Mac OS X
Spark 1.5.2
Java 1.8.0
Python 2.7.10 :: Anaconda 2.3.0 (x86_64)
---- Sources to help:
https://gist.github.com/ololobus/4c221a0891775eaa86b0 http://spark.apache.org/docs/latest/configuration.html
Following the comments, here is my conf file:
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
spark.driver.extraLibraryPath /Users/romainbui/mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar
spark.driver.extraClassPath /Users/romainbui/mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar
spark.app.name PySpark
spark.master local
--------- Solution ---------
Thanks to the comments I was finally able to get a proper working solution (and a clean one).
Step 1: Create a profile:
ipython profile create pyspark
Step 2: Edit the profile startup script:
touch ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py
Step 3: Fill in the file. Here I did something custom (thanks to the comments):
import findspark
import os
import sys

# Locate the local Spark installation and make pyspark importable
findspark.init()
spark_home = findspark.find()
#spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
# Adding the library to mysql connector
packages = "mysql:mysql-connector-java:5.1.37"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages {0} pyspark-shell".format(packages)
# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
Then you can simply run the notebook with:
ipython notebook --profile=pyspark
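With that profile in place, shell.py already defines sc and sqlContext in the notebook, so (assuming the same database and credentials as above) the JDBC read works with no extra classpath handling:
MYSQL_CONNECTION_URL = "jdbc:mysql://127.0.0.1:33060/O_Tracking?user=root&password=rootpass"
# sqlContext is created by shell.py; the connector jar is pulled in via --packages
dataframe_mysql = sqlContext.read.format("jdbc").options(
    url=MYSQL_CONNECTION_URL,
    dbtable="tracker_action",
    driver="com.mysql.jdbc.Driver").load()
dataframe_mysql.show()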