3
votes

I am running a local environment with Spark, PySpark, IPython and MySQL. I am struggling to launch a MySQL query via Spark. The main issue is including the proper JDBC jar so the query can be performed.

Here is what I have so far :

import pyspark
conf = (pyspark.SparkConf()
        .setMaster('local')
        .setAppName('Romain_DS')
        .set("spark.executor.memory", "1g")
        .set("spark.driver.extraLibraryPath","mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar")
        .set("spark.driver.extraClassPath","mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar")
    )
sc = pyspark.SparkContext(conf=conf)

This is meant to properly create the Spark context and point it to the jar containing the JDBC driver.

Then I create an SQLContext :

from pyspark.sql import SQLContext
sqlsc=SQLContext(sc)

And finally the query :

MYSQL_USERNAME = "root";
MYSQL_PWD = "rootpass";
MYSQL_CONNECTION_URL = "jdbc:mysql://127.0.0.1:33060/O_Tracking?user=" + MYSQL_USERNAME + "&password=" + MYSQL_PWD;
query = 'Select * from tracker_action'

dataframe_mysql = sqlsc.read.format("jdbc").options(
    url = MYSQL_CONNECTION_URL,
    dbtable = "tracker_action",
    driver = "com.mysql.jdbc.Driver",
    user="root",
    password="rootpass").load()

If I run this in the ipython notebook I get the error :

An error occurred while calling o198.load. : java.lang.ClassNotFoundException: com.mysql.jdbc.Driver

However, if I do everything from the shell (and not IPython), initializing the Spark context this way :

pyspark --driver-library-path './mysql-connector-java-5.1.37-bin.jar' --driver-class-path './mysql-connector-java-5.1.37-bin.jar'

It does work... I looked into the Spark UI and the configurations are the same, so I don't understand why one would work and not the other. Is it something to do with the runtime settings applied before the JVM starts?

If I cannot find a proper solution, we could potentially think of running the SparkContext in the shell and then using it from IPython, but I have no idea how to do that.

If someone can help me with that, it would be great.

---- Hardware / Software

Mac OS X

Spark 1.5.2

Java 1.8.0

Python 2.7.10 :: Anaconda 2.3.0 (x86_64)

---- Sources to help:

https://gist.github.com/ololobus/4c221a0891775eaa86b0

http://spark.apache.org/docs/latest/configuration.html

Following the comments, here is my conf file :

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
spark.driver.extraLibraryPath   /Users/romainbui/mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar
spark.driver.extrClassPath  /Users/romainbui/mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar
spark.AppName   PySpark
spark.setMaster Local

--------- Solution ---------

Thanks to the comments, I was finally able to get a clean, working solution.

Step 1 : Creating a profile :

ipython profile create pyspark

Step 2 : Edit the profile startup script :

touch ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py

Step 3 : Fill the file. Here I did something custom (thanks to the comments) :

import findspark
import os
import sys
findspark.init()
spark_home = findspark.find()

#spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")

# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Adding the library to mysql connector
packages = "mysql:mysql-connector-java:5.1.37"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages {0} pyspark-shell".format(
    packages
)

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))

Then you can simply run the notebook with :

ipython notebook --profile=pyspark
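
For reference, this is roughly how the original JDBC read looks in that notebook (a sketch, assuming shell.py has predefined sc and sqlContext, and the same local MySQL instance and credentials as above) :

# Sketch: `sc` and `sqlContext` are predefined by shell.py; URL, table and
# credentials are the ones from the question.
MYSQL_CONNECTION_URL = "jdbc:mysql://127.0.0.1:33060/O_Tracking"

dataframe_mysql = sqlContext.read.format("jdbc").options(
    url=MYSQL_CONNECTION_URL,
    dbtable="tracker_action",
    driver="com.mysql.jdbc.Driver",
    user="root",
    password="rootpass").load()

dataframe_mysql.show(5)  # works now that the connector is on the classpath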

1 Answer

2
votes

I don't understand why one would work and not the other ... Is it something to do with the runtime settings applied before the JVM starts?

More or less. The IPython configuration you've shown executes python/pyspark/shell.py, which creates a SparkContext (and some other objects) and starts a JVM instance. When you create another context later, it uses the same JVM, so parameters like spark.driver.extraClassPath won't be applied.
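
To illustrate (a rough sketch, assuming shell.py has already created sc in the notebook), even stopping and recreating the context won't help, because the driver JVM is already running without the jar:

import pyspark

# The driver JVM was started by shell.py; stopping the context does not
# restart it, so classpath settings passed to a new context come too late.
sc.stop()
conf = (pyspark.SparkConf()
        .setMaster('local')
        .set("spark.driver.extraClassPath",
             "mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar"))
sc = pyspark.SparkContext(conf=conf)  # same JVM, jar still not on its classpath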

There are multiple ways you can handle this, including passing arguments using PYSPARK_SUBMIT_ARGS or setting spark.driver.extraClassPath in $SPARK_HOME/conf/spark-defaults.conf.
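
For example, the spark-defaults.conf route would look roughly like this (the jar path is the one from the question; adjust it to your own install):

# $SPARK_HOME/conf/spark-defaults.conf
spark.driver.extraClassPath    /Users/romainbui/mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar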

Alternatively, you can add the following lines to 00-pyspark-setup.py before shell.py is executed:

packages = "mysql:mysql-connector-java:5.1.37"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages {0} pyspark-shell".format(
    packages
)

Setting --driver-class-path / --driver-library-path there should work as well.
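
For completeness, that variant would look something like this in 00-pyspark-setup.py (a sketch; the jar path is taken from the question's setup):

import os

# Point the driver classpath at the local connector jar before shell.py
# launches the JVM, instead of pulling it with --packages.
jar = "/Users/romainbui/mysql-connector-java-5.1.37/mysql-connector-java-5.1.37-bin.jar"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-class-path {0} --driver-library-path {0} pyspark-shell".format(jar)
)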