0 votes

I am trying to connect to Snowflake from PySpark on my local machine.

My code is as follows:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import *

    sc = SparkContext("local", "sf_test")
    spark = SQLContext(sc)
    spark_conf = SparkConf().setMaster('local').setAppName('sf_test')

    sfOptions = {
      "sfURL" : "someaccount.some.address",
      "sfAccount" : "someaccount",
      "sfUser" : "someuser",
      "sfPassword" : "somepassword",
      "sfDatabase" : "somedb",
      "sfSchema" : "someschema",
      "sfWarehouse" : "somedw",
      "sfRole" : "somerole",
    }

    SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

I get an error when I run this particular chunk of code.

    df = spark.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query", """
        select * from "PRED_ORDER_DEV"."SALES"."V_PosAnalysis" pos
        ORDER BY pos."SAPAccountNumber", pos."SAPMaterialNumber"
    """).load()

    Py4JJavaError: An error occurred while calling o115.load.
    : java.lang.ClassNotFoundException: Failed to find data source: net.snowflake.spark.snowflake.
      Please find packages at http://spark.apache.org/third-party-projects.html
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)

I have loaded the connector and JDBC jar files and added them to CLASSPATH:

    pyspark --packages net.snowflake:snowflake-jdbc:3.11.1,net.snowflake:spark-snowflake_2.11:2.5.7-spark_2.4
    CLASSPATH = C:\Program Files\Java\jre1.8.0_241\bin;C:\snowflake_jar

I want to be able to connect to Snowflake and read data with PySpark. Any help would be much appreciated!


2 Answers

0 votes

To run a PySpark application you can use spark-submit and pass the JARs with the --packages option. I'm assuming you'd like to run in client mode, so pass that to the --deploy-mode option, and finally add the name of your PySpark program.

Something like below:

    spark-submit --packages net.snowflake:snowflake-jdbc:3.11.1,net.snowflake:spark-snowflake_2.11:2.5.7-spark_2.4 --deploy-mode client spark-snowflake.py
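
If you would rather launch the script with a plain python interpreter instead of spark-submit, roughly the same effect can be achieved by handing the coordinates to Spark through the spark.jars.packages config when the session is created. A minimal sketch, assuming the same package versions as above:

    from pyspark.sql import SparkSession

    # Sketch: let Spark resolve the Snowflake JDBC driver and Spark connector
    # from Maven at session startup, instead of passing --packages on the
    # command line.
    spark = SparkSession.builder \
        .master("local") \
        .appName("sf_test") \
        .config("spark.jars.packages",
                "net.snowflake:snowflake-jdbc:3.11.1,"
                "net.snowflake:spark-snowflake_2.11:2.5.7-spark_2.4") \
        .getOrCreate()
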
0 votes

Below is a working script.

You should create a directory named jar in the root of your project and add two JARs:

  • snowflake-jdbc-3.13.4.jar (the JDBC driver)
  • spark-snowflake_2.12-2.9.0-spark_3.1.jar (the Spark connector)

Next you need to figure out your Scala compiler version. I'm using PyCharm, so double-press Shift and search for 'scala'. You will see something like scala-compiler-2.12.10.jar. The first digits of the scala-compiler version (in our case 2.12) should match the first digits of the Spark connector (spark-snowflake_2.12-2.9.0-spark_3.1.jar).

CHECK SCALA COMPILER VERSION BEFORE DOWNLOADING CONNECTOR
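
If you are not using PyCharm, another way to check (assuming Spark's command-line tools are on your PATH) is to ask Spark itself; the version banner includes the Scala version the build was compiled against:

    spark-submit --version
    # look for a line such as "Using Scala version 2.12.10" in the output

Pick the spark-snowflake connector whose _2.xx suffix matches that version.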

    from pyspark.sql import SparkSession

    # Snowflake connection options
    sfOptions = {
        "sfURL": "sfURL",
        "sfUser": "sfUser",
        "sfPassword": "sfPassword",
        "sfDatabase": "sfDatabase",
        "sfSchema": "sfSchema",
        "sfWarehouse": "sfWarehouse",
        "sfRole": "sfRole",
    }

    # Point spark.jars at the two JARs placed in the jar/ directory
    spark = SparkSession.builder \
        .master("local") \
        .appName("snowflake-test") \
        .config('spark.jars', 'jar/snowflake-jdbc-3.13.4.jar,jar/spark-snowflake_2.12-2.9.0-spark_3.1.jar') \
        .getOrCreate()

    SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

    df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
        .options(**sfOptions) \
        .option("query", "select * from some_table") \
        .load()

    df.show()
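
For completeness, writing a DataFrame back to Snowflake works through the same source name. A minimal sketch, where SOME_TABLE is a placeholder table name in the configured database and schema:

    # Sketch: append the DataFrame to an existing Snowflake table.
    df.write.format(SNOWFLAKE_SOURCE_NAME) \
        .options(**sfOptions) \
        .option("dbtable", "SOME_TABLE") \
        .mode("append") \
        .save()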