

I am trying to connect Spark to an Oracle database in PySpark, but I am getting a driver error. Could anyone please help me with that? I am new to Spark and have just started learning. Below is my code,

import pyspark

sc = pyspark.SparkContext('local[*]')
SqlContext = pyspark.SQLContext(sc)
Driver = 'C:\Hadoop\drivers\ojdbc14.jar'
OracleConnection = 'jdbc:oracle:thin:hr/hr@localhost:1521/xe'
Query = 'select * from employees'
OrcDb = SqlContext.read.format('jdbc') \
    .option('url', OracleConnection) \
    .option('dbtable', Query) \
    .option('driver', Driver) \
    .load()

OrcDb.printSchema()

Below is the error,

File "C:/Users/Macaulay/PycharmProjects/Spark/SparkSqlOracle.py", line 8, in <module>
    OrcDb = SqlContext.read.format('jdbc') \
  File "C:\Hadoop\Spark\spark-3.0.0-preview2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\readwriter.py", line 166, in load
  File "C:\Hadoop\Spark\spark-3.0.0-preview2-bin-hadoop2.7\python\lib\py4j-0.10.8.1-src.zip\py4j\java_gateway.py", line 1285, in __call__
  File "C:\Hadoop\Spark\spark-3.0.0-preview2-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py", line 98, in deco
  File "C:\Hadoop\Spark\spark-3.0.0-preview2-bin-hadoop2.7\python\lib\py4j-0.10.8.1-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o29.load.
: java.lang.ClassNotFoundException: C:\Hadoop\drivers\ojdbc14.jar
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:45)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:99)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:99)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$Lambda$729/1345147223.apply(Unknown Source)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:99)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:35)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:240)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:229)
    at org.apache.spark.sql.DataFrameReader$$Lambda$719/1893144191.apply(Unknown Source)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:229)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:179)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Unknown Source)

Comments:

- Is the driver jar in the classpath? – Salim
- Path of the driver is also provided in the code – Jim Macaulay

1 Answer


Found the issue. The JDBC driver jar has to be placed in Spark's jars directory, and instead of the path to the jar, the driver option has to be the driver's fully qualified class name. This resolved the ClassNotFoundException.
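As an alternative to copying the jar into Spark's jars directory, the jar can also be handed to Spark explicitly when submitting the job. This is a sketch assuming the jar path and script name from the question:

```shell
# Pass the driver jar explicitly instead of copying it into Spark's
# jars directory (jar path and script name taken from the question).
spark-submit --jars C:\Hadoop\drivers\ojdbc14.jar SparkSqlOracle.py
```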

Below is the code,

import pyspark
from pyspark.sql.session import SparkSession

sc = pyspark.SparkContext('local[*]')
spark = SparkSession(sc)

Driver = 'oracle.jdbc.driver.OracleDriver'        # fully qualified driver class name, not the jar path
OracleConnection = 'jdbc:oracle:thin:@//localhost:1521/xe'
User = 'hr'
Password = 'hr'
Query = '(select * from employees) emp'           # subquery alias required when passing a query as 'dbtable'

OrcDb = spark.read.format('jdbc') \
    .option('url', OracleConnection) \
    .option('dbtable', Query) \
    .option('user', User) \
    .option('password', Password) \
    .option('driver', Driver) \
    .load()

OrcDb.printSchema()
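One detail worth noting: Spark substitutes the 'dbtable' value into a SELECT ... FROM <dbtable> statement, so a raw query passed as 'dbtable' should be wrapped as a parenthesised subquery with an alias (newer Spark versions also offer a separate 'query' option that does this for you). A small hypothetical helper sketching the rewrite:

```python
def as_dbtable(query: str, alias: str = "subq") -> str:
    """Wrap a raw SQL query so it can be passed as the JDBC 'dbtable' option.

    Spark places the 'dbtable' value after FROM in its generated SQL, so a
    raw query must become a parenthesised subquery with an alias.
    """
    return f"({query.rstrip(';')}) {alias}"

print(as_dbtable("select * from employees"))
# → (select * from employees) subq
```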