Update: This has been resolved. It was a typo in the URL.
--
I'm trying to read data from Netezza using pyspark on Windows 10 1909.
I can read from it using DbVisualizer with no problem. I then started pyspark on the same machine, over the same VPN connection, with the same JDBC driver jar:
pyspark --driver-class-path <path to nzjdbc.jar> --jars <path to nzjdbc.jar> --master local[*]
I used this code from the pyspark shell:
dataframe = spark.read.format("jdbc").options(
url="jdbc:netezza://<server>:5480/<database>",
dbtable="ADMIN.<table>",
user="***",
password="***",
driver="org.netezza.Driver",
).load()
This fails after about 10-20 seconds with the following stack trace (I also tried adding queryTimeout="300", but that made no difference):
"...\AppData\Local\Continuum\miniconda3\envs\spark\lib\site-packages\pyspark\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o41.load.
: org.netezza.error.NzSQLException: Connection timed out: connect
at org.netezza.sql.NzConnection.initSocket(NzConnection.java:2859)
at org.netezza.sql.NzConnection.open(NzConnection.java:293)
at org.netezza.datasource.NzDatasource.getConnection(NzDatasource.java:675)
at org.netezza.datasource.NzDatasource.getConnection(NzDatasource.java:662)
at org.netezza.Driver.connect(Driver.java:155)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:64)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:203)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)
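The trace bottoms out in NzConnection.initSocket with "Connection timed out: connect", which suggests the TCP connection itself never opened, before any login or query happened. Below is a minimal sketch for checking that from the same machine without Spark or the JDBC driver, using only Python's standard library. The function name can_connect and the 5-second timeout are illustrative choices, and <server> is the same placeholder as in the URL above:

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name-resolution failures.
        return False


# Example (replace the placeholder with the host from the JDBC URL):
# print(can_connect("<server>", 5480))
```

If this returns False for the host and port taken from the URL, the problem is more likely in the host name, the port, or the network path than in the pyspark options.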
A coworker is able to run the same code from his Mac with no issues (also on VPN).
Is there something in Windows or in Netezza itself that could affect which clients are able to connect to Netezza? Or could I be missing something in the pyspark command?
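Since the cause turned out to be a typo in the URL (see the update at the top), here is a small sketch that prints the pieces of a JDBC URL so each one can be compared against the connection settings that work in DbVisualizer. split_jdbc_url is a hypothetical helper, not part of pyspark or the Netezza driver, and nz.example.com and SALES are made-up values:

```python
from urllib.parse import urlsplit


def split_jdbc_url(url):
    """Split a jdbc:<subprotocol>://host:port/database URL into its parts."""
    prefix = "jdbc:"
    if not url.startswith(prefix):
        raise ValueError("not a JDBC URL: " + url)
    # Drop the "jdbc:" prefix so the rest parses like an ordinary URL.
    parts = urlsplit(url[len(prefix):])
    return {
        "subprotocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,  # raises ValueError if the port is not a valid number
        "database": parts.path.lstrip("/"),
    }


# Made-up example values, for illustration only.
print(split_jdbc_url("jdbc:netezza://nz.example.com:5480/SALES"))
```

Seeing the host, port, and database as separate values makes a misspelled server name or a wrong port easier to spot than reading the URL as one string.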