
I am new to Spark/MongoDB and I am trying to use mongo-spark-connector to connect to MongoDB from pyspark, following the instructions here. I start pyspark with the command:

`pyspark \
--conf 'spark.mongodb.input.uri=mongodb://127.0.0.1/mydb.mytable?readPreference=primaryPreferred' \ 
--conf 'spark.mongodb.output.uri=mongodb://127.0.0.1/mydb.mytable' \ 
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1`

Which gives the following on startup:

`SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/spark-2.4.4-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Ivy Default Cache set to: /home/user_name/.ivy2/cache
The jars for the packages stored in: /home/user_name/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/spark-2.4.4-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.mongodb.spark#mongo-spark-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-18ec2360-9f44-414c-a1de-11f629819aec;1.0
    confs: [default]
    found org.mongodb.spark#mongo-spark-connector_2.11;2.4.1 in central
    found org.mongodb#mongo-java-driver;3.10.2 in central
    [3.10.2] org.mongodb#mongo-java-driver;[3.10,3.11)
:: resolution report :: resolve 1360ms :: artifacts dl 3ms
    :: modules in use:
    org.mongodb#mongo-java-driver;3.10.2 from central in [default]
    org.mongodb.spark#mongo-spark-connector_2.11;2.4.1 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   1   |   0   |   0   ||   2   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-18ec2360-9f44-414c-a1de-11f629819aec
    confs: [default]
    0 artifacts copied, 2 already retrieved (0kB/4ms)
20/01/24 00:21:29 WARN Utils: Your hostname, user_name-Machine resolves to a loopback address: 127.0.1.1; using 192.168.1.18 instead (on interface wlan0)
20/01/24 00:21:29 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/01/24 00:21:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".`

And I get the following error when I run `df = spark.read.format("mongo").load()` in the pyspark shell:

`Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 172, in load
 return self._df(self._jreader.load())
 File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
 File "/usr/local/spark/python/pyspark/sql/utils.py", line 63, in deco
 return f(*a, **kw)
 File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o39.load.
: java.lang.NoSuchMethodError: com.mongodb.MongoClient.<init>(Lcom/mongodb/MongoClientURI;Lcom/mongodb/MongoDriverInformation;)V
    at com.mongodb.spark.connection.DefaultMongoClientFactory.create(DefaultMongoClientFactory.scala:49)
    at com.mongodb.spark.connection.MongoClientCache.acquire(MongoClientCache.scala:55)
    at com.mongodb.spark.MongoConnector.acquireClient(MongoConnector.scala:242)
    at com.mongodb.spark.MongoConnector.withMongoClientDo(MongoConnector.scala:155)
    at com.mongodb.spark.MongoConnector.withDatabaseDo(MongoConnector.scala:174)
    at com.mongodb.spark.MongoConnector.hasSampleAggregateOperator(MongoConnector.scala:237)
    at com.mongodb.spark.rdd.MongoRDD.hasSampleAggregateOperator$lzycompute(MongoRDD.scala:221)
    at com.mongodb.spark.rdd.MongoRDD.hasSampleAggregateOperator(MongoRDD.scala:221)
    at com.mongodb.spark.sql.MongoInferSchema$.apply(MongoInferSchema.scala:68)
    at com.mongodb.spark.sql.DefaultSource.constructRelation(DefaultSource.scala:97)
    at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:50)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)`

Specs:

OS: Ubuntu 18.04

java: openjdk 8

spark: 2.4.4

mongo: 4.2.2

scala: 2.11.12

mongo java driver: 3.12

I have tried using Oracle Java 8 and switching the mongo driver to 3.10.2.


1 Answer


The first message is due to a conflicting slf4j logger dependency. The Mongo Spark connector jar lists slf4j as a dependency (see the Maven package info), but only as a provided dependency, so Spark uses whatever is already on the system. That binding jar happens to be installed twice on your machine, once from the Spark distribution and once from Hadoop, and Spark simply picks the first one it finds. This is only a warning.
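If you want to confirm where the duplicate binding comes from, you can ask the JVM from inside the pyspark shell. This is just a minimal sketch, assuming the shell from the question where `spark` is already defined; it lists every jar providing the SLF4J binding class named in the warning.

# Run inside the pyspark shell: list the jars that contain the SLF4J binding
# class mentioned in the "multiple bindings" warning.
jvm = spark.sparkContext._jvm
loader = jvm.java.lang.Thread.currentThread().getContextClassLoader()
urls = loader.getResources("org/slf4j/impl/StaticLoggerBinder.class")
while urls.hasMoreElements():
    print(urls.nextElement().toString())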

Normally one could exclude packages with

--exclude-packages Comma-separated list of groupId:artifactId, to exclude while resolving the dependencies provided in --packages to avoid dependency conflicts.

e.g.

--exclude-packages org.slf4j:slf4j-api

However, I do not think this is the issue here.

The second error says that a MongoClient constructor with that signature does not exist. MongoClient comes from the Java driver, which is a dependency of the Mongo Spark connector. Either the right driver version was not loaded at all, or you are somehow passing the conf options incorrectly, so the MongoClient constructor ends up being called with the wrong arguments (wrong number or wrong types).
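If you suspect a wrong driver jar is being picked up, you can check from within the failing pyspark session which jars Spark registered and which jar the MongoClient class was actually loaded from. A minimal sketch, assuming client mode (the classloader lookup may need adjusting otherwise):

# List the jars Spark registered for this session; it should include the two
# jars resolved by --packages.
print(spark.sparkContext._jsc.sc().listJars().toString())

# Check which jar provided com.mongodb.MongoClient.
loader = spark.sparkContext._jvm.java.lang.Thread.currentThread().getContextClassLoader()
mongo_client = loader.loadClass("com.mongodb.MongoClient")
print(mongo_client.getProtectionDomain().getCodeSource().getLocation().toString())

The location should end in mongo-java-driver-3.10.2.jar; anything else would point at a stale or mismatched driver jar shadowing the one resolved by --packages.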

I see you use different quoting and backticks around the command. You also write that you have tried to install a Java mongo driver; have you placed a jar somewhere on the classpath? This is not needed. The --packages argument resolves dependencies from Maven: mongo-spark-connector depends on mongo-java-driver and resolves it for you (see the Maven info and source). That dependency is included, in contrast to the provided slf4j.

Try pasting the exact command below into your shell. Do not install the mongo Java driver manually.

pyspark \
--conf "spark.mongodb.input.uri=mongodb://127.0.0.1/mydb.mytable?readPreference=primaryPreferred" \
--conf "spark.mongodb.output.uri=mongodb://127.0.0.1/mydb.mytable" \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1

When I run this command, two jars are installed automatically under ~/.ivy2/cache:

org.mongodb.spark_mongo-spark-connector_2.11-2.4.1.jar
org.mongodb_mongo-java-driver-3.10.2.jar

No conflicting slf4j is installed, and the jars do not contain any dependent code from other packages; you can inspect the classes with `unzip -l <jar-file-name>.jar`.
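If the command above resolves cleanly, the read from the question should then work. For example (the database and collection come from the spark.mongodb.input.uri conf set on the command line; the second read shows passing the uri on the reader instead):

# Read using the input uri configured at startup.
df = spark.read.format("mongo").load()
df.printSchema()
df.show(5)

# Equivalent, overriding the input uri per read.
df2 = spark.read.format("mongo").option("uri", "mongodb://127.0.0.1/mydb.mytable").load()
df2.show(5)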