
I am using PySpark to read streaming data from Kafka, and I want to sink that data to MongoDB. I have included all the required packages, but it throws the following error:
UnsupportedOperationException: Data source com.mongodb.spark.sql.DefaultSource does not support streamed writing

The following links are not related to my question:

Writing to mongoDB from Spark

Spark to MongoDB via Mesos

Here is the full error stack trace:

Traceback (most recent call last):
  File "/home/b3ds/kafka-spark.py", line 85, in <module>
    .option("com.mongodb.spark.sql.DefaultSource","mongodb://localhost:27017/twitter.test")\
  File "/home/b3ds/hdp/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 827, in start
  File "/home/b3ds/hdp/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/home/b3ds/hdp/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/home/b3ds/hdp/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o122.start.
: java.lang.UnsupportedOperationException: Data source com.mongodb.spark.sql.DefaultSource does not support streamed writing
    at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:287)
    at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:272)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

Here is my PySpark code:

from __future__ import print_function
import sys
from pyspark.sql.functions import udf
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql.types import StructType
from pyspark.sql.types import *
import json
from pyspark.sql.functions import struct
from pyspark.sql.functions import *
import datetime

json_schema = StructType([
  StructField("twitterid", StringType(), True),
  StructField("created_at", StringType(), True),
  StructField("tweet", StringType(), True),
  StructField("screen_name", StringType(), True)
])

def parse_json(df):
    twitterid   = json.loads(df[0])['id']
    created_at  = json.loads(df[0])['created_at']
    tweet       = json.loads(df[0])['text']
    screen_name = json.loads(df[0])['user']['screen_name']
    return [twitterid, created_at, tweet, screen_name]

def convert_twitter_date(timestamp_str):
    output_ts = datetime.datetime.strptime(timestamp_str.replace('+0000 ',''), '%a %b %d %H:%M:%S %Y')
    return output_ts

if __name__ == "__main__":

        spark = SparkSession\
                        .builder\
                        .appName("StructuredNetworkWordCount")\
                        .config("spark.mongodb.input.uri","mongodb://192.168.1.16:27017/twitter.test")\
                        .config("spark.mongodb.output.uri","mongodb://192.168.1.16:27017/twitter.test")\
                        .getOrCreate()
        events = spark\
                        .readStream\
                        .format("kafka")\
                        .option("kafka.bootstrap.servers", "localhost:9092")\
                        .option("subscribe", "twitter")\
                        .load()
        events = events.selectExpr("CAST(value as String)")

        udf_parse_json = udf(parse_json , json_schema)
        udf_convert_twitter_date = udf(convert_twitter_date, TimestampType())
        jsonoutput = events.withColumn("parsed_field", udf_parse_json(struct([events[x] for x in events.columns]))) \
                                        .where(col("parsed_field").isNotNull()) \
                                        .withColumn("created_at", col("parsed_field.created_at")) \
                                        .withColumn("screen_name", col("parsed_field.screen_name")) \
                                        .withColumn("tweet", col("parsed_field.tweet")) \
                                        .withColumn("created_at_ts", udf_convert_twitter_date(col("parsed_field.created_at")))

        windowedCounts = jsonoutput.groupBy(window(jsonoutput.created_at_ts, "1 minutes", "15 seconds"), jsonoutput.screen_name)

        mongooutput = jsonoutput \
                        .writeStream \
                        .format("com.mongodb.spark.sql.DefaultSource")\
                        .option("com.mongodb.spark.sql.DefaultSource","mongodb://localhost:27017/twitter.test")\
                        .start()
        mongooutput.awaitTermination()

I have seen the MongoDB documentation, which says it supports a Spark-to-MongoDB sink:

https://docs.mongodb.com/spark-connector/master/scala/streaming/

Did you get it working? I have a similar problem. – ThReSholD
Hi, did you solve the problem? – Induraj PR

1 Answer


I have seen the MongoDB documentation, which says it supports a Spark-to-MongoDB sink

What the documentation claims is that you can use the standard RDD API to write each RDD with the legacy Streaming (DStream) API.
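In PySpark terms, that approach would look roughly like the sketch below: consume the topic with the old DStream Kafka API, turn each micro-batch RDD into a DataFrame, and save it with the connector's batch writer inside foreachRDD. The broker address, topic name and MongoDB URI are copied from your snippet; everything else (including passing the URI through a per-write uri option instead of spark.mongodb.output.uri) is an untested outline, not code taken verbatim from the MongoDB docs.

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaToMongoDStream")
spark = SparkSession(sc)
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Old-style direct stream; each record is a (key, value) tuple
kafka_stream = KafkaUtils.createDirectStream(
    ssc, ["twitter"], {"metadata.broker.list": "localhost:9092"})

def write_rdd_to_mongo(rdd):
    # Turn the JSON strings of this micro-batch into a DataFrame
    # and write it with the connector's (batch) DataFrame writer.
    if not rdd.isEmpty():
        df = spark.read.json(rdd.map(lambda kv: kv[1]))
        df.write \
          .format("com.mongodb.spark.sql.DefaultSource") \
          .mode("append") \
          .option("uri", "mongodb://localhost:27017/twitter.test") \
          .save()

kafka_stream.foreachRDD(write_rdd_to_mongo)
ssc.start()
ssc.awaitTermination()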

It doesn't suggest that MongoDB supports Structured Streaming, and it doesn't. Since you are using PySpark, where the foreach writer is not accessible, you'll have to wait until (if ever) the MongoDB package is updated to support streaming writes.