0
votes

I use pyspark streaming with enabled checkpoints. The first launch is successful but when restart crashes with the error:

INFO scheduler.DAGScheduler: ResultStage 6 (runJob at PythonRDD.scala:441) failed in 1,160 s due to Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 86, h-1.e-contenta.com, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File"/data1/yarn/nm/usercache/appcache/application_1481115309392_0229/container_1481115309392_0229_01_000003/pyspark.zip/pyspark/worker.py", line 163, in main func, profiler, deserializer, serializer = read_command(pickleSer, infile) File"/data1/yarn/nm/usercache/appcache/application_1481115309392_0229/container_1481115309392_0229_01_000003/pyspark.zip/pyspark/worker.py", line 56, in read_command command = serializer.loads(command.value) File"/data1/yarn/nm/usercache/appcache/application_1481115309392_0229/container_1481115309392_0229_01_000003/pyspark.zip/pyspark/serializers.py", line 431, in loads return pickle.loads(obj, encoding=encoding) ImportError: No module named ...

Python modules added via spark context addPyFile()

def create_streaming():
"""
Create streaming context and processing functions
:return: StreamingContext
"""
sc = SparkContext(conf=spark_config)
zip_path = zip_lib(PACKAGES, PY_FILES)
sc.addPyFile(zip_path)
ssc = StreamingContext(sc, BATCH_DURATION)

stream = KafkaUtils.createStream(ssc=ssc, zkQuorum=','.join(ZOOKEEPER_QUORUM),
                                      groupId='new_group',
                                      topics={topic: 1})

stream.checkpoint(BATCH_DURATION)
stream = stream \
    .map(lambda x: process(ujson.loads(x[1]), geo_data_bc_value)) \
    .foreachRDD(lambda_log_writer(topic, schema_bc_value))

ssc.checkpoint(STREAM_CHECKPOINT)
return ssc

if __name__ == '__main__':
ssc = StreamingContext.getOrCreate(STREAM_CHECKPOINT, lambda: create_streaming())
ssc.start()
ssc.awaitTermination()
1
where do you set ssc.addPyFile? In ssc.getOrCreate or after ssc.getOrCreate?Zhang Tong
In a method that returns the streaming context:sutugin
try to set additional ssc.addPyFile after ssc = StreamingContext.getOrCreateZhang Tong
Only SparkContext has a method addPyFile, StreamingContext does not have it. Added code examplesutugin

1 Answers

0
votes

Sorry it is my mistake.

Try this :

if __name__ == '__main__':
    ssc = StreamingContext.getOrCreate('', None)
    ssc.sparkContext.addPyFile()

    ssc.start()
    ssc.awaitTermination()