
I am just trying to learn PySpark, but I am confused about the difference between the following two RDDs. I know one holds tuples and the other holds strings, but both are RDDs:

rdd = sc.parallelize([('a', 1), ('b', 1), ('a', 3)])
type(rdd)

and

rdd = sc.parallelize(['a, 1', 'b, 1', 'a, 3'])
type(rdd)
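
Both calls return the same class, so type(rdd) reports pyspark.rdd.RDD for each; the difference is in the elements. A quick sketch to see it side by side (the names rdd1/rdd2 are just for comparison):

rdd1 = sc.parallelize([('a', 1), ('b', 1), ('a', 3)])
rdd2 = sc.parallelize(['a, 1', 'b, 1', 'a, 3'])
rdd1.first()   # ('a', 1) -- a tuple of (str, int)
rdd2.first()   # 'a, 1'   -- a single string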

Code for the map and reduce steps:

priceMap = s.map(lambda o: (o.split(",")[0], float(o.split(",")[1])))
priceMap.reduceByKey(add).take(10)

I can easily run the map/reduce on the second RDD's data, but when I try to map or reduce the first one I get the error below. How can I convert the first RDD into the form of the second, or is there another way to resolve this error? Thanks.

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 162.0 failed 1 times, most recent failure: Lost task 0.0 in stage 162.0 (TID 3850, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

The traceback suggests there is a Python exception. Please show your map/reduce code - the error probably stems from there. – mck
@mck Yes, I updated the two lines of code for the map and reduce. Could you please help me there? Thanks. – id101112
s and add are undefined variables. – mck
Put the rdd name instead of s, so it will be rdd.map(...), and add is a function you can import with: from operator import add. – id101112
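
Putting the comment fixes together, the question's snippet runs on the second (string) RDD like this (a sketch assuming the same sample data as above):

from operator import add

rdd = sc.parallelize(['a, 1', 'b, 1', 'a, 3'])
priceMap = rdd.map(lambda o: (o.split(",")[0], float(o.split(",")[1])))
priceMap.reduceByKey(add).take(10)   # [('a', 4.0), ('b', 1.0)] (order may vary)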

1 Answer


For the first RDD, you can replace the map function so it indexes the tuple directly:

from operator import add

rdd = sc.parallelize([('a', 1), ('b', 1), ('a', 3)])
rdd.map(lambda o: (o[0], float(o[1]))).reduceByKey(add).collect()

That's because split only works on strings, not on tuples.
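
If you instead want to go the other way and turn the first RDD into strings like the second one, a minimal sketch (the 'a, 1' string format is an assumption, chosen to match the second example):

from operator import add

rdd = sc.parallelize([('a', 1), ('b', 1), ('a', 3)])
# join each tuple into a 'key, value' string, mirroring the second RDD
string_rdd = rdd.map(lambda o: '{}, {}'.format(o[0], o[1]))
string_rdd.map(lambda o: (o.split(",")[0], float(o.split(",")[1]))).reduceByKey(add).collect()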