PySpark - Join two RDDs - Cannot join - Too many values to unpack

Question

I've two files in HDFS (very simple):

test:

1,Team1
2,Team2
3,Team3

test2:

11,Player1,Team1
22,Player1,Team2
32,Player1,Team3

and I want join them (by Team* column) to get the following output:

Team1,1,11,Player1
Team3,3,32,Player1

For that, I am using the following code:

test = sc.textFile("/user/cloudera/Tests/test")
test_filter = test.filter(lambda a: a.split(",")[1].upper() == "TEAM1" or a.split(",")[1].upper() == "TEAM2")
test_map = test_filter.map(lambda a: a.upper())
test_map = test_map.map(lambda a: (a.split(",")[1]))
for i in test_map.collect(): print(i)

test2=sc.textFile("/user/cloudera/Tests/test2")
test2_map = test2.map(lambda a: a.upper())
test2_map = test2_map.map(lambda a: (a.split(",")[2], a.split(",")[1]))
for i in test2_map.collect(): print(i)

test_join = test_map.join(test2_map)
for i in test_join.collect(): print(i)

But when I try to see the join RDD i am getting the following error:

  File "/usr/lib/spark/python/pyspark/rdd.py", line 1807, in <lambda>
    map_values_fn = lambda (k, v): (k, f(v))
ValueError: too many values to unpack

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)

What I am doing wrong?

Thanks!

Shiva Shiva · Accepted Answer · 2020-10-30T06:16:55

Can you display the result set of these two statements: for i in test_map.collect(): print(i) & for i in test2_map.collect(): print(i)

Also Can you try below:

   test = sc.textFile("/user/cloudera/Tests/test")
   test_map = test.map(lambda a:a.upper())
   test_map = test_map.map(lambda a: (a.split(",")[1],a.split(",")[0]))
   for i in test_map.collect(): print(i)

   test2=sc.textFile("/user/cloudera/Tests/test2")
   test2_map = test2.map(lambda a: a.upper())
   test2_map = test2_map.map(lambda a: (a.split(",")[2], a.split(",")[1]))
   for i in test2_map.collect(): print(i)

   test_join = test_map.join(test2_map)
   for i in test_join.collect(): print(i)

PySpark - Join two RDDs - Cannot join - Too many values to unpack

1 Answers