4 votes

I created rdd = sc.parallelize(range(200)) and then set rdd2 = rdd.cartesian(rdd). As expected, rdd2.count() was 40,000 (200 × 200). However, when I set rdd3 = rdd2.cartesian(rdd), rdd3.count() was less than 20,000 rather than the expected 8,000,000 (200³). Why is this the case?
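For reference, here is a minimal sketch reproducing the steps above. It assumes a local PySpark installation; in the pyspark shell, sc already exists and the SparkContext setup can be skipped:

```python
from pyspark import SparkContext

# Create a local context (the pyspark shell provides sc already).
sc = SparkContext("local[*]", "cartesian-repro")

rdd = sc.parallelize(range(200))

rdd2 = rdd.cartesian(rdd)
print(rdd2.count())   # 40000, as expected (200 * 200)

rdd3 = rdd2.cartesian(rdd)
print(rdd3.count())   # expected 8000000 (200 ** 3); affected PySpark
                      # versions report far fewer records
```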

This is quite strange indeed. I just tried the same sequence of operations in Scala and it resulted in an RDD with 8M items. In PySpark, for me, rdd3.count() resulted in 3200. Maybe it has something to do with the number of partitions? - Daniel de Paula
More likely it is the way it is implemented: cartesian does some ugly serde tricks to reuse Java code. I am pretty sure you can open a JIRA for that. - zero323
Even on Databricks cloud I see the same issue: databricks-prod-cloudfront.cloud.databricks.com/public/… Did someone open a JIRA for that issue? (screenshot: i.stack.imgur.com/vYoiH.png) - kmader

1 Answer

1 vote

This is a known PySpark bug tracked as SPARK-16589 ("Chained cartesian produces incorrect number of records").
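Until you are on a release containing the fix, one workaround sometimes suggested in the JIRA discussion is to break the cartesian chain with an identity map, so the intermediate RDD is re-serialized through the ordinary serializer instead of the nested cartesian deserialization path. This is a sketch under that assumption, not a verified fix:

```python
# Assumes a running PySpark session where sc is the SparkContext.
rdd = sc.parallelize(range(200))

# Hypothetical workaround: the identity map() between the two cartesian()
# calls forces the intermediate RDD through a normal serialization boundary.
rdd3 = rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd)
print(rdd3.count())  # should print 8000000 (200 ** 3) if the workaround applies
```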