I created `rdd = sc.parallelize(range(200))`. Then I set `rdd2 = rdd.cartesian(rdd)`. I found that, as expected, `rdd2.count()` was 40,000. However, when I set `rdd3 = rdd2.cartesian(rdd)`, `rdd3.count()` was less than 20,000. Why is this the case?
This is quite strange indeed. I just tried the same sequence of operations in Scala and it resulted in an RDD with 8M items. In PySpark, for me, `rdd3.count()` resulted in 3200. Maybe it has something to do with the number of partitions?
- Daniel de Paula
More likely it's the way it is implemented. `cartesian` does some ugly serde tricks to reuse Java code. I am pretty sure you could open a JIRA for that.
- zero323
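For reference, the counts the question expects follow directly from the semantics of `cartesian`: the result size is the product of the two input sizes, and chaining it again multiplies by the input size once more. Below is a plain-Python model of that behavior (using `itertools.product` as a stand-in for the RDD operation, not actual Spark code) showing what the correct counts should be:

```python
from itertools import product

# Reference model of RDD.cartesian: every (x, y) pair, so counts multiply.
data = range(200)
pairs = list(product(data, data))          # analogue of rdd.cartesian(rdd)
assert len(pairs) == 200 * 200             # 40_000, matching rdd2.count()

# Chaining produces ((a, b), c) triples; the correct count is 200**3.
triple_count = sum(1 for _ in product(pairs, data))
print(triple_count)                        # 8_000_000
```

This matches the 8M items seen in Scala, so the 3200 reported by PySpark is a correctness bug rather than a semantics question.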