
If I create two RDDs like these:

a = sc.parallelize([[1 for j in range(3)] for i in xrange(10**9)])

b = sc.parallelize([[1 for j in xrange(10**9)] for i in range(3)])

Partitioning the first one is intuitive: a billion rows are partitioned across the workers. But the second one has only 3 rows, and each row contains a billion items.

My question is: for the second RDD, if I have 2 workers, does one row go to one worker and the other two rows go to the other worker?


1 Answer


Data distribution in Spark is limited to the top-level sequence you use to create an RDD; the individual elements of that sequence are never split across partitions.

Depending on the configuration, in the second case you'll get at most three non-empty partitions, each assigned to a single worker, so with two workers a 1-2 split is a likely outcome.
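
You can check this yourself with glom, which collects the contents of each partition into a list. A minimal sketch with much smaller sizes, assuming a SparkContext named sc as in the question (the exact split may vary with your setup):

# Mimic the second RDD with small sizes and force 2 partitions,
# roughly corresponding to the two workers in the question.
small = sc.parallelize([[1 for j in range(5)] for i in range(3)], 2)

print(small.getNumPartitions())         # number of partitions, e.g. 2
print(small.glom().map(len).collect())  # rows per partition, e.g. [1, 2]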

Generally speaking, a small number of elements, each of them very large, doesn't fit well into the Spark processing model.
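
If you control how the data is built, one possible workaround (a rough sketch, not part of the original answer, assuming the values can be computed from their indices) is to generate the records distributedly so that the large dimension becomes the top-level sequence Spark parallelizes:

# Sizes reduced here for illustration.
n_rows, n_cols = 3, 10**6

# Parallelize the flat index space, then expand each index into a
# (row, col, value) record, so the distributed sequence is large.
flat = (sc.parallelize(range(n_rows * n_cols), numSlices=100)
          .map(lambda k: (k // n_cols, k % n_cols, 1)))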