10
votes

I am relatively new to Apache Spark and Python and was wondering how to get the size of an RDD. I have an RDD that looks like this:

[['ID: 6993.1066',
  'Time: 15:53:43',
  'Lab: West',
  'Lab-Tech: Nancy McNabb, ',
  '\tBob Jones, Harry Lim, ',
  '\tSue Smith, Will Smith, ',
  '\tTerry Smith, Nandini Chandra, ',
  ]]

Is there a method or function in PySpark that can give the size, i.e. how many elements are in an RDD? The one above has 7.

Scala has something like myRDD.length.

1
I think you can use the len(RDD values) function for this. - HuntsMan
Can't you just do rdd.count()? - Ramesh Maharjan

1 Answer

9
votes

For the size of each individual element in the RDD, this appears to be the way:

>>> rdd = sc.parallelize([(1,2,'the'),(5,2,5),(1,1,'apple')])
>>> rdd.map(lambda x: len(x)).collect()
[3, 3, 3]

For the overall element count within the RDD:

>>> rdd.count()
3
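
One caveat for the RDD in the question: it is a list containing a single inner list of 7 strings, so count() would return 1 (the one top-level element), not 7. A minimal sketch of the options, assuming a live SparkContext named sc as above:

>>> data = [['ID: 6993.1066',
...          'Time: 15:53:43',
...          'Lab: West',
...          'Lab-Tech: Nancy McNabb, ',
...          '\tBob Jones, Harry Lim, ',
...          '\tSue Smith, Will Smith, ',
...          '\tTerry Smith, Nandini Chandra, ']]
>>> rdd = sc.parallelize(data)
>>> rdd.count()                       # one top-level element: the inner list
1
>>> rdd.map(len).collect()            # length of each inner list
[7]
>>> rdd.flatMap(lambda x: x).count()  # flatten first, then count the strings
7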