I have a file in HDFS that is distributed across the nodes of the cluster.
I'm trying to get a random sample of 10 lines from this file.
In the PySpark shell, I read the file into an RDD using:
>>> textFile = sc.textFile("/user/data/myfiles/*")
Then I want to simply take a sample. The nice thing about Spark is that there are commands like takeSample, but unfortunately I think I'm doing something wrong, because the following takes a really long time:
>>> textFile.takeSample(False, 10, 12345)
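To be concrete, this is roughly how I'm judging "a really long time" in the shell (the timing wrapper is just for illustration, it's not part of what I actually need):
>>> import time
>>> start = time.time()
>>> textFile.takeSample(False, 10, 12345)
>>> print("takeSample took %.1f seconds" % (time.time() - start))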
So I tried creating a partition on each node, and then instructing each node to sample its partition, with the following command:
>>> textFile.partitionBy(4).mapPartitions(lambda blockOfLines: blockOfLines.takeSample(False, 10, 1234)).first()
But this fails with ValueError: too many values to unpack:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/serializers.py", line 117, in dump_stream
    for obj in iterator:
  File "/opt/cloudera/parcels/CDH-5.0.2-1.cdh5.0.2.p0.13/lib/spark/python/pyspark/rdd.py", line 821, in add_shuffle_key
    for (k, v) in iterator:
ValueError: too many values to unpack
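For clarity, this is a rough sketch of what I was trying to express with the per-partition approach (the function name sample_partition is just a placeholder of mine, it materialises each partition in memory, and I've dropped the repartitioning step; I'm not claiming this is the right way to do it):

import random

def sample_partition(lines):
    # Take up to 10 lines from this partition with plain Python sampling.
    lines = list(lines)
    return random.sample(lines, min(10, len(lines)))

# Collect a few candidates per partition, then thin them down to 10 on the driver.
candidates = textFile.mapPartitions(sample_partition).collect()
final_sample = random.sample(candidates, min(10, len(candidates)))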
How can I sample 10 lines from a large distributed dataset using Spark or PySpark?