
I have a Hive Parquet table (an external table on top of S3) that contains 6,000 partitions. For data exploration we want to view some sample data, say 1/2/10 records, without performing any heavy transformation or action.

Is there a way to restrict the read to only one partition and limit/show n records, instead of going through all 6,000 partitions? (If the cluster is small, it takes a huge amount of time just to print 10 rows.) I thought about mapPartitionsWithIndex, but it still goes through all partitions:

def mpwi(index: Int, iter: Iterator[Row]): Iterator[Row] = {
  // Keep the rows only for partition 1; return an empty iterator otherwise.
  if (index == 1) iter
  else Iterator.empty
}
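The index-filtering idea above can be sketched in plain Scala (no Spark needed) to see why it does not avoid the full scan: every partition still has to be visited just to check its index. The object and the simulated partitions below are illustrative stand-ins, not Spark APIs:

```scala
// A plain-Scala sketch of index-based partition filtering. "partitions" is
// just a Vector of sequences standing in for RDD partitions; a real RDD
// would still schedule a task per partition, which is why
// mapPartitionsWithIndex cannot skip work for the other ~6k partitions.
object PartitionFilterSketch {
  // Keep the iterator only for the wanted partition index; drop the rest.
  def keepPartition[T](wanted: Int)(index: Int, iter: Iterator[T]): Iterator[T] =
    if (index == wanted) iter else Iterator.empty

  // Simulated partitions: three "partitions" of two records each.
  val partitions: Vector[Seq[Int]] = Vector(Seq(1, 2), Seq(3, 4), Seq(5, 6))

  // Walk every partition, but only emit the contents of partition 1.
  def keptElements: List[Int] =
    partitions.iterator.zipWithIndex
      .flatMap { case (part, idx) => keepPartition(1)(idx, part.iterator) }
      .toList
}
```

Note that the zipWithIndex pass still touches all three simulated partitions, even though only one contributes output.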
you can use the sample() method of an RDD object. – Amin Heydari Alashti
let me try and let you know. – SelvamR

1 Answer


You should try limit. For example:

val df = spark.sql("select * from your_table")
df.limit(10).show // Retrieves only 10 rows

This should be more performant than loading the full table. If you are not getting the expected performance boost, please paste the logical/physical query plan here so that we can analyse it. You can obtain it with df.explain(true).
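If the underlying goal is to touch only one of the 6k partitions, a filter on the table's partition column lets Spark prune the rest at planning time, before `limit` is applied. This is a sketch assuming a hypothetical partition column named `dt` and a running `SparkSession`; substitute your table's real partition key and value:

```scala
// Assumes "dt" is the table's partition column (a placeholder name).
// Because dt is a partition column, Catalyst can prune the other
// partitions at planning time instead of scanning them.
val df = spark.sql("select * from your_table where dt = '2020-01-01'")
df.limit(10).show()

// To confirm pruning happened, inspect the physical plan: the scan node
// should show a partition filter on dt.
df.explain(true)
```

If the filter does not appear as a partition filter in the plan, the column may not actually be a partition key of the table.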