
I have a Hive Parquet table (an external table on top of S3) that contains 6,000 partitions. For data exploration we want to view some sample data, say 1/2/10 records, without performing any heavy transformation or action.

Is there a way to restrict the read to only one partition and limit/show n records, instead of going through all 6,000 partitions? (If the cluster is small, it takes a huge amount of time just to print 10 rows.) I thought about mapPartitionsWithIndex, but it still goes through all partitions:

def mpwi(index: Int, iter: Iterator[Row]): Iterator[Row] = {
  // Keep the rows only for partition 1; return an empty iterator otherwise.
  if (index == 1) iter
  else Iterator.empty
}
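The index-filtering idea above can be sketched in plain Scala (no Spark needed) to see why it does not avoid the full scan: every partition still has to be visited just to check its index. The object and the simulated partitions below are illustrative stand-ins, not Spark APIs:

```scala
// A plain-Scala sketch of index-based partition filtering. "partitions" is
// just a Vector of sequences standing in for RDD partitions; a real RDD
// would still schedule a task per partition, which is why
// mapPartitionsWithIndex cannot skip work for the other ~6k partitions.
object PartitionFilterSketch {
  // Keep the iterator only for the wanted partition index; drop the rest.
  def keepPartition[T](wanted: Int)(index: Int, iter: Iterator[T]): Iterator[T] =
    if (index == wanted) iter else Iterator.empty

  // Simulated partitions: three "partitions" of two records each.
  val partitions: Vector[Seq[Int]] = Vector(Seq(1, 2), Seq(3, 4), Seq(5, 6))

  // Walk every partition, but only emit the contents of partition 1.
  def keptElements: List[Int] =
    partitions.iterator.zipWithIndex
      .flatMap { case (part, idx) => keepPartition(1)(idx, part.iterator) }
      .toList
}
```

Note that the zipWithIndex pass still touches all three simulated partitions, even though only one contributes output.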
you can use the sample() method of an RDD object. – Amin Heydari Alashti
let me try and let you know. – SelvamR

1 Answer


You should try limit. For example:

val df = spark.sql("select * from your_table")
df.limit(10).show // Retrieves only 10 rows

This should be more performant than loading the full table. If you are not getting the expected performance boost, please paste the logical/physical query plan here so that we can analyse it. You can obtain it with df.explain(true).
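If the underlying goal is to touch only one of the 6k partitions, a filter on the table's partition column lets Spark prune the rest at planning time, before `limit` is applied. This is a sketch assuming a hypothetical partition column named `dt` and a running `SparkSession`; substitute your table's real partition key and value:

```scala
// Assumes "dt" is the table's partition column (a placeholder name).
// Because dt is a partition column, Catalyst can prune the other
// partitions at planning time instead of scanning them.
val df = spark.sql("select * from your_table where dt = '2020-01-01'")
df.limit(10).show()

// To confirm pruning happened, inspect the physical plan: the scan node
// should show a partition filter on dt.
df.explain(true)
```

If the filter does not appear as a partition filter in the plan, the column may not actually be a partition key of the table.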