2 votes

I have a small parquet file (7.67 MB) in HDFS, compressed with snappy. The file has 1300 rows and 10500 columns, all double values. When I create a data frame from the parquet file and perform a simple operation like count, it takes 18 seconds.

scala> val df = spark.read.format("parquet").load("/path/to/parquet/file")
df: org.apache.spark.sql.DataFrame = [column0_0: double, column1_1: double ... 10498 more fields]

scala> df.registerTempTable("table")

scala> spark.time(sql("select count(1) from table").show)
+--------+
|count(1)|
+--------+
|    1300|
+--------+

Time taken: 18402 ms

Can anything be done to improve query performance on such wide files?


3 Answers

3 votes

Hey, glad you are here in the community,

Count and show are actions, not lazy transformations: they force Spark to evaluate every record, so they are always costly, and on a frame this wide that cost adds up. If you only need the result, write it back to a file or database instead of displaying it. To inspect the structure you can use df.printSchema(), and a simple way to check whether a DataFrame has any rows is Try(df.head): if it is a Success, there is at least one row; if it is a Failure, the DataFrame is empty.
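
A minimal sketch of that check, assuming the DataFrame df from the question is already loaded; the output path /tmp/wide_table_out is hypothetical:

import scala.util.{Failure, Success, Try}

// head only fetches a single row, so this is much cheaper than a full count
val hasRows = Try(df.head) match {
  case Success(_) => true
  case Failure(_) => false
}

// inspect the structure without scanning any data
df.printSchema()

// persist the data (or a derived result) instead of repeatedly calling show/count
df.write.mode("overwrite").parquet("/tmp/wide_table_out")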

0 votes

When operating on the data frame, you may want to consider selecting only those columns that are of interest to you (i.e. df.select(columns...)) before performing any aggregation. This may trim down the size of your set considerably. Also, if any filtering needs to be done, do that first as well.
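
A small sketch of this, assuming the DataFrame df from the question; the column names column0_0 and column1_1 are taken from the question's schema, and the filter predicate is only illustrative:

import org.apache.spark.sql.functions.{col, count, lit}

// filter first to cut rows, then select to cut columns
val slim = df
  .filter(col("column0_0") > 0.0)
  .select("column0_0", "column1_1")

// aggregate on the narrow projection instead of the full 10500-column frame
slim.agg(count(lit(1))).show()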

0 votes

I found this answer, which may be helpful to you:

Spark SQL is not well suited to processing wide data (more than about 1,000 columns). If possible, you can use a vector or map column to work around this.
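
One way to apply that suggestion, sketched under the assumption that df is the wide DataFrame from the question (all columns are doubles, so they can be packed into a single array column; the output path /tmp/packed_out is hypothetical):

import org.apache.spark.sql.functions.{array, col}

// collapse the 10500 double columns into one array<double> column,
// so the planner and the parquet schema deal with a single field
val packed = df.select(array(df.columns.map(col): _*).as("values"))

packed.printSchema()
packed.write.mode("overwrite").parquet("/tmp/packed_out")

Later queries then read one array column rather than thousands of scalar columns, which is what the quoted advice about vector or map columns is getting at.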