2 votes

I have a small parquet file (7.67 MB) in HDFS, compressed with snappy. The file has 1300 rows and 10500 columns, all double values. When I create a data frame from the parquet file and perform a simple operation like count, it takes 18 seconds.

scala> val df = spark.read.format("parquet").load("/path/to/parquet/file")
df: org.apache.spark.sql.DataFrame = [column0_0: double, column1_1: double ... 10498 more fields]

scala> df.registerTempTable("table")

scala> spark.time(sql("select count(1) from table").show)
+--------+
|count(1)|
+--------+
|    1300|
+--------+

Time taken: 18402 ms

Can anything be done to improve query performance on such wide files?


3 Answers

3 votes

Hey, glad you are here in the community,

Count and show are actions, not lazy transformations: they force Spark to evaluate every record, so they are always costly, and on a frame this wide that cost adds up. If you only need the result, write it back to a file or database instead of displaying it. To inspect the structure you can use df.printSchema(), and a simple way to check whether a DataFrame has any rows is Try(df.head): if it is a Success, there is at least one row; if it is a Failure, the DataFrame is empty.
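
A minimal sketch of that check, assuming the DataFrame df from the question is already loaded; the output path /tmp/wide_table_out is hypothetical:

import scala.util.{Failure, Success, Try}

// head only fetches a single row, so this is much cheaper than a full count
val hasRows = Try(df.head) match {
  case Success(_) => true
  case Failure(_) => false
}

// inspect the structure without scanning any data
df.printSchema()

// persist the data (or a derived result) instead of repeatedly calling show/count
df.write.mode("overwrite").parquet("/tmp/wide_table_out")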

0 votes

When operating on the data frame, you may want to consider selecting only those columns that are of interest to you (i.e. df.select(columns...)) before performing any aggregation. This may trim down the size of your set considerably. Also, if any filtering needs to be done, do that first as well.
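
A small sketch of this, assuming the DataFrame df from the question; the column names column0_0 and column1_1 are taken from the question's schema, and the filter predicate is only illustrative:

import org.apache.spark.sql.functions.{col, count, lit}

// filter first to cut rows, then select to cut columns
val slim = df
  .filter(col("column0_0") > 0.0)
  .select("column0_0", "column1_1")

// aggregate on the narrow projection instead of the full 10500-column frame
slim.agg(count(lit(1))).show()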

0 votes

I found this answer, which may be helpful to you:

Spark SQL is not well suited to processing wide data (more than about 1,000 columns). If possible, you can use a vector or map column to work around this.
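
One way to apply that suggestion, sketched under the assumption that df is the wide DataFrame from the question (all columns are doubles, so they can be packed into a single array column; the output path /tmp/packed_out is hypothetical):

import org.apache.spark.sql.functions.{array, col}

// collapse the 10500 double columns into one array<double> column,
// so the planner and the parquet schema deal with a single field
val packed = df.select(array(df.columns.map(col): _*).as("values"))

packed.printSchema()
packed.write.mode("overwrite").parquet("/tmp/packed_out")

Later queries then read one array column rather than thousands of scalar columns, which is what the quoted advice about vector or map columns is getting at.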