I have a small Parquet file (7.67 MB) in HDFS, compressed with Snappy. The file has 1,300 rows and 10,500 columns, all of type double. When I create a DataFrame from the Parquet file and run a simple operation such as count, it takes 18 seconds.
scala> val df = spark.read.format("parquet").load("/path/to/parquet/file")
df: org.apache.spark.sql.DataFrame = [column0_0: double, column1_1: double ... 10498 more fields]
scala> df.createOrReplaceTempView("table")
scala> spark.time(sql("select count(1) from table").show)
+--------+
|count(1)|
+--------+
| 1300|
+--------+
Time taken: 18402 ms
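
For what it's worth, if all I needed were the row count, I could probably read it straight from the Parquet footer metadata instead of running a Spark job. This is just a rough, untested sketch, assuming the parquet-hadoop classes bundled with Spark are on the shell classpath and that the path points at a single Parquet file rather than a directory of part files:

import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Read the row count from the Parquet footer; no Spark job is launched.
// Assumes "/path/to/parquet/file" is a single .parquet file, not a directory of part files.
val hadoopConf = spark.sparkContext.hadoopConfiguration
val inputFile = HadoopInputFile.fromPath(new Path("/path/to/parquet/file"), hadoopConf)
val reader = ParquetFileReader.open(inputFile)
val rowCount = try reader.getRecordCount() finally reader.close()
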
More generally, can anything be done to improve query performance on such wide files?