My cluster has 6 nodes, each with 2 cores. I have a Spark job that saves a ~150 MB Parquet file to HDFS. If I repartition my dataframe to 6 partitions before saving, Drill queries against it are actually 30-40% slower than when I repartition it to 1 partition. Why is that? Is it expected? Can it indicate an issue with my setup? The write step looks roughly like the sketch below.
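A minimal sketch of what I mean by "repartition before saving" (the `df`, paths, and app name are placeholders, not my actual job; `numPartitions` is the value I vary in the timings below):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-partition-test")   // hypothetical app name
  .getOrCreate()

// Hypothetical source; the real dataframe comes from my job's pipeline
val df = spark.read.parquet("hdfs:///data/source")

val numPartitions = 6                  // varied: 1, 2, 3, 6, 12, 24, 48

// Repartition, then write Parquet to HDFS; Drill queries this output directory
df.repartition(numPartitions)
  .write
  .mode("overwrite")
  .parquet("hdfs:///data/output")
```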
Update
Results of the same SQL query in seconds (3 runs per number of partitions):
1 partition: 1.238, 1.29, 1.404
2 partitions: 1.286, 1.175, 1.259
3 partitions: 1.699, 1.8, 1.7
6 partitions: 2.223, 1.96, 1.772
12 partitions: 1.311, 1.335, 1.339
24 partitions: 1.261, 1.302, 1.235
48 partitions: 1.664, 1.757, 2.133
As you can see, 1, 2, 12, and 24 partitions are fast, while 3, 6, and 48 partitions are clearly slower. What could be causing that?