I have about 3000 Parquet files (with a .gz.parquet extension), ranging in size from 40 MB to 160 MB, stored in HDFS. The HDFS block size is 128 MB, and the total size of all the files is about 360 GB. All the files are in the same directory, i.e., the data is not partitioned.
I start the Spark SQL Thrift Server on 16 nodes and give each executor 160 GB of memory. Each node has 256 GB of memory and 32 cores.
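For reference, the launch looks roughly like this, assuming one executor per node (treat the flags as an approximation of what I run, not the literal command):

    ./sbin/start-thriftserver.sh \
      --num-executors 16 \
      --executor-memory 160g \
      --executor-cores 32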
I create an external table using "create external table acdata (....) stored as parquet location '/user/ac/data'"
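Spelled out, the DDL is essentially the following (the real column list is long, so the columns shown here are just placeholders):

    CREATE EXTERNAL TABLE acdata (
      -- placeholder columns; the real schema has many more
      event_time TIMESTAMP,
      account_id STRING,
      value DOUBLE
    )
    STORED AS PARQUET
    LOCATION '/user/ac/data';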
Then I cache the data using "cache table acdata". Doing this creates about 4000 partitions, as expected. The cached data takes up about 1200 GB of memory across the cluster and should all fit, since I have 1280 GB available for caching (16 executors * 160 GB / 2, assuming roughly half of each executor's memory is usable for storage). Partition sizes range from 2 MB to 530 MB. The problem is that the partitions are not evenly distributed across the nodes: some nodes have 50 GB of cached data while others have 80 GB.
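The per-node numbers above come from looking at per-executor storage after the CACHE TABLE finishes; something like the request below against the Thrift Server's application UI (host and application id are placeholders) reports the same skew in the per-executor memoryUsed figures:

    curl http://<driver-host>:4040/api/v1/applications/<app-id>/executors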
I then run a SELECT query from a JDBC client. When nodes with less cached data finish processing their local partitions, they start processing partitions cached on other nodes, which causes that data to be sent over the network and makes the query take longer.
Repartitioning the data stored on HDFS is not an option, because data is added to this directory every day and its size varies from day to day; I would have to repartition ALL the data every day instead of just incrementally adding files to the same directory.
Ideally, I would like to find a way to distribute the cached data evenly across all the nodes so that every task takes about the same amount of time.
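For example, I could imagine caching a rebalanced copy of the table with something like the statement below (the distribution key is a placeholder, and I have not verified that this actually spreads the cached blocks evenly), but it needs memory for both copies while they coexist and still leaves block placement up to Spark:

    CACHE TABLE acdata_balanced AS
    SELECT * FROM acdata DISTRIBUTE BY account_id;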
If that is not possible, I would prefer that nodes only process data that is cached locally. I can get a small performance boost by increasing the value of "spark.locality.wait", but that affects all tasks.
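Concretely, I bump the locality wait when starting the Thrift Server (the 10s value below is just an example); because it is an application-wide setting, it delays scheduling for every stage, not just scans of this cached table:

    ./sbin/start-thriftserver.sh \
      --conf spark.locality.wait=10s \
      ...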