I've found that using Spark SQL (from PySpark) to query a DataFrame loaded from multiple Parquet files is much less efficient than querying the same amount of data loaded from a single Parquet file, even though the filter condition is not on the first column (so I don't think it is an indexing issue). Does anyone know why this happens, and how I can make the first query's response time as fast as the second's?
# sqlContext below is the SQLContext predefined by the pyspark shell
# the parquet files are stored on HDFS
hdfs_path = 'hdfs://localhost:9000/Test/'
paths = [hdfs_path + 'p1.parquet', hdfs_path + 'p2.parquet', hdfs_path + 'p3.parquet']
# load the parquet files into one DataFrame
dfs = sqlContext.read.parquet(*paths)
# filter the DataFrame on _c2 (not the first column)
sql = '_c2 > 4 and _c2 < 10'
query_result = dfs.filter(sql).collect() # slower
# now write the DataFrame out as a single parquet dataset and reload it
dfs.write.mode('overwrite').parquet(hdfs_path+'all_in_one.parquet')
df = sqlContext.read.parquet(hdfs_path+'all_in_one.parquet')
# the query response time is much faster
query_result = df.filter(sql).collect() # faster
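To compare the two cases, one thing that can be checked is how Spark splits each read into partitions, and what physical plan it produces for each filter. This is a minimal diagnostic sketch, assuming the same sqlContext, dfs, df, and sql as above; I have not confirmed this explains the gap:

# compare how many partitions (and hence tasks) each read produces
print(dfs.rdd.getNumPartitions())  # DataFrame read from the three files
print(df.rdd.getNumPartitions())   # DataFrame read from the rewritten dataset
# compare the physical plans Spark builds for the same filter
dfs.filter(sql).explain()
df.filter(sql).explain()

If the multi-file read produces a very different partition count or plan, the difference in task scheduling and input splits could account for part of the slowdown.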