I found that using Spark SQL (from PySpark) to query a DataFrame loaded from multiple parquet files is much less efficient than querying the same amount of data loaded from a single parquet file, even though the filter condition is not on the first column (so I guess it is not an indexing issue). Does anyone know why this happens, and how I could make the query response time in the first case as fast as in the second?

# the parquet files are stored on hdfs
hdfs_path = 'hdfs://localhost:9000/Test/'
paths = [hdfs_path+'p1.parquet', hdfs_path+'p2.parquet', hdfs_path+'p3.parquet']

# load the parquet files into a DataFrame
dfs = sqlContext.read.parquet(*paths)

# query from the DataFrame
sql = '_c2 > 4 and _c2 < 10'
query_result = dfs.filter(sql).collect() # slower

# now write the DataFrame as 1 parquet and reload
dfs.write.mode('overwrite').parquet(hdfs_path+'all_in_one.parquet')
df = sqlContext.read.parquet(hdfs_path+'all_in_one.parquet')

# the query response time is much faster
query_result = df.filter(sql).collect() # faster

1 Answer

Answering my own question:

It seems the main reason is that each file is too small, so the time spent checking parquet metadata dominates the query response time. When the files are large enough (i.e., at least as large as the HDFS block size, which they should be), the query response time is instead dominated by the I/O of the results.
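
For anyone hitting the same issue, here is a minimal sketch of compacting the small parquet files into one larger file before querying, reusing the sqlContext, hdfs_path, and column _c2 from the question. The output name compacted.parquet and the coalesce(1) target are just example choices; for bigger datasets, pick a partition count that keeps each output file near the block size.

# compact many small parquet files into one larger file before querying
# (assumes the same sqlContext and hdfs_path as in the question;
#  'compacted.parquet' and coalesce(1) are example choices, not requirements)
hdfs_path = 'hdfs://localhost:9000/Test/'
paths = [hdfs_path + 'p1.parquet', hdfs_path + 'p2.parquet', hdfs_path + 'p3.parquet']

dfs = sqlContext.read.parquet(*paths)

# coalesce to a single partition so Spark writes one large parquet file;
# for larger data, choose a partition count that keeps files near the block size
dfs.coalesce(1).write.mode('overwrite').parquet(hdfs_path + 'compacted.parquet')

# reload and query the compacted file; metadata overhead is now negligible
df = sqlContext.read.parquet(hdfs_path + 'compacted.parquet')
query_result = df.filter('_c2 > 4 and _c2 < 10').collect()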