I am writing a Parquet file to HDFS using the following command:
df.write.mode(SaveMode.Append).partitionBy("id").parquet(path)
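For context, df is built roughly like this (the column names other than the partition column "id" are placeholders, but the types line up with the getters I use when reading the file back below):

import sqlContext.implicits._

// Hypothetical sample data; "id" is the partition column.
val df = Seq(
  ("a", 10, 1458517953L, "x", 1),
  ("b", 20, 1458517954L, "y", 42)
).toDF("name", "count", "ts", "payload", "id")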
Afterwards I am reading and filtering the file like this:
val file = sqlContext.read.parquet(folder)
// Reorder the columns: position 0 of the new Row is the partition column ("id") as a String
val data = file.map(r => Row(r.getInt(4).toString, r.getString(0), r.getInt(1),
  r.getLong(2), r.getString(3)))
val filteredData = data.filter(x => x.getString(0).equals("1"))
filteredData.collect()
I would expect Spark to leverage the partitioning of the file and read only the partition for id = 1. In fact, Spark reads all partitions, not just the one matching the filter (id=1). Looking at the logs, I can see that it reads everything:
16/03/21 01:32:33 INFO ParquetRelation: Reading Parquet file(s) from hdfs://sandbox.hortonworks.com/path/id=1/part-r-00000-b4e27b02-9a21-4915-89a7-189c30ca3fe3.gz.parquet
16/03/21 01:32:33 INFO ParquetRelation: Reading Parquet file(s) from hdfs://sandbox.hortonworks.com/path/id=42/part-r-00000-b4e27b02-9a21-4915-89a7-189c30ca3fe3.gz.parquet
16/03/21 01:32:33 INFO ParquetRelation: Reading Parquet file(s) from hdfs://sandbox.hortonworks.com/path/id=17/part-r-00000-b4e27b02-9a21-4915-89a7-189c30ca3fe3.gz.parquet
16/03/21 01:32:33 INFO ParquetRelation: Reading Parquet file(s) from hdfs://sandbox.hortonworks.com/path/0833/id=33/part-r-00000-b4e27b02-9a21-4915-89a7-189c30ca3fe3.gz.parquet
16/03/21 01:32:33 INFO ParquetRelation: Reading Parquet file(s) from hdfs://sandbox.hortonworks.com/path/id=26/part-r-00000-b4e27b02-9a21-4915-89a7-189c30ca3fe3.gz.parquet
16/03/21 01:32:33 INFO ParquetRelation: Reading Parquet file(s) from hdfs://sandbox.hortonworks.com/path/id=12/part-r-00000-b4e27b02-9a21-4915-89a7-189c30ca3fe3.gz.parquet
Is there anything I am missing? From what I read in the docs, Spark should know, based on the filter, that it only needs to read the partition with id=1. Does anyone have an idea what the problem is?
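For reference, this is the kind of filter I would have expected to trigger partition pruning (a sketch of my own; the column name "id" and the explain() check are assumptions on my part, not something from the docs):

import org.apache.spark.sql.functions.col

// Filter directly on the DataFrame, on the partition column, before any map:
val pruned = sqlContext.read.parquet(folder).filter(col("id") === 1)
pruned.explain()   // I expected the plan / logs to show only .../id=1/ being read
pruned.collect()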