I have set up a Spark 1.3.1 application that collects event data. One of the attributes is a timestamp called 'occurredAt'. I'm intending to partition the event data into parquet files on a filestore, and according to the documentation (https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#partition-discovery) time-based values are not supported, only string and int. So I've split the date into year, month, and day values and partitioned as follows:
events
|---occurredAtYear=2015
| |---occurredAtMonth=07
| | |---occurredAtDay=16
| | | |---<parquet-files>
...
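For context, the job derives the three partition values from occurredAt and builds each output directory path along these lines (a pure-Python sketch of the layout logic only; the function and names are illustrative, not my actual code):

```python
from datetime import datetime
import posixpath

def partition_path(root, occurred_at):
    """Build the partition directory for a timestamp, matching the tree above."""
    ts = datetime.strptime(occurred_at, "%Y-%m-%dT%H:%M:%S")
    return posixpath.join(
        root,
        "occurredAtYear=%d" % ts.year,      # four-digit year
        "occurredAtMonth=%02d" % ts.month,  # zero-padded, as shown in the tree
        "occurredAtDay=%02d" % ts.day,
    )

print(partition_path("/var/tmp/events", "2015-07-16T09:30:00"))
# /var/tmp/events/occurredAtYear=2015/occurredAtMonth=07/occurredAtDay=16
```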
I then load the parquet files from the root path, /var/tmp/events:
sqlContext.parquetFile('/var/tmp/events')
The documentation says:
'Spark SQL will automatically extract the partitioning information from the paths'
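My understanding of that sentence is that discovery should parse the key=value segments out of each file's directory path into extra columns, roughly equivalent to this pure-Python sketch (my own illustration of the expected behaviour, not Spark's actual code):

```python
def extract_partitions(path):
    """Collect key=value directory segments the way partition discovery should."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

print(extract_partitions(
    "/var/tmp/events/occurredAtYear=2015/occurredAtMonth=07/occurredAtDay=16/part-00000.parquet"
))
# {'occurredAtYear': '2015', 'occurredAtMonth': '07', 'occurredAtDay': '16'}
```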
However my query
SELECT * FROM events where occurredAtYear=2015
fails, with Spark saying it cannot resolve 'occurredAtYear'.
I can see the schema for all other attributes of the event and can run queries against them, but printSchema() does not list occurredAtYear/Month/Day at all. What am I missing to get partition discovery working properly?
Cheers