Suppose I have a DataFrame called transactions with the following integer columns: year, month, day, timestamp, transaction_id.
In [1]: transactions = ctx.createDataFrame(
   ...:     [(2017, 12, 1, 10000, 1),
   ...:      (2017, 12, 2, 10001, 2),
   ...:      (2017, 12, 3, 10003, 3),
   ...:      (2017, 12, 4, 10004, 4),
   ...:      (2017, 12, 5, 10005, 5),
   ...:      (2017, 12, 6, 10006, 6)],
   ...:     ('year', 'month', 'day', 'timestamp', 'transaction_id'))
In [2]: transactions.show()
+----+-----+---+---------+--------------+
|year|month|day|timestamp|transaction_id|
+----+-----+---+---------+--------------+
|2017|   12|  1|    10000|             1|
|2017|   12|  2|    10001|             2|
|2017|   12|  3|    10003|             3|
|2017|   12|  4|    10004|             4|
|2017|   12|  5|    10005|             5|
|2017|   12|  6|    10006|             6|
+----+-----+---+---------+--------------+
I want to define a function filter_date_range that returns a DataFrame consisting of the transaction rows that fall within a given date range, for example:
>>> filter_date_range(
...     df=transactions,
...     start_date=datetime.date(2017, 12, 2),
...     end_date=datetime.date(2017, 12, 4)).show()
+----+-----+---+---------+--------------+
|year|month|day|timestamp|transaction_id|
+----+-----+---+---------+--------------+
|2017|   12|  2|    10001|             2|
|2017|   12|  3|    10003|             3|
|2017|   12|  4|    10004|             4|
+----+-----+---+---------+--------------+
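To make the intended semantics concrete, a naive implementation that ignores the partitioning entirely could look roughly like the sketch below (purely illustrative; make_date is only available from Spark 3.0+, and start_date/end_date are assumed to be datetime.date objects as in the call above):

import pyspark.sql.functions as F

def filter_date_range(df, start_date, end_date):
    # Build a proper date column from the integer year/month/day columns
    # and keep the rows whose date falls in the inclusive range.
    date_col = F.make_date(df['year'], df['month'], df['day'])
    return df.where(date_col.between(F.lit(start_date), F.lit(end_date)))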
Assuming the data is held in Hive partitions, partitioned by year, month and day, what is the most efficient way to perform a filter like this, which involves date arithmetic? I am looking for a way to do it purely with the DataFrame API, without resorting to transactions.rdd, so that Spark can infer that only a subset of the partitions actually needs to be read.
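For illustration, one direction I can imagine is to do the date arithmetic on the driver and push only literal comparisons onto the partition columns, as in the (again purely illustrative) sketch below; whether this actually lets Spark prune the partitions, and whether there is a more idiomatic way, is exactly what I am unsure about:

import datetime
import operator
from functools import reduce

def filter_date_range(df, start_date, end_date):
    # Enumerate the dates in the inclusive range on the driver, then turn
    # them into an OR of plain equality predicates on the partition
    # columns, so that year/month/day are only ever compared to literals.
    n_days = (end_date - start_date).days + 1
    days = [start_date + datetime.timedelta(days=i) for i in range(n_days)]
    predicate = reduce(operator.or_, [
        (df['year'] == d.year) & (df['month'] == d.month) & (df['day'] == d.day)
        for d in days
    ])
    return df.where(predicate)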