0 votes

I want to compute the average spending per merchant over the last 90 days. I have been doing that with PySpark SQL:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Convert the date column to Unix epoch seconds so the window frame can use a numeric range
df_spark = df_spark.withColumn("t_unix", F.unix_timestamp(df_spark['date']))

# Frame: from 90 days (in seconds) before the current row up to 1 second before it
windowSpec = Window.partitionBy("merchant").orderBy("t_unix").rangeBetween(-90 * 24 * 3600, -1)
average_spending = F.avg(df_spark['amount']).over(windowSpec)

df = df_spark.withColumn("average_spending", average_spending)

df.select('merchant', 'date', 'amount', 'average_spending').show(5)

+---------+-------------------+-------+----------------+
| merchant|date               |amount |average_spending|
+---------+-------------------+-------+----------------+
| 26      |2017-01-01 01:11:06|  3    |            null|
| 26      |2017-01-01 02:02:15| 54    |             3.0|
| 26      |2017-01-01 02:26:45|  6    |            28.5|
| 26      |2017-01-01 02:40:37|  4    |            21.0|
| 26      |2017-01-01 02:41:51| 85    |           16.75|
+---------+-------------------+-------+----------------+
only showing top 5 rows

And now I want to do it in AWS Athena (Presto).

I tried the query below:

SELECT
   "date",
   "merchant",
   "amount",
   AVG("amount") 
   FROM "table"
   WHERE ("date" BETWEEN date_add('day', -90, "date") and "date")
   GROUP BY "merchant"
   ORDER BY "date"
   LIMIT 5

But I got the error message:

Your query has the following error(s):

SYNTAX_ERROR: line 7:24: Unexpected parameters (varchar(3), integer, varchar) for function date_add. Expected: date_add(varchar(x), bigint, date) , date_add(varchar(x), bigint, time) , date_add(varchar(x), bigint, time with time zone) , date_add(varchar(x), bigint, timestamp) , date_add(varchar(x), bigint, timestamp with time zone)

But in date_add('day', -90, "date") I want "date" to be the timestamp of the current row, not a static timestamp.

I made another attempt with:

SELECT
   unix_date,
   merchant,
   amount,
   AVG(amount)
      OVER 
      (  PARTITION BY merchant
         ORDER BY unix_date
         RANGE BETWEEN INTERVAL '90' DAY PRECEDING AND CURRENT ROW
      ) AVG_S
FROM ...;

But I got the error message:

SYNTAX_ERROR: line 5:4: Window frame start value type must be INTEGER or BIGINT (actual interval day to second)

There is a similar unsolved issue here: Presto SQL window aggregate looking back x hours/minutes/seconds

What do you mean by "current timestamp of the row"? Have a look at the from_unixtime function (prestosql.io/docs/current/functions/datetime.html#from_unixtime). Does it help? - Piotr Findeisen
I am trying to find the function with which I can calculate, at any point in time, the average spending for the last 90 days. A recurring/rolling average over the last 90 days, but when I use the above I get an error. For example, if I have a list of all the days in a year, I need to calculate the average of the last 90 days for every day. - Florian
Do you have a data point for each merchant and day, or is it sparse? What's the type of the date column? - Piotr Findeisen
@PiotrFindeisen Yes, a data point for each merchant and day, and the type of the date column is a string. - Florian

2 Answers

1 vote

This has been working for me:

CREATE TABLE IF NOT EXISTS full_year_query_parquet
  WITH (format = 'PARQUET',
        parquet_compression = 'SNAPPY',
        external_location = 's3://your_s3_bucket/data') AS
SELECT
    a.merchant,
    a.amount,
    a.date,
    avg(preceding.amount) AS average_spending
  FROM "your_table" as a
  -- Self-join: pair each row with the same merchant's rows from the previous 90 days
  JOIN "your_table" as preceding ON a.merchant = preceding.merchant
    AND preceding.date > DATE_ADD('day', -90, a.date)
    AND preceding.date < a.date -- strictly before the current row, like the PySpark upper bound of -1
  GROUP BY a.merchant, a.amount, a.date
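
Note that the question's "date" column is a varchar (per the comments), so DATE_ADD will not accept it directly. A minimal sketch of parsing it first; the table name and the '%Y-%m-%d %H:%i:%s' format are assumptions taken from the sample output:

-- Sketch: parse the varchar "date" column into a proper timestamp before joining
WITH t AS (
  SELECT
      merchant,
      amount,
      date_parse("date", '%Y-%m-%d %H:%i:%s') AS ts -- format assumed from the sample rows
    FROM "your_table"
)
SELECT
    a.merchant,
    a.amount,
    a.ts,
    avg(preceding.amount) AS average_spending
  FROM t as a
  JOIN t as preceding ON a.merchant = preceding.merchant
    AND preceding.ts > DATE_ADD('day', -90, a.ts)
    AND preceding.ts < a.ts
  GROUP BY a.merchant, a.amount, a.ts
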
0 votes

If you have one data point for each merchant×date, this can be done easily using Presto's window functions:

SELECT
  date, merchant, amount,
  avg(amount) OVER (
    PARTITION BY merchant
    ORDER BY date ASC
    ROWS 89 PRECEDING) -- 89 preceding rows + the current row
  FROM ...
  ORDER BY date ASC -- not necessary, but you likely want the data to be sorted as well

If you have a varying number of data points for each merchant and date, you can do something like this:

SELECT
  c.merchant, c.date, c.amount,
  avg(preceding.amount) AS average_spending
  FROM your_table c
  JOIN your_table preceding ON c.merchant = preceding.merchant
    AND preceding.date BETWEEN (c.date - INTERVAL '89' DAY) AND c.date -- 90-day window, current date included
  GROUP BY c.merchant, c.date, c.amount
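
As for the question's second attempt: the error says the frame offset must be INTEGER or BIGINT, which suggests ordering by epoch seconds and using a numeric offset, just like the PySpark version. A sketch under that assumption (table name and date format are again placeholders); be aware that support for numeric RANGE offsets varies across Presto/Trino versions, and some older engines evaluate them like ROWS frames, so verify the output before relying on it:

SELECT
  date, merchant, amount,
  avg(amount) OVER (
    PARTITION BY merchant
    -- varchar date -> timestamp -> epoch seconds, cast to BIGINT for the frame offset
    ORDER BY CAST(to_unixtime(date_parse(date, '%Y-%m-%d %H:%i:%s')) AS BIGINT)
    RANGE BETWEEN 7776000 PRECEDING AND CURRENT ROW) -- 90 * 24 * 3600 seconds
  FROM your_table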