Google Big Query: New Column of Aggregate Based On Condition of Current Row

Question

Using the Google Big Query database bigquery-public-data.crypto_ethereum_classic.transactions as reference.

For each transaction row, I want to calculate the count of all transactions to the same address that occurred before that transaction, and sum of the gas usage of them. I am sure I can do this with a join as I have tried and Google accepts my old query, but since there is so much data as a result of the (inner) join, there is almost always a "quota limit exceeded" error. At the same time, I think a subquery solution is inefficient, as it is querying almost the same thing in both aggregate functions.

In a perfect world the query would use something like a join to create a temporary table with all columns I need (transaction_hash, receipt_gas_used, to_address, block_timestamp), according to the conditions (where to_address = table_1.to_address and block_timestamp < table_1.block_timestamp), where I can then perform the aggregate functions on the columns of that table.

What I have so far and what I'm looking for is something like...:

SELECT 
    table_1.*,
    COUNT(
        DISTINCT IF(block_timestamp < table_1.block_timestamp and to_address = table_1.to_address, `hash`, NULL)
    ) as txn_count,
    SUM(
        IF(block_timestamp < table_1.block_timestamp and to_address = table_1.to_address, `receipt_gas_used`, NULL)
    ) as total_gas_used
from 
    `bigquery-public-data.crypto_ethereum_classic.transactions` as table_1 
where block_number >= 3000000 and block number <= 3500000 #just to subset the data a bit

Gordon Linoff Gordon Linoff · Accepted Answer · 2019-12-04T20:34:09

I think you want window functions:

select t.*,
       row_number() over (partition by to_address order by block_timestamp) as txn_seqnum,
       sum(receipt_gas_used) over (partition by to_address order by block_timestamp) as total_gas_used
from `bigquery-public-data.crypto_ethereum_classic.transactions` as t 
where block_number >= 3000000 and block number <= 3500000 #just to subset the

If you really have ties and need the distinct, then use dense_rank() instead of row_number().

Google Big Query: New Column of Aggregate Based On Condition of Current Row

1 Answers