9
votes

According to the BigQuery query reference, quantiles currently cannot be grouped by another column. I am mainly interested in getting medians grouped by a certain column. The only workaround I see right now is to generate one quantile query per distinct group member, with the group member as a condition in the WHERE clause.

For example, to get the desired result I run the query below once for every distinct value in column-y.

SELECT QUANTILES( <column-x>, 1001)
FROM <table>
WHERE 
    <column-y> == <each distinct value in column-y>

  1. Does the BigQuery team plan to add a way to group quantiles in the future?
  2. Is there a better way to get per-group medians than this?
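
A minimal sketch of the per-group workaround described above, written in Python. The helper name `build_quantile_queries` and the table and column names are hypothetical placeholders, and the quoting is deliberately naive (it assumes group values contain no quote characters), so treat it as an illustration rather than production code.

```python
# Hypothetical helper: builds one quantile query string per distinct group value.
# Table and column names are placeholders, not taken from a real dataset.

def build_quantile_queries(table, value_col, group_col, group_values, buckets=1001):
    """Return a dict mapping each group value to its own query string."""
    queries = {}
    for value in group_values:
        # Naive literal formatting: strings are wrapped in single quotes,
        # everything else is rendered with str(). Not safe for untrusted input.
        if isinstance(value, str):
            literal = "'" + value + "'"
        else:
            literal = str(value)
        queries[value] = (
            f"SELECT QUANTILES({value_col}, {buckets}) "
            f"FROM {table} "
            f"WHERE {group_col} = {literal}"
        )
    return queries


if __name__ == "__main__":
    for group, sql in build_quantile_queries(
        "[mydataset.mytable]", "column_x", "column_y", ["a", "b"]
    ).items():
        print(group, "->", sql)
```

This makes the cost of the workaround visible: the number of queries grows with the number of distinct values in the grouping column.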

Thanks

3 Answers

7
votes

With the recently announced percentile_cont() window function, you can get a median per group.

Look at the example in the announcement blog post:

http://googlecloudplatform.blogspot.com/2013/06/google-bigquery-bigger-faster-smarter-analytics-functions.html

SELECT MAX(median) AS median, room FROM (
  SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
  FROM [io_sensor_data.moscone_io13]
  WHERE sensortype='temperature'
)
GROUP BY room
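
As I read the query above, the window function attaches the same per-room median to every row of that room, and the outer MAX with GROUP BY collapses those repeated values to one row per room. The Python sketch below mimics the end result on made-up readings. The function name `median_by_group` and the sample data are assumptions for illustration, and `statistics.median` stands in for the 0.5 percentile (it averages the two middle values when a group has an even number of rows).

```python
from collections import defaultdict
from statistics import median


def median_by_group(rows):
    """rows: iterable of (group, value) pairs. Returns {group: median}."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    # One median per group, analogous to one output row per room.
    return {group: median(values) for group, values in groups.items()}


if __name__ == "__main__":
    # Made-up sensor readings, for illustration only.
    readings = [
        ("room_a", 20.0), ("room_a", 22.0), ("room_a", 21.0),
        ("room_b", 18.0), ("room_b", 19.0),
    ]
    print(median_by_group(readings))
```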

1
votes

While there are efficient algorithms for computing quantiles, they are somewhat memory intensive, so running multiple quantile calculations in a single query gets expensive.

  1. There are plans to improve QUANTILES, but I don't know what the timeline is.
  2. Do you need the median specifically? Could you filter out outliers and take the average of the remainder?
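
One way to read the suggestion in point 2 is a trimmed mean: drop a fixed share of the smallest and largest values, then average what is left. The sketch below assumes that interpretation. The function name `trimmed_mean` and the default 10% trim are hypothetical choices, not something the answer specifies.

```python
def trimmed_mean(values, trim_fraction=0.1):
    """Average of values after dropping trim_fraction of them from each end."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    # Number of values to drop from each end (rounded down).
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # The outlier 100 is dropped along with the smallest value.
    print(trimmed_mean([1, 2, 3, 4, 100], trim_fraction=0.2))
```

Unlike a median, this needs only a filter and an ordinary average per group, which is why it may be cheaper when an exact median is not required.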

0
votes

If your per-group size is fixed, you may be able to hack it using a combination of ORDER BY, NEST, and NTH. For instance, if there are 9 distinct values of f2 per value of f1, the median is the 5th value:

select f1,nth(5,f2) within record from (
  select f1,nest(f2) f2 from (
    select f1, f2 from table
    group by f1,f2
    order by f2
  ) group by f1
);

I'm not sure whether the sort order from the subquery is guaranteed to survive the second GROUP BY, but it worked in a simple test I tried.
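
The choice of 5 above follows from the group size: for an odd number n of sorted values, the median sits at 1-based position (n + 1) / 2. A small Python sketch of that arithmetic, with the hypothetical helper name `median_position`, under the assumption that the group size is odd and known in advance:

```python
from statistics import median


def median_position(group_size):
    """1-based position of the median in a sorted group of odd size."""
    if group_size < 1 or group_size % 2 == 0:
        raise ValueError("group_size must be a positive odd number")
    return (group_size + 1) // 2


if __name__ == "__main__":
    values = [14, 3, 27, 8, 19, 1, 22, 11, 6]  # 9 made-up distinct values
    pos = median_position(len(values))
    # Picking the value at that position of the sorted list gives the median.
    assert sorted(values)[pos - 1] == median(values)
    print(pos)
```

For even group sizes there is no single middle element, so this trick does not apply without averaging two positions.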