7
votes

I'm having trouble calculating the median (not the average) of a list of values.

I found this article: "Simple way to calculate median with MySQL".

It refers to the following query, which I don't fully understand.

SELECT x.val from data x, data y
GROUP BY x.val
HAVING SUM(SIGN(1-SIGN(y.val-x.val))) = (COUNT(*)+1)/2

If I have a time column and I want to calculate the median value, what do the x and y columns refer to?

7
Note that the solution you mention will not find the median if there are duplicate values. (It fails when the median itself has duplicates.) - Mark Biesheuvel
I honestly don't understand how MySQL is used by millions of people and has been around for decades but doesn't have a function to calculate a median. Are there any other data-centric systems that haven't implemented math that is usually taught to 9–10 year olds in 4th grade? - Monica Heddneck

7 Answers

11
votes

I propose a faster way.

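Get the row count and store it in a user variable:

SELECT CAST(CEIL(COUNT(*)/2) AS UNSIGNED) INTO @middlevalue FROM data;

Then take the middle value in a sorted subquery. LIMIT does not accept a user variable directly, so pass it through a prepared statement:

PREPARE stmt FROM 'SELECT MAX(val) FROM (SELECT val FROM data ORDER BY val LIMIT ?) x';
EXECUTE stmt USING @middlevalue;
DEALLOCATE PREPARE stmt;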

I tested this on a dataset of 5x10e6 random numbers, and it found the median in under 10 seconds.

The same approach finds an arbitrary percentile: replace COUNT(*)/2 with COUNT(*)*n, where n is the percentile as a fraction (.5 for the median, .75 for the 75th percentile, etc.).
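To make the percentile idea concrete, here is a minimal Python sketch of the same logic: sort, keep the first CEIL(COUNT * n) values, and take the largest of them. The function name percentile_by_limit and the sample lists are made up for illustration. Note that for an even number of values with n = .5 this returns the lower of the two middle values, not their average.

```python
import math


def percentile_by_limit(values, n=0.5):
    """Largest of the first CEIL(COUNT * n) values in sorted order.

    Hypothetical helper mirroring: SELECT MAX(val) FROM (... ORDER BY val LIMIT k).
    """
    ordered = sorted(values)              # ORDER BY val
    limit = math.ceil(len(ordered) * n)   # CEIL(COUNT(*) * n)
    if limit < 1:
        raise ValueError("need at least one value and n > 0")
    return max(ordered[:limit])           # MAX(val) over the LIMITed rows


print(percentile_by_limit([0, 1, 3, 7, 9, 10, 11]))       # median
print(percentile_by_limit(list(range(1, 9)), n=0.75))     # 75th percentile
```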

2
votes

val is your time column; x and y are two references (aliases) to the same data table (you can write data AS x, data AS y).
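To see what the two aliases do, here is a rough Python sketch that mimics the question's query pair by pair. It is only an illustration, not MySQL itself: the names sign and self_join_median are made up, and it assumes that MySQL's / operator gives a fractional result such as 3.5 rather than truncating.

```python
from collections import defaultdict
from fractions import Fraction


def sign(n):
    """Mimic SQL SIGN(): returns -1, 0 or 1."""
    return (n > 0) - (n < 0)


def self_join_median(values):
    """Hypothetical stand-in for the cross join of data x with data y."""
    sums = defaultdict(int)    # SUM(SIGN(1 - SIGN(y.val - x.val))) per x.val
    counts = defaultdict(int)  # COUNT(*) per x.val
    for x_val in values:       # every row of alias x ...
        for y_val in values:   # ... is paired with every row of alias y
            # The SIGN expression is 1 when y_val <= x_val, otherwise 0.
            sums[x_val] += sign(1 - sign(y_val - x_val))
            counts[x_val] += 1
    # HAVING: keep each x.val whose sum equals (COUNT(*) + 1) / 2 exactly.
    return [v for v in sums if sums[v] == Fraction(counts[v] + 1, 2)]


print(self_join_median([0, 1, 3, 7, 9, 10, 11]))
```

With seven distinct values the sketch keeps only 7. With an even number of values, or when the median value is duplicated, the HAVING condition is never met and the result is empty, which matches the comment under the question.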

EDIT: To avoid computing your sums twice, you can store the intermediate results.

CREATE TEMPORARY TABLE average_user_total_time
    (SELECT SUM(time) AS time_taken
     FROM scores
     WHERE created_at >= '2010-10-10'
       AND created_at <= '2010-11-11'
     GROUP BY user_id);

Then you can compute the median over these values, which are now in a named table.

EDIT: A temporary table won't work here. You could try using a regular table with the "MEMORY" table type, or simply write the subquery that computes the values twice in your median query. Apart from this, I don't see another solution. That doesn't mean there isn't a better way; maybe somebody else will come up with an idea.

1
votes

First try to understand what the median is: it is the middle value in the sorted list of values.

Once you understand that, the approach is two steps:

  1. sort the values in either order
  2. pick the middle value (if the number of values is even, pick the average of the two middle values)

Example:

Median of 0 1 3 7 9 10: 5 (because (7+3)/2=5)
Median of 0 1 3 7 9 10 11: 7 (because 7 is the middle value)
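The two steps can be sketched in a few lines of Python. This is only an illustration of the definition (the function name median is my own choice), using the same numbers as the example above.

```python
def median(values):
    """Sort the values, then pick the middle one (or average the two middle ones)."""
    ordered = sorted(values)          # step 1: sort
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:                    # odd count: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average of two


print(median([0, 1, 3, 7, 9, 10]))
print(median([0, 1, 3, 7, 9, 10, 11]))
```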

So, to take the median of dates you need a numerical value (the two middle values may have to be averaged); you can convert them to timestamps (seconds elapsed since the epoch) and apply the definition of median.
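As a small illustration of the timestamp idea, the following Python sketch converts datetimes to seconds since the epoch, takes the median, and converts back. The function name median_datetime and the sample dates are hypothetical, and the sketch assumes timezone-aware UTC values.

```python
from datetime import datetime, timezone
from statistics import median


def median_datetime(moments):
    """Median of timezone-aware datetimes via seconds since the epoch."""
    seconds = [m.timestamp() for m in moments]   # dates -> numbers
    return datetime.fromtimestamp(median(seconds), tz=timezone.utc)


# Two sample dates: the median is the midpoint between them.
sample = [
    datetime(2010, 10, 10, tzinfo=timezone.utc),
    datetime(2010, 10, 12, tzinfo=timezone.utc),
]
print(median_datetime(sample))
```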

1
votes

Finding the median in MySQL using GROUP_CONCAT

Query:

SELECT
    IF(count % 2 = 1,
       SUBSTRING_INDEX(SUBSTRING_INDEX(data_str, ',', pos), ',', -1),
       (SUBSTRING_INDEX(SUBSTRING_INDEX(data_str, ',', pos), ',', -1)
         + SUBSTRING_INDEX(SUBSTRING_INDEX(data_str, ',', pos + 1), ',', -1)) / 2)
    AS median
FROM (SELECT GROUP_CONCAT(val ORDER BY val) AS data_str,
             CEILING(COUNT(*) / 2) AS pos,
             COUNT(*) AS count
      FROM data) temp;

Explanation:

Sorting is done with ORDER BY inside the GROUP_CONCAT function.

The position (pos) and the total number of elements (count) are computed. Using CEILING for the position lets us use SUBSTRING_INDEX in the steps below.

Based on count, we determine whether the number of values is even or odd.

  • Odd count: Directly choose the element at position pos using SUBSTRING_INDEX.
  • Even count: Find the elements at positions pos and pos+1, then add them and divide by 2 to get the median.

Finally the median is returned.
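The string handling in the steps above can be mimicked in Python. This is a loose sketch, not MySQL: substring_index is a simplified stand-in for the MySQL function of the same name, group_concat_median is a made-up name, and a non-empty list of numbers is assumed.

```python
import math


def substring_index(text, delim, count):
    """Simplified stand-in for MySQL SUBSTRING_INDEX (non-zero count assumed)."""
    parts = text.split(delim)
    kept = parts[:count] if count > 0 else parts[count:]
    return delim.join(kept)


def group_concat_median(values):
    # GROUP_CONCAT(val ORDER BY val): sorted values joined by commas.
    data_str = ",".join(str(v) for v in sorted(values))
    count = len(values)
    pos = math.ceil(count / 2)

    def element(k):
        # Keep the first k items, then take the last one of those.
        return float(substring_index(substring_index(data_str, ",", k), ",", -1))

    if count % 2 == 1:
        return element(pos)                         # odd count: item at pos
    return (element(pos) + element(pos + 1)) / 2    # even count: average


print(group_concat_median([0, 1, 3, 7, 9, 10]))
print(group_concat_median([0, 1, 3, 7, 9, 10, 11]))
```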

1
votes

If you have a table R with a column named A, and you want the median of A, you can do as follows:

SELECT A FROM R R1
WHERE ( SELECT COUNT(A) FROM R R2 WHERE R2.A < R1.A ) = ( SELECT COUNT(A) FROM R R3 WHERE R3.A > R1.A )

Note: This only works if A contains no duplicate values and no NULLs. With an even number of distinct values it returns no row, because no value then has equally many values below and above it.
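The condition in the WHERE clause can be illustrated with a short Python sketch: a value is kept when as many values lie below it as above it. The function name balanced_values is hypothetical, and the lists stand in for column A.

```python
def balanced_values(column):
    """Keep each value with equally many smaller and larger values in the column."""
    result = []
    for a in column:                                  # row of R1
        smaller = sum(1 for b in column if b < a)     # COUNT over R2.A < R1.A
        larger = sum(1 for b in column if b > a)      # COUNT over R3.A > R1.A
        if smaller == larger:
            result.append(a)
    return result


print(balanced_values([0, 1, 3, 7, 9, 10, 11]))
print(balanced_values([0, 1, 3, 7, 9, 10]))
```

The first call keeps only the middle value; the second, with an even number of distinct values, keeps nothing.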

1
votes

The simplest way my friend and I have found. Enjoy!

SELECT COUNT(*) INTO @c FROM station;
SELECT ROUND((@c + 1) / 2) INTO @final;
SELECT ROUND(lat_n, 4) FROM station a WHERE @final - 1 = (SELECT COUNT(lat_n) FROM station b WHERE b.lat_n > a.lat_n);

0
votes

Here is a solution that is easy to understand. Just replace Your_Column and Your_Table with your own column and table names.

SET @r = 0;

SELECT AVG(Your_Column)
FROM (SELECT (@r := @r + 1) AS r, Your_Column FROM Your_Table ORDER BY Your_Column) Temp
WHERE
    r = (SELECT CEIL(COUNT(*) / 2) FROM Your_Table) OR
    r = (SELECT FLOOR((COUNT(*) / 2) + 1) FROM Your_Table)
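The row-numbering idea can be sketched in Python as follows. This is an illustration only: median_by_row_number is a made-up name, a non-empty list is assumed, and enumerate plays the role of the @r counter.

```python
import math


def median_by_row_number(values):
    """Number the sorted rows from 1, then average the row(s) in the middle."""
    numbered = list(enumerate(sorted(values), start=1))  # (r, value) pairs
    n = len(numbered)
    lower = math.ceil(n / 2)        # CEIL(COUNT(*) / 2)
    upper = math.floor(n / 2 + 1)   # FLOOR((COUNT(*) / 2) + 1)
    # For an odd count lower == upper, so a single row is picked.
    picked = [value for r, value in numbered if r == lower or r == upper]
    return sum(picked) / len(picked)  # AVG(Your_Column)


print(median_by_row_number([0, 1, 3, 7, 9, 10]))
print(median_by_row_number([0, 1, 3, 7, 9, 10, 11]))
```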

Originally adapted from this thread.