0
votes

Here is an example snippet of my dataset. I have multiple data points for each month in each year , over a number of years. I want to get the latest row in each month in each year

Ex:

     data       value
   2019-01-1     10
   2019-01-7     20
   2019-01-28    30
                 
   2019-03-4     50
   2019-03-17    40 
   2019-03-28    20

   2020-04-16    80 
   2020-04-18    60
   2020-04-30    40

   2020-05-16    90 
   2020-05-18    10
   2020-05-30    80

Expected output

data         value
2019-01-28    30
2019-03-28    20
2020-04-30    40
2020-05-30    80

I would like to achieve this with flask-sqlalchemy. I'm thinking I'll have to group by month maybe? I'm quite new to sql queries. Any help would be appreciated!

Edit : I've tried this :

GROUP BY EXTRACT(year FROM date),
EXTRACT(month FROM date) 

but this gives me the first row of each month , I want the last row, i.e the latest row

2
I'll have to group by month maybe? By year and month.Akina
Thank you ! I tried this GROUP BY EXTRACT(year FROM date), EXTRACT(month FROM date) HAVING MAX(date) ORDER BY EXTRACT(YEAR FROM date) but I'm not able to get the last row of the month(Which is the required row). I'm getting the first row of the month.riddhi athreya

2 Answers

1
votes

The solution for MySQL.

SELECT t1.*
FROM table AS t1
JOIN ( SELECT MAX(data) AS data
       FROM table t2
       GROUP BY DATE_FORMAT(data, '%Y%m') ) t3 USING (data)

The query assimes that data column have no duplicates (is defined as UNIQUE).


GROUP BY expression may be GROUP BY EXTRACT(year FROM data), EXTRACT(month FROM data) as you have tried... or even GROUP BY data DIV 100. It doesn't matter.

0
votes

One method uses window functions:

select t.*
from (select t.*,
             row_number() over (partition by year(data), month(data) order by data desc) as seqnum
      from t
     ) t
where seqnum = 1;

You can also use a correlated subquery:

select t.*
from t
where t.data = (select max(t2.data)
                from t t2
                where year(t2.data) = year(t.data) and
                      month(t2.data) = month(t.data)
               );

Often, the correlated subquery performs better. In this case, though, it cannot (easily) be optimized using indexes, so I suspect that window functions are faster.