MySQL: How do I pad out table to fill missing rows with existing data

Question

I've got a large set of quarterly financial data (about 1 GB as a CSV file) that I need to pad out to monthly data. Each row has a company identifier and a date stamp, but different companies have different reporting dates (March, June, September, December vs. February, May, August, November).

Table: Source

Co.   |Date      |NPAT   |Debt
A     |31-Dec-09 |123    |4,000
B     |29-Feb-10 |12     |300
A     |31-Mar-10 |200    |4,500
B     |31-May-10 |11     |200
A     |30-Jun-10 |159    |4,300
C     |30-Jun-10 |-30    |4

In the example, company A reports in March, June, September and December, so I need March's figures copied to April and May, June's to July and August, September's to October and November, and December's to January and February. For company B the reporting periods are February, May, August and November.

Using the example above what I need is:

Table: Destination

Co.   |Date      |NPAT   |Debt
A     |31-Dec-09 |123    |4,000
A     |31-Jan-10 |123    |4,000
A     |29-Feb-10 |123    |4,000
B     |29-Feb-10 |12     |300
A     |31-Mar-10 |200    |4,500
B     |31-Mar-10 |12     |300
A     |30-Apr-10 |200    |4,500
B     |30-Apr-10 |12     |300
A     |31-May-10 |200    |4,500
....

I've created a padded table using an inner join, resulting in a unique list of all the companies and dates, so I'm effectively starting from an empty table containing a full list of company and date combinations. However, I'm struggling to see where to go from there.

I'm using MySQL and R for this project, so I'm happy with a solution or suggestion in either. Given the volume of data, I'm looking for a fairly efficient implementation.

The following challenges exist:
1. The companies don't exist for the entire time period, so I don't want to copy the final period's result forward indefinitely (at most for 2 months). Similarly, there will be companies without data in earlier periods.
2. Not only may the reporting periods differ, they may also change: a company may initially report on a March calendar but then switch to February or January. So before copying, there needs to be a check on whether data for that month already exists.

Thanks for your help.

To clarify: one company reports in March, June, September, December, another one in February, May, August, November, and you want March to be paired with May, not February, right? And February should be copied to December (or the other way around)? What month is considered the first month then? — Andriy M
@AndriyM: I think he doesn't want any pairing. Just filling up of "missing" rows, per company. I guess it could be handled with a calendar table. — ypercubeᵀᴹ
@AndriyM The final result I need is to have results recorded against every month for every company. So any month where no results are reported (April and May in the case of 'A') would need the most recent results (March) copied into its row. The company then reports in June, so those rows aren't changed, but come July and August there are no results, so June's results are copied in for July and August. — getting-there
@ypercube: Yeah, thanks for that. I started the example in 2012 and forgot to change the 29th back when I changed it to 2010. — getting-there

Vincent Zoonekynd · Accepted Answer · 2012-04-08T10:01:14

The easiest approach is to copy the data forward for the next two months, but there will be problems if a company changes its reporting dates.

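-- MySQL sketch; QuarterlyData, Id, Value1 and Value2 are placeholder names.
-- LAST_DAY() keeps the copies on month ends (e.g. 30-Jun becomes 31-Jul, not 30-Jul);
-- this assumes every reporting date is itself a month end.
CREATE VIEW Tmp1 AS
SELECT Id, 
       Date AS Reported_Date, 
       Date, 
       Value1, Value2 
FROM QuarterlyData
UNION
SELECT Id, 
       Date AS Reported_Date, 
       LAST_DAY(DATE_ADD(Date, INTERVAL 1 MONTH)) AS Date, -- one month after the report
       Value1, Value2 
FROM QuarterlyData
UNION
SELECT Id, 
       Date AS Reported_Date, 
       LAST_DAY(DATE_ADD(Date, INTERVAL 2 MONTH)) AS Date, -- two months after the report
       Value1, Value2 
FROM QuarterlyData;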
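
The date arithmetic is the fiddly part of this first approach, because month ends do not share a day number. As a minimal sketch of the intended behaviour, written in Python rather than SQL so it can be run on its own (the helper names `month_end_after` and `pad_two_months` are hypothetical, and it assumes every reporting date is a month end):

```python
import calendar
from datetime import date


def month_end_after(d: date, months: int) -> date:
    """Return the last day of the month that lies `months` months after d."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, last_day)


def pad_two_months(rows):
    """Yield (id, reported_date, date, value1, value2) for each report
    and for the two following month ends."""
    for cid, reported, v1, v2 in rows:
        yield (cid, reported, reported, v1, v2)
        for offset in (1, 2):
            yield (cid, reported, month_end_after(reported, offset), v1, v2)


# Company A's December report from the question.
quarterly = [("A", date(2009, 12, 31), 123, 4000)]
padded = list(pad_two_months(quarterly))
for row in padded:
    print(row)
```

In MySQL the same effect can be had by wrapping `DATE_ADD(Date, INTERVAL n MONTH)` in `LAST_DAY()`. Like the view above, this sketch does not check whether the company already reported in one of the copied months.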

The following should be safer (and it would also work for daily data). If you have all the desired dates in a table, first join it with the quarterly data (I keep the data for six months, because I do not know what happens when the reporting date changes: could we end up with a quarter with more than 3 months?).

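-- MySQL sketch; QuarterlyData (the quarterly rows) and Dates (one row per
-- desired date) are placeholder names.
CREATE VIEW Tmp2 AS 
SELECT A.Id, 
       A.Date AS Reported_Date, 
       B.Date AS Date,
       A.Value1, A.Value2
FROM   QuarterlyData A, Dates B
-- Each report applies from its own date up to (but excluding) six months later.
WHERE  A.Date <= B.Date 
AND    B.Date < DATE_ADD(A.Date, INTERVAL 6 MONTH);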

Then remove the duplicates by keeping, for each company and date, only the most recent report.

CREATE VIEW Tmp_Dates_To_Keep AS
SELECT Id, Date, MAX(Reported_Date) AS Reported_Date 
FROM Tmp2
GROUP BY Id, Date;

SELECT A.Id, 
       A.Date, 
       A.Reported_Date, 
       A.Value1, A.Value2
FROM   Tmp2 A, Tmp_Dates_To_Keep B
WHERE  A.Id   = B.Id 
AND    A.Date = B.Date 
AND    A.Reported_Date = B.Reported_Date;
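
To see the calendar-table approach end to end, here is a small runnable sketch using Python's built-in `sqlite3` module with the sample rows from the question. It is an illustration under assumptions, not the answer's exact code: SQLite stands in for MySQL, so `date(..., 'start of month', '+N months')` replaces MySQL's `DATE_ADD`; the join is written so that each report is carried forward (never backward); the table and column names are placeholders; and B's first report is dated 2010-02-28, since the asker noted that the 29th was a leftover from a 2012 example.

```python
import sqlite3

WINDOW_MONTHS = 6  # as in the answer; 3 would enforce the "at most 2 months" rule

conn = sqlite3.connect(":memory:")
conn.executescript(f"""
CREATE TABLE QuarterlyData (Id TEXT, Date TEXT, Value1 INTEGER, Value2 INTEGER);
CREATE TABLE Dates (Date TEXT);

INSERT INTO QuarterlyData VALUES
  ('A', '2009-12-31', 123, 4000),
  ('B', '2010-02-28', 12, 300),
  ('A', '2010-03-31', 200, 4500),
  ('B', '2010-05-31', 11, 200),
  ('A', '2010-06-30', 159, 4300),
  ('C', '2010-06-30', -30, 4);

INSERT INTO Dates VALUES
  ('2009-12-31'), ('2010-01-31'), ('2010-02-28'), ('2010-03-31'),
  ('2010-04-30'), ('2010-05-31'), ('2010-06-30'), ('2010-07-31'),
  ('2010-08-31');

-- Every report paired with each calendar date it may cover.
CREATE VIEW Tmp2 AS
SELECT A.Id, A.Date AS Reported_Date, B.Date AS Date, A.Value1, A.Value2
FROM QuarterlyData A
JOIN Dates B
  ON A.Date <= B.Date
 AND B.Date < date(A.Date, 'start of month', '+{WINDOW_MONTHS} months');

-- For each company and date, the most recent report that covers it.
CREATE VIEW Tmp_Dates_To_Keep AS
SELECT Id, Date, MAX(Reported_Date) AS Reported_Date
FROM Tmp2
GROUP BY Id, Date;
""")

rows = conn.execute("""
SELECT A.Id, A.Date, A.Reported_Date, A.Value1, A.Value2
FROM Tmp2 A
JOIN Tmp_Dates_To_Keep B
  ON A.Id = B.Id AND A.Date = B.Date AND A.Reported_Date = B.Reported_Date
ORDER BY A.Date, A.Id
""").fetchall()

for row in rows:
    print(row)
```

Because the December and March reports for A both cover April under a six-month window, the `MAX(Reported_Date)` step is what picks the March figures for 2010-04-30. Company C gets no rows before its first report in June.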
