I've got a large (1GB in cvs file) set of quarterly financial data that I need to pad out to monthly data. Each row has a company identifier and date stamp, but different companies have different reporting dates (Mar, June, September, December vs February, May, August, November).
Table: Source
Co. |Date |NPAT |Debt
A |31-Dec-09 |123 |4,000
B |29-Feb-10 |12 |300
A |31-Mar-10 |200 |4,500
B |31-May-10 |11 |200
A |30-Jun-10 |159 |4,300
C |30-Jun-10 |-30 |4
In the example company A reports in March, June, September and December so I need March figures copied to April and May, June copies to July and August, September to October and November and December's figures copied to January and February. For company B the reporting periods are Feb, May, Aug and Nov.
Using the example above what I need is:
Table: Destination
Co. |Date |NPAT |Debt
A |31-Dec-09 |123 |4,000
A |31-Jan-10 |123 |4,000
A |29-Feb-10 |123 |4,000
B |29-Feb-10 |12 |300
A |31-Mar-10 |200 |4,500
B |31-Mar-10 |12 |300
A |30-Apr-10 |200 |4,500
B |30-Apr-10 |12 |300
A |31-May-10 |200 |4,500
....
I've created a padded table using an inner join resulting in a unique list of all the companies and dates so I'm effectively staring from an empty table containing a full list of company and date combinations. However I'm struggling as to where to start from there.
I'm using mysql and R for this project to I'm happy for a solution/suggest in either. Given the volume of data I'm looking for a fairly efficient implementation.
The following challenges exist: 1-the companies don't exist for the entire time period so I don't want to copy the final period's result forward indefinitely (at most for 2 months). Similarly there will be companies without data in earlier periods. 2-not only may the reporting periods differ, but they may also change, so a company may initially be reporting on a March calendar, but then change to February or January, so before copying there needs to be a check on whether that data already exists.
Thanks for your help.