3
votes

Matlab Dates mismatch between two sets of data. Help!

Please forgive the simplicity of the question but its my first day;

I am working with two sets of time series: 1) Price of S&P 500 since 1977 (daily close & date) and 2) Bond Yield since 1977 (daily close & date).

The problem is that after a few months the the dates are no longer aligned with each other (maybe the bond market was closed one day and the stock market was opened etc) so I have two data sets that are no longer correctly aligned. Before I even start asking about how to replace the gaps (I will use averages when I get to that bridge), I need to know how to get matlab to align the dates for both securities so that I at least know where the gaps are are for each security i.e at which date is one security missing the price. I was thinking of creating a my own (or using the dates from one of the securities) calendar column and then use that as the base date column to stardedise the final output and match the prices to the new data... maybe this is the wrong way of thinking about it but any help would be much appreciated :)

2

2 Answers

9
votes

Basically you want to perform a full outer merge on two datasets based on the date as key.

Consider the following as example:

%# vector of dates (serial datetime)
days = datenum( num2str((1:31)','2011-10-%02d') );   %'# one month (October 2011)

%# lets build two datasets similar to what you described
idx1 = rand(size(days)) > 0.2;                %# randomly pick dates for 1st
M1 = [days(idx1) rand(sum(idx1),2)*1000];     %# sotcks: days,opening,closing
idx2 = rand(size(days)) > 0.5;                %# randomly pick dates for 2nd
M2 = [days(idx2) rand(sum(idx2),2)*1000];     %# bonds: days,opening,closing

%# get the full range of dates, and convert them to indices starting at 1
[allDays,~,ind] = unique( [M1(:,1);M2(:,1)] );
indM1 = ind(1:size(M1,1));
indM2 = ind(size(M1,1)+1:end);

%# merge the two datasets (days,opening,closing,opening,closing)
M = nan(numel(allDays),size(M1,2)+size(M2,2)-1);
M(:,1) = allDays;                   %# available days from both
M(indM1,2:3) = M1(:,2:3);           %# insert 1st dataset values
M(indM2,4:5) = M2(:,2:3);           %# insert 2nd dataset values

%# final merged dataset formatted
C = [cellstr(datestr(M(:,1),'yyyy-mm-dd')) num2cell(M(:,2:end))]

The result:

C = 
    '2011-10-01'    [     NaN]    [     NaN]    [332.5714]    [241.5017]
    '2011-10-03'    [941.9189]    [ 86.8151]    [     NaN]    [     NaN]
    '2011-10-04'    [655.9138]    [429.3973]    [     NaN]    [     NaN]
    '2011-10-05'    [451.9457]    [257.2828]    [853.0636]    [243.1452]
    '2011-10-06'    [839.6974]    [297.5554]    [     NaN]    [     NaN]
    '2011-10-07'    [532.6235]    [424.8584]    [     NaN]    [     NaN]
    '2011-10-09'    [553.8871]    [119.2073]    [     NaN]    [     NaN]
    '2011-10-11'    [680.0655]    [495.0669]    [442.3979]    [154.1594]
    '2011-10-13'    [367.1899]    [706.4072]    [904.3555]    [956.4164]
    '2011-10-14'    [     NaN]    [     NaN]    [ 33.1794]    [935.6614]
    '2011-10-15'    [239.2906]    [243.5734]    [     NaN]    [     NaN]
    '2011-10-16'    [578.9235]    [785.0701]    [532.4265]    [818.7144]
    '2011-10-17'    [866.8871]    [ 74.0896]    [716.4973]    [728.2618]
    '2011-10-18'    [406.7768]    [393.8834]    [179.3018]    [175.8117]
    '2011-10-19'    [112.6151]    [  3.3941]    [336.5329]    [360.3710]
    '2011-10-20'    [443.8458]    [220.6769]    [     NaN]    [     NaN]
    '2011-10-21'    [     NaN]    [     NaN]    [187.7129]    [188.7900]
    '2011-10-22'    [300.1844]    [  1.3006]    [     NaN]    [     NaN]
    '2011-10-23'    [401.3869]    [189.1797]    [     NaN]    [     NaN]
    '2011-10-24'    [833.3636]    [142.4841]    [321.9272]    [  1.1984]
    '2011-10-25'    [     NaN]    [     NaN]    [403.8567]    [316.4195]
    '2011-10-26'    [403.6287]    [268.0760]    [     NaN]    [     NaN]
    '2011-10-27'    [390.1759]    [174.8921]    [     NaN]    [     NaN]
    '2011-10-28'    [     NaN]    [     NaN]    [548.5663]    [699.6170]
    '2011-10-29'    [360.4489]    [138.6490]    [ 48.7386]    [625.2552]
    '2011-10-30'    [140.2554]    [598.8856]    [552.7321]    [543.0622]
    '2011-10-31'    [260.1302]    [901.0579]    [274.8114]    [439.0372]

The merged result contains opening/closing prices from both datasets. When one of them is not available on a specific date, it is replaced by NaN. Note how there are some unrepresented days in the result, this is because both datasets did not list prices on those days.


Alternatively, you could look into the dataset class from the Statistics Toolbox (which is designed for such cases). Using the same example:

%# build dataset object for the two sets
varNames1 = {'days' 'stock_open' 'stock_close'};
varNames2 = {'days' 'bond_open' 'bond_close'};
d1 = dataset([M1, varNames1]);
d2 = dataset([M2, varNames2]);

%# join on days (full-outer join)
d = join(d1,d2, 'keys','days', 'type','fullouter', 'MergeKeys',true);
d.days = datestr(d.days,'yyyy-mm-dd');   %# format the days column as string

The result:

d = 
    days          stock_open    stock_close    bond_open    bond_close
    2011-10-01       NaN           NaN         332.57        241.5    
    2011-10-03    941.92        86.815            NaN          NaN    
    2011-10-04    655.91         429.4            NaN          NaN    
    2011-10-05    451.95        257.28         853.06       243.15    
    2011-10-06     839.7        297.56            NaN          NaN    
    2011-10-07    532.62        424.86            NaN          NaN    
    2011-10-09    553.89        119.21            NaN          NaN    
    2011-10-11    680.07        495.07          442.4       154.16    
    2011-10-13    367.19        706.41         904.36       956.42    
    2011-10-14       NaN           NaN         33.179       935.66    
    2011-10-15    239.29        243.57            NaN          NaN    
    2011-10-16    578.92        785.07         532.43       818.71    
    2011-10-17    866.89         74.09          716.5       728.26    
    2011-10-18    406.78        393.88          179.3       175.81    
    2011-10-19    112.62        3.3941         336.53       360.37    
    2011-10-20    443.85        220.68            NaN          NaN    
    2011-10-21       NaN           NaN         187.71       188.79    
    2011-10-22    300.18        1.3006            NaN          NaN    
    2011-10-23    401.39        189.18            NaN          NaN    
    2011-10-24    833.36        142.48         321.93       1.1984    
    2011-10-25       NaN           NaN         403.86       316.42    
    2011-10-26    403.63        268.08            NaN          NaN    
    2011-10-27    390.18        174.89            NaN          NaN    
    2011-10-28       NaN           NaN         548.57       699.62    
    2011-10-29    360.45        138.65         48.739       625.26    
    2011-10-30    140.26        598.89         552.73       543.06    
    2011-10-31    260.13        901.06         274.81       439.04   

EDIT:

Say you had the following two files containing the data:

bonds.csv

10/6/1977 7.72 7.72
10/7/1977 7.73 7.73
10/11/1977 7.77 7.77
10/12/1977 7.79 7.79
10/13/1977 7.79 7.79
10/14/1977 7.79 7.79
10/17/1977 7.79 7.79
10/18/1977 7.8 7.8

stocks.csv

10/06/77 95.68 96.05
10/07/77 96.05 95.97
10/10/77 95.97 95.75
10/11/77 95.75 94.93
10/12/77 94.82 94.04
10/13/77 94.04 93.46
10/14/77 93.46 93.56
10/17/77 93.56 93.47

You can read the data using the TEXTSCAN function:

%# read bonds data
fid = fopen('bonds.csv','rt');
C = textscan(fid, '%s %f %f', 'Delimiter',' ', 'CollectOutput',true);
fclose(fid);
M1 = [datenum(C{1},'mm/dd/yyyy') C{2}];

%# read stocks data
fid = fopen('stocks.csv','rt');
C = textscan(fid, '%s %f %f', 'Delimiter',' ', 'CollectOutput',true);
fclose(fid);
M2 = [datenum(C{1},'mm/dd/yy') C{2}];

Now you can use the same code above (starting at "get the full range of dates...", or use the DATASET class). After joining them, this gives me:

C = 
    '1977-10-06'    [7.72]    [7.72]    [95.68]    [96.05]
    '1977-10-07'    [7.73]    [7.73]    [96.05]    [95.97]
    '1977-10-10'    [ NaN]    [ NaN]    [95.97]    [95.75]
    '1977-10-11'    [7.77]    [7.77]    [95.75]    [94.93]
    '1977-10-12'    [7.79]    [7.79]    [94.82]    [94.04]
    '1977-10-13'    [7.79]    [7.79]    [94.04]    [93.46]
    '1977-10-14'    [7.79]    [7.79]    [93.46]    [93.56]
    '1977-10-17'    [7.79]    [7.79]    [93.56]    [93.47]
    '1977-10-18'    [ 7.8]    [ 7.8]    [  NaN]    [  NaN]
3
votes

You might have a problem if you just use the dates from either one of the series because each of them might have dates that are missing in the other. What I would do is start with a clean 3-column matrix that contains all the weekdays in the date range. This post on the Mathworks blog can offer some insight on how to do it. Then fill in the two other columns with the values from your two data series. This way you will be sure all your values are in the matrix and will make your life simpler if you decide to add more data.

As for filling in the missing dates, you can just use : the 1-D interpolate function