Efficient way of finding number of links between observations

Question

There is a data set of the following structure:

firm_ID    partner_ID    start_date    end_date
1          2             01jan2001     15mar2001
1          3             15feb2001     30apr2001
2          4             20mar2001     14may2001

The data set shows the duration of partnership of firm_ID with partner_ID. Here, firm_ID and partner_ID (both integers) refer to a unique identifier of the firms .

Using Stata, I want to create a summary table, which shows for each unique firm_ID the number of its partnerships, by month and year. If the partnership existed for at least one day in a given month, it should be counted. If firm 1 reports a partnership with firm 2 but firm 2 doesn't report a partnership with firm 1, then it should only be counted for firm 1 and not firm 2.

For the example above, the summary table is:

firm_ID    01/2001    02/2001    03/2001    04/2001    05/2001
1          1          2          2          1          0
2          0          0          1          1          1

The variable firm_ID has 8,000 unique observations and the time span is 01/2001 - 12/2013. Thus, the resulting summary table should have the number of rows equal to 8,000 and the number of columns equal to 12*13=156 (number of months during 01/2001 - 12/2013).

Conceptually, what is the algorithm of solving this task using Stata? Thank you.

Nick Cox Nick Cox · Accepted Answer · 2017-02-22T11:18:53

You need to work with monthly dates. Note in passing that daily dates presented as you gave them need reverse engineering to be part of calculations. Type ssc inst dataex for a command to generate reproducible Stata data examples for public forums.

tabulate works for your toy example; for your real example, you will need a different tabulation to cope with many more rows and columns: see for example groups from SSC.

clear 
input firm_ID    partner_ID   str9 (s_start_date   s_end_date) 
1          2             01jan2001     15mar2001
1          3             15feb2001     30apr2001
2          4             20mar2001     14may2001   
end 

foreach v in start end { 
    gen `v'_date = daily(s_`v'_date, "DMY") 
    gen `v'_month = mofd(`v'_date) 
    format `v'_date %td 
    format `v'_month %tm 
} 

gen duration = end_month - start_month + 1 
expand duration 
bysort firm_ID partner_ID : gen month = start_month + _n - 1 
format month %tm 
tab firm_ID month 

           |                         month
   firm_ID |    2001m1     2001m2     2001m3     2001m4     2001m5 |     Total
-----------+-------------------------------------------------------+----------
         1 |         1          2          2          1          0 |         6 
         2 |         0          0          1          1          1 |         3 
-----------+-------------------------------------------------------+----------
     Total |         1          2          3          2          1 |         9 


groups firm_ID month, fillin show(f) sepby(firm_ID)

  +--------------------------+
  | firm_ID    month   Freq. |
  |--------------------------|
  |       1   2001m1       1 |
  |       1   2001m2       2 |
  |       1   2001m3       2 |
  |       1   2001m4       1 |
  |       1   2001m5       0 |
  |--------------------------|
  |       2   2001m1       0 |
  |       2   2001m2       0 |
  |       2   2001m3       1 |
  |       2   2001m4       1 |
  |       2   2001m5       1 |
  +--------------------------+

Efficient way of finding number of links between observations

1 Answers