0
votes

I have the following dataset in Stata

Id | date1    | date2
------------------------
1  | 28mar2005| 30jun2005
2  | 28mar2005| 31dec2005
3  | 28mar2005| 28febr2005
3  | 28mar2005| 30apr2005
3  | 28mar2005| 31dec2005

The values of the second column are the same for all observations. The year considered is the same in date1 and date2.

The day in date2 is always the last day of the relative month.

The dates in date2 are not repeated and are sorted within Id.

Within Id, I want to keep the observations of the dataset such that the "time distance" between "mm" of date1 and "mm" of date2 is the smallest. In case of indifference between n observations I want to keep the last observations displayed among the n observations. Hence, I want to get

Id | date1    | date2
------------------------
1  | 28mar2005| 30jun2005
2  | 28mar2005| 31dec2005
3  | 28mar2005| 30apr2005

My idea would be

(a) generate an "absolute value distance" between date1 and date2

(b) find the minimum absolute value distance by Id

(c) If a certain Id is listed more than once, keep the observation with the minimum absolute value distance

Any suggestion?

2

2 Answers

2
votes

%td dates are integers, where zero is 01/01/1960. This means that subtraction will give you the days between them:

clear 
input Id  str9 date1 str9 date2
1   28mar2005 30jun2005
2   28mar2005 31dec2005
3   28mar2005 28feb2005
3   28mar2005 30apr2005
3   28mar2005 31dec2005
3   27mar2005 27feb2005
end

gen DATE1 =date(date1,"DMY")
gen DATE2 =date(date2,"DMY")
format DATE? %td
drop date1 date2
rename DATE?, lower

gen diff = abs(date1-date2)
gsort Id diff -date2
bys Id: keep if _n==1

list, clean noobs

Note that I added the last row to create an instance of the "indifference" scenario (if there are ties in terms of absolute distance, keep the later date2).

1
votes

I have some doubts.

For example, why do you want 30apr2005, if 28feb2005 is closer to 28mar2005?

I'm not sure what you mean by "In case of indifference between n observations...", so I've interpreted freely. I adjusted your example data to reflect that. In particular, note that there are two observations that share the same date1 and date2 for id == 3, but different metric. I keep only one.

clear
set more off

*----- example data -----

input ///
id str15(date1 date2)
1   28mar2005 30jun2005
2   28mar2005 31dec2005
3   28mar2005 28febr2005
3   28mar2005 28febr2005
3   28mar2005 30apr2005
3   28mar2005 31dec2005
end

set seed 128345
gen metric = floor(runiform()*100)

gen date11 = date(date1, "DMY")
gen date22 = date(date2, "DMY")

format %td date??

drop date?
list, sepby(id)

*----- what you want? -----

gen diff = abs(date11 - date22)

bysort id : gen obs = _n
bysort id (diff obs) : keep if diff == diff[1] 
by id: keep if _n == _N

list, sepby(id)

You can maybe save yourself a sort in the code, but I'd leave it if it doesn't cause any noticeable speed slowdown.