Let's say I have the following data:

id  disease
1   0
1   1
1   0
2   0
2   1
3   0
4   0
4   0

I would like to remove the duplicate observations in Stata, so that one observation remains per id. For example:

id   disease
1      1
2      1
3      0
4      0

For group id=1, keep observation 2

For group id=2, keep observation 2

For group id=3, keep observation 1 (because it has only 1 obs)

For group id=4, keep observation 1 (or any of them but one obs)

I tried Stata's duplicates command:

duplicates tag id if disease==0, generate(info)
drop if info==1

but it does not do what I need.

Thanks for specifying code in an edit. If you like the answer, you can accept it and gain some reputation thereby. (Nick Cox)

1 Answer

It is no surprise that duplicates does not do what you want, as it does not fit your problem. For example, the observation with id == 2, disease == 0 is not a duplicate of any other observation. More generally, duplicates does not purport to be a general-purpose command for dropping observations you don't want.
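To see this on the example data, here is a minimal sketch that types the data in with input and tags exact duplicates on both variables. The variable name dup is just an illustrative choice. duplicates tag records, for each observation, how many other observations are identical to it on the listed variables.

```stata
clear
input id disease
1 0
1 1
1 0
2 0
2 1
3 0
4 0
4 0
end

* dup = number of other observations identical on id and disease
duplicates tag id disease, generate(dup)
list, sepby(id)
```

Both observations for id == 2 get dup == 0, because they differ on disease, so no rule based on duplicates alone can pick between them.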

Your criteria appear to be

  1. Keep one observation for each id.

  2. If an id has any observation with disease equal to 1, keep that observation.

A solution to that is

bysort id (disease) : keep if _n == _N 

That keeps the last observation for each distinct id: after sorting within id on disease, any observations with disease == 1 are necessarily at the end of each group.
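As a check, the one-liner can be run on the example data from the question. This is a sketch that assumes disease is coded 0 or 1 with no missing values. Stata sorts numeric missing values after all nonmissing values, so a missing disease would end up last in its group and be the observation kept.

```stata
clear
input id disease
1 0
1 1
1 0
2 0
2 1
3 0
4 0
4 0
end

* sort by id, then by disease within id; keep the last observation per id
bysort id (disease) : keep if _n == _N

list, noobs
```

The listing should show one observation per id, with disease equal to 1 for ids 1 and 2 and 0 for ids 3 and 4, which matches the result asked for in the question.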