1
votes

I have a dataset structured with an ID and two other variables. The id is not unique, it appears in the dataset more than 1 time (a patient could receive more than one clinical treatment).How can I drop the entire observation (the entire line) only if it is a perfect clone of a previous observation (based on the other two variable values)? I don't want to use an insanely long if statement. Thanks.

3
The specific previous observation (ie if obs 9 is a duplicate of obs 8), or any previous observation? - Joe

3 Answers

3
votes
proc sql;
select distinct * from olddata;
quit;

Sounds like an easy SQL fix. The select distinct option will remove any completely duplicate rows in a dataset if you select all columns.

0
votes

If you specifically want to identify if two consecutive lines are identical (but are not looking to match identical lines separated by other lines), you can use notsorted on a by statement and then first and last variables.

data want;
  set have;
  by id var1 var2 notsorted;
  if first.var2;
run;

That will keep the first record for any group of identical id/var1/var2, so long as they're consecutive on the dataset. Of course if you sort the dataset by id var1 var2 first this will always remove the duplicates, but not sorted this still works for removing consecutive pairs (or more) that are collocated.

0
votes

I prefer @JJFord's answer, but for the sake of completeness, this can also be done using the nodupe option in proc sort:

proc sort data=mydata nodupe;
  by id;
run;

What you choose as the by variable doesn't really matter here. The important bit is just to specify the nodupe option.