3
votes

I have a database in which some of the observations have an identifier ident, and some not. I want to create a new database in which I have dropped the observations which are duplicates of my ident variable, but to keep the observations where ident is missing.

If I simply do a proc sort nodupkey

proc sort nodupkey data=have;
    by ident;
    run;

Then it also eliminates the missing values. Is there a simple way to do that (that is not break the dataset, proc sort nodupkey one partn, then assemble it again)

1

1 Answers

4
votes

You have a couple of options when removing duplicates.

First off, dupout=<dataset> on the proc sort will send all of your duplicates to another dataset, and if you want to then do something with them you can. But this is a back-end version of your 'break the dataset', just probably faster as it only breaks the smaller part.

Simpler is to do the dedup yourself.

proc sort data=have;
  by ident;
run;

data want;
  set have;
  by ident;
  if (first.ident) or missing(ident);
run;

That keeps the first record for each ident, plus any record with ident missing.