SAS Most efficient way to eliminate duplicate

Question

For starter I know my problem is similar to This(which is the closest to my question I have found), but with some differences at the same time, hence my new post.

I have a database with an identifier and declarations. Declarations are constructed as identifier + a letter. If the idendifier is 123456, declarations would then be "123456A", "123456B" and so on

I would like to select one observation for each identifier, with the declaration that is the one with the last letter, which is of course, not always the same.

I assume I can do that with a proc sort and then another one with nodupkey :

proc sort data=have out=have2;
    by identifier declaration /descending;
run;

proc sort data=have2 out=want nodupkey;
    by declaration;
run;

but as I have a relatively important database (tens of millions observations) I would like to know the best in sense of both better suited and fastest method if it is another one. Typically, if it is possible in one step.

Thanks

If you have enough memory try Joe's hash solution in the link above which may be faster. — Reeza
I have several millions distinct identifiers. To be more precise, most identifiers have only one declaration, about 20% have two, and 1% or so have 3 or more. — Anthony Martin

criticalth criticalth · Accepted Answer · 2014-10-02T14:30:14

This looks like a quick solution. It sets only the first observation (in your case the last as you have already sorted by descending). Meaning the rest of the records will not be even loaded into the program data vector. If possible please let me know how it went. I am curious if this would be optimal. I know this to be true only in thoery. I have never tested it myself on a large dataset. 10x

data want; 
     do until ( first.identifier ) ; 
         set have; 
         by identifier ;
     end ;
run;

SAS Most efficient way to eliminate duplicate

2 Answers