2
votes

I want to eliminate duplicates from a database, based on an identifier, an order and a condition.

More precisely, I have data with several observations. I have sometimes a condition that makes me want to keep that observation anyway (let fix it condition=1), but then also keep the observation with the same identifier even if this condition does not hold (condition=0).

But if I have for one identifier several observations where condition=0 then I want to elminate duplicates, with criterion being having the greatest order.

Without the condition I can do that

proc sort data=have;
    by identifier descending order;
run;

proc sort nudopkey data=have;
    by identifier;
run;

But how to incorporate my condition in this ?

Edit 1 : add a database example :

data Test; 
   input identifier $ order condition; 
   datalines;
1023 1 0
1023 2 0
1064 2 0
1064 1 0
1098 1 0
1098 1 1
;          

Then I want to keep

  • 1023 2 0
  • 1064 2 0
  • 1098 1 0
  • 1098 1 1

Edit 2 : tried to precise my conditions

2

2 Answers

2
votes

I presume you want to eliminate duplicates only when the condition for all records for an identifier is set to 0. In that case you want to keep the record with the maximum order and eliminate all other records with the same identifier.

Proc sql; 
         create  table   want    as 
         select  * 
         from    test 
         group   by      identifier 
         having  max (condition) ne      0 
         or      order           eq      max (order) 
         ; 
Quit; 

This will keep all rows for an Identifier where the maximum condition = 1, or in the case of those where maximum condition = 0, select the row with the maximum order.

Is that what you want?

1
votes

Some of this depends on how you define 'condition'. Is your condition easily verifiable on every record for that identifier? Then you can do something like this.

  1. Evaluate the condition.
  2. For records where it is true (you want to remove the duplicate), set flag=0. For records where it is not true, increment the condition flag by one.

If the condition is true for all records in that ID, all will have the same value (flag=0) and nodupkey on by identifier flag; will remove extras. If the condition is false for all records, those will not be removed. If it's true for some and false for some, and you want to remove only some of the records with that identifier (only the duplicates where it is true), then you have to make sure that either it's sorted to have all of the condition=true records at top, or have a separate flag counter that determines what value the flag will be (since it sometimes will go to 0 in the middle, so 0 0 0 1 2 3 0 4 5 6 is what you want, not 0 0 0 1 2 3 0 0 1 2 ).

Perhaps easier to see is to do it within a datastep. After sorting by identifier descending order:

data want;
  set have;
  by identifier descending order;
  if (condition=true) and not (first.identifier) then delete;
run;

This will, again, work if either condition=true is always at the top, or if it's always consistent within one ID group. If it's inconsistent and mixed, then you need to keep track of whether you've kept one where it was true (assuming you want to), or it might delete all records where it is true; use a separate variable to keep track of how many you've kept. first.identifier will be 1/TRUE for the first record for that identifier only, not taking into account the condition. You could also create the flag, then sort by identifier flag descending order; and guarantee the condition=true are at the top (either by making flag=0 for true, or sorting by descending flag.)