0 votes

I need to delete duplicates from a data set. My issue is that once I sort the data and flag the duplicates (using the LAG function), some of the information across variables is present in the duplicate observation and some in the original observation. I need to retain the information from all variables while also deleting the duplicates.

My thought was to first fill in all the information between both the original and duplicate before deleting the duplicate.

Example of observations after sorting data and flagging duplicates (fake data values):

Province  AGE  BRTHYEAR   Trans_id  Morb_id    VarX   flag_duplicate
AB        36    1980       45654      .         .           0
AB        36    1980         .        .        2135         1
ON        26    1990         .        .        8868         0
ON        26    1990         .      35464      8868         1

What I want:

Province  AGE   BRTHYEAR   Trans_id  Morb_id    VarX   flag_duplicate
AB        36     1980       45654      .        2135         0
AB        36     1980       45654      .        2135         1
ON        26     1990         .      35464      8868         0
ON        26     1990         .      35464      8868         1

So I can delete duplicates and eventually have this:

Province   AGE  BRTHYEAR   Trans_id  Morb_id    VarX   flag_duplicate
AB         36     1980       45654      .        2135         0
ON         26     1990         .      35464      8868         0

I created lag and lead variables to try to fill in the information, but it only seems to work on part of the data set.

Here is the code for the lead variables:

data uncleaned_data;
merge uncleaned_data
      uncleaned_data(firstobs=2
                     keep=TRANS_ID MORB_ID Varx
                     rename=(TRANS_ID=lead_TRANS_ID MORB_ID=lead_MORB_ID Varx=lead_Varx));

if lag(flag_duplicate=1) then do;
    if TRANS_ID=. then do;
        TRANS_ID=lead_TRANS_ID;
    end;
    if MORB_ID=. then do;
        MORB_ID=lead_MORB_ID;
    end;
    if Varx=. then do;
        Varx=lead_Varx;
    end;
end;
run;

I did the same kind of thing for the lag variables, except the initial IF statement is 'if flag_duplicate=1 then do;'.

This method does not seem to work for many duplicate pairs in my data set.

Is there a better way to approach my problem overall, possibly through PROC SQL?

Thanks for reading and any advice offered!


2 Answers

1 vote

I'm assuming that you don't have different values of Trans_id, for example, for the same Province. If that is the case, you can flatten the original data in one go with an UPDATE statement and a BY statement. In my code, the first reference to the dataset, with obs=0, just creates the variables; the second reference populates the values, and the BY statement ensures that only one row is output per Province.

Using this method means you don't need to identify the duplicate values beforehand.

data have;
input Province $  AGE  BRTHYEAR   Trans_id  Morb_id    VarX   flag_duplicate;
datalines;
AB        36    1980       45654      .         .           0
AB        36    1980         .        .        2135         1
ON        26    1990         .        .        8868         0
ON        26    1990         .      35464      8868         1
;
run;

data want;
update have(obs=0) have;
by province;
run;
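The PROC SQL route the question mentions can achieve the same collapse. This is only a sketch, assuming at most one non-missing value per variable within each group, so that MAX() (which ignores missing values) simply picks the non-missing one:

```
/* Sketch: collapse each Province/AGE/BRTHYEAR group to one row.
   Assumes at most one non-missing value per variable per group,
   so MAX() just returns the non-missing value. */
proc sql;
    create table want as
    select Province, AGE, BRTHYEAR,
           max(Trans_id) as Trans_id,
           max(Morb_id)  as Morb_id,
           max(VarX)     as VarX
    from have
    group by Province, AGE, BRTHYEAR;
quit;
```

If a group could contain two different non-missing values for the same variable, MAX() would silently pick the larger one, so check that assumption first.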
0 votes

Something like this should work...

proc sort data=uncleaned_data;
    by Province AGE BRTHYEAR;
run;

data cleaned_data (drop=TRANS_ID MORB_ID VarX
                   rename=(keep_TRANS_ID=TRANS_ID keep_MORB_ID=MORB_ID keep_VarX=VarX));
set uncleaned_data;
by Province AGE BRTHYEAR;
retain keep_TRANS_ID keep_MORB_ID keep_VarX; /* carry values across the BY group */
if first.BRTHYEAR then do;
    keep_TRANS_ID=TRANS_ID;
    keep_MORB_ID=MORB_ID;
    keep_VarX=VarX;
end;
else do;
    if keep_TRANS_ID=. then keep_TRANS_ID=TRANS_ID;
    if keep_MORB_ID=. then keep_MORB_ID=MORB_ID;
    if keep_VarX=. then keep_VarX=VarX;
end;
if last.BRTHYEAR then output;
run;