I need to delete duplicates from a data set. My issue is that once I sort the data and flag the duplicates (using lag function), some information across variables is present within the duplicate observation and some within the original observation. I need to retain information across all variables while also deleting the duplicates.
My thought was to first fill in all the information between both the original and duplicate before deleting the duplicate.
Example of observations after sorting data and flagging duplicates (fake data values):
Province AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate
AB 36 1980 45654 . . 0
AB 36 1980 . . 2135 1
ON 26 1990 . . 8868 0
ON 26 1990 . 35464 8868 1
What I want:
Province AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate
AB 36 1980 45654 . 2135 0
AB 36 1980 45654 . 2135 1
ON 26 1990 . 35464 8868 0
ON 26 1990 . 35464 8868 1
So I can delete duplicates and eventually have this:
Province AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate
AB 36 1980 45654 . 2135 0
ON 26 1990 . 35464 8868 0
I created lag and lead variables to attempt to fill in information but it only seems to be working on some of the data set.
Here is the code for the lead variables:
data uncleaned_data;
merge uncleaned_data
uncleaned_data(
firstobs=2
keep= TRANS_ID MORB_ID Varx
rename=(TRANS_ID=lead_TRANS_ID MORB_ID=lead_MORB_ID Varx=lead_Varx ));
if lag(flag_duplicate=1) then do;
if TRANS_ID=. then do;
TRANS_ID= lead_TRANS_ID;
end;
if MORB_ID=. then do;
MORB_ID= lead_MORB_ID;
end;
if Varx=. then do;
Varx= lead_Varx;
end;
end;
run;
I did the same kind of thing for lag variables except my initial if statement is 'if flag_duplicate=1 then do;'
This method does not seem to work for many duplicate pairs in my data set.
Is there a better way to approach my problem overall? possibly through proc SQL?
Thanks for reading and any advice offered!