0
votes

I have a dataset with an identifier, with several obsevations for each identifier, let us call it ident, and a categorical variable var, that can take several values, among them 1.

How do I keep all observations corresponding to a common identifier if for just one of the observations I have var=var1

For instance, with

data Test; 
    input identifier var; 
    datalines;
1023 1
1023 3
1023 5
1064 2
1064 3
1098 1
1098 1
; 

Then I want to keep

  • 1023 1
  • 1023 3
  • 1023 5
  • 1098 1
  • 1098 1
3
Is your data already sorted by identifier and var?Jeff
it can be without particular troubleAnthony Martin

3 Answers

1
votes

The easiest way I can think of is to create a table of the identifier and then join back to it.

data temp_ID;
set TEST;
where var = 1;
run;

proc sql;
create table output_data as select
b.*
from temp_ID a
left join TEST b
on a.identifier=b.identifier;
quit;
1
votes

Assuming your data is already sorted by identifier and var, you can do this with one pass. You can tell at the first line whether or not that identifier should be output.

data want (drop=keeper);
    set test;
    by identifier;
    length keeper 3;
    retain keeper;
    if first.identifier then do;
        if var = 1 then keeper = 1;
        else keeper= 0;
    end;
    if keeper = 1 then output;
run;
1
votes

Here's the one pass solution that works for any arbitrary value. (It is a one pass solution as long as your BY group is small enough to fit into memory, which usually is the case).

%let var1=3;

data want;
  do _n_ = 1 by 1 until (last.identifier);
    set test
    by identifier;
    if var=&var1. then keepflag=1;
  end;
  do _n_ = 1 by 1 until (last.identifier);
    set test;
    by identifier;
    if keepflag then output;
  end;
run;

That's going through the by group once, setting keepflag=1 if any row in the by group is equal to the value, then keeping all rows from that by group. Buffering will mean this doesn't reread the data twice as long as the by group fits into memory.