How can I eliminate subset combinations of binary indicators?

Question

I have a data set where each observation is a combination of binary indicator variables, but not necessarily all possible combinations. I'd like to eliminate observations that are subsets of other observations. As an example, suppose I had these three observations:

var1 var2 var3 var4
   0    0    1    1
   1    0    0    1
   0    1    1    1

In this case, I would want to eliminate observation 1, because it's a subset of observation 3. Observation 2 isn't a subset of anything else, so my output data set should contain observations 2 and 3.

Is there an elegant and preferably fast way to do this in SAS? My best solution thus far is a brute-force loop through the data set, using a second SET statement with the POINT= option to see if the current observation is a subset of any others. These data sets could become huge once I start working with a lot of variables, so I'm hoping to find a better way.

Something in the range of 20-30 binaries, so a total number of potential combinations in the millions. — MDe
Ah, that's not terribly huge then. Go with whatever you can implement and understand effectively. If you're anywhere near the actual maximum capacity of the combinations, you'll likely have a '111111111111111111111' and be able to reduce quickly. — Joe

Joe · Accepted Answer · 2014-01-21T17:31:14

First off, one consideration: is it possible for one row to have 1 for all indicators? You should check for that first. If such a row exists, every other row is a subset of it, so it is the only row in the solution.
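A minimal sketch of that check, assuming the input data set is called have and the indicators are named var1-var20 (as in the example further down); the output name all_ones is only a placeholder:

```sas
/* Look for a row whose indicators are all 1 and stop at the first one found. */
data all_ones;
  set have;
  array vars var1-var20;
  /* The sum of 0/1 indicators equals their count only when every one is 1 */
  if sum(of vars[*]) = dim(vars) then do;
    output;
    stop;
  end;
run;
```

If all_ones ends up with one observation, that row alone is the answer and the rest of the work can be skipped.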

POINT= is inefficient, but loading the rows into a hash table isn't a terribly bad way to do it. Just load up a hash table keyed on a string of the binary indicators CATted together, and then search that table.

First, use PROC SORT NODUPKEY to eliminate the exact matches. Unless you have a very large number of indicator variables, this will eliminate many rows.

Then sort so that the more "complicated" rows are at the top and the less complicated at the bottom. This might be as simple as creating a variable that is the sum of the binary indicators and sorting by it in descending order; or, if your data suggests it, sorting by a particular order of indicators (if some are more likely to be present). The purpose is to reduce the number of comparisons: if the likely matches are on top, we leave the loop sooner.

Finally, use a hash iterator to walk the list from the most complex rows to the least complex, checking whether the current row is a subset of any of them.

See below for a partially tested example. I haven't verified that it catches every row that should be eliminated, but in my run it eliminated around half of the rows, which sounds reasonable.

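data have;
/* Simulate 10,000 rows of 20 random 0/1 indicators */
array vars var1-var20;
do _u = 1 to 1e4;
    do _t = 1 to dim(Vars);
        vars[_t] = round(ranuni(7),1);
    end;
    complexity = sum(of vars[*]);   /* number of indicators set to 1 */
    indicators = cats(of vars[*]);  /* the indicators as one string of 0s and 1s */
    output;
end;
drop _:;
run;

/* Remove exact duplicate combinations */
proc sort nodupkey data=have;
by indicators;
run;

/* Put the rows with the most 1s first */
proc sort data=have;
by descending complexity;
run;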


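data want;
/* Load HAVE into a hash ordered by descending complexity, then indicators */
if _n_ = 1 then do;
  format indicators $20.;
  call missing(indicators, complexity);
  declare hash indic(dataset:'have', ordered:'d');
  indic.defineKey('complexity','indicators');
  indic.defineData('complexity','indicators');
  indic.defineDone();
  declare hiter inditer('indic');
end;

set have(drop=indicators rename=complexity=thisrow_complex); *assuming have has a variable, "indicators", like "0011001";
array vars var1-var20;
rc=inditer.first();
rowcounter=1;
/* With duplicates removed, a superset must have strictly more 1s, */
/* so stop at equal complexity (this also skips the row itself)    */
do while (rc=0 and complexity gt thisrow_complex);
    /* Leave at the first indicator set in this row but not in the hash row */
    do _t = 1 to dim(vars);
      if vars[_t]=1 and char(indicators,_t) ne '1' then leave;
    end;
    /* The inner loop ran to completion, so this row is a subset: drop it */
    if _t gt dim(Vars) then delete;
    else rc=inditer.next();
    rowcounter=rowcounter+1;
end;
run;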
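Since the example is only partially tested, a brute-force check on the result may be worth running. The sketch below is not part of the approach above. It assumes want still contains var1-var20, and the names want_keyed, mask and remaining_subsets are placeholders. It packs each row into an integer and uses BAND to count pairs where one row is a proper subset of another. BAND works on values below 2**32, so this only applies with 32 or fewer indicators.

```sas
/* Pack the 0/1 indicators of each remaining row into one integer */
data want_keyed;
  set want;
  array vars var1-var20;
  mask = 0;
  do _t = 1 to dim(vars);
    mask = mask*2 + vars[_t];
  end;
  keep mask;
run;

/* Row a is a proper subset of row b when a AND b equals a and the rows differ */
proc sql;
  select count(*) as remaining_subsets
  from want_keyed as a, want_keyed as b
  where a.mask ne b.mask
    and band(a.mask, b.mask) = a.mask;
quit;
```

A count of 0 would mean that no remaining row is a subset of another. The self-join is quadratic, so it is only practical as a check on the reduced data set or a sample.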
