2
votes

I have a data set where each observation is a combination of binary indicator variables, but not necessarily all possible combinations. I'd like to eliminate observations that are subsets of other observations. As an example, suppose I had these three observations:

var1 var2 var3 var4
   0    0    1    1
   1    0    0    1
   0    1    1    1

In this case, I would want to eliminate observation 1, because it's a subset of observation 3. Observation 2 isn't a subset of anything else, so my output data set should contain observations 2 and 3.

Is there an elegant and preferably fast way to do this in SAS? My best solution thus far is a brute force loop through the data set using a second set statement with the point option to see if the current observation is a subset of any others, but these data sets could become huge once I start working with a lot of variables, so I'm hoping to find a better way.

2
Something in the range of 20-30 binaries, so a total number of potential combinations in the millions.MDe
Ah, that's not terribly huge then. Go with whatever you can implement and understand effectively. If you're anywhere near the actual maximum capacity of the combinations, you'll likely have a '111111111111111111111' and be able to reduce quickly.Joe

2 Answers

3
votes

First off, one consideration: is it possible for one row to have 1 for all indicators? You should check for that first - if one row does have all 1s, then it will always be the unique solution.

_POINT_ is inefficient, but loading into a hash table isn't a terribly bad way to do it. Just load up a hash table with a string of the binary indicators CATted together, and then search that table.

First, use PROC SORT NODUPKEY to eliminate the exact matches. Unless you have a very large number of indicator variables, this will eliminate many rows.

Then, sort it in an order where the more "complicated" rows are at the top, and the less complicated at the bottom. This might be as simple as making a variable which is the sum of binary indicators and sort by that descending; or if your data suggests, might be sorting by a particular order of indicators (if some are more likely to be present). The purpose of this is to reduce the number of times we search; if the likely matches are on top, we will leave the loop faster.

Finally, use a hash iterator to search the list, in descending order by the indicators variable, for any matches.

See below for a partially-tested example. I didn't verify that it eliminated every valid elimination, but it eliminates around half of the rows, which sounds reasonable.

data have;
array vars var1-var20;
do _u = 1 to 1e4;
    do _t = 1 to dim(Vars);
        vars[_t] = round(ranuni(7),1);
    end;
    complexity = sum(of vars[*]);
    indicators = cats(of vars[*]);
    output;
end;
drop _:;
run;

proc sort nodupkey data=have;
by indicators;
run;

proc sort data=have;
by descending complexity;
run;


data want;
if _n_ = 1 then do;
  format indicators $20.;
  call missing(indicators, complexity);
  declare hash indic(dataset:'have', ordered:'d');
  indic.defineKey('indicators');
  indic.defineData('complexity','indicators');
  indic.defineDone();
  declare hiter inditer('indic');
end;

set have(drop=indicators rename=complexity=thisrow_complex); *assuming have has a variable, "indicators", like "0011001";
array vars var1-var20;
rc=inditer.first();
rowcounter=1;
do while (rc=0 and complexity ge thisrow_complex);
    do _t = 1 to dim(vars);
      if vars[_t]=1 and char(indicators,_t) ne '1' then leave;
    end;
    if _t gt dim(Vars) then delete;
    else rc=inditer.next();
    rowcounter=rowcounter+1;
end;
run;
1
votes

I'm prety sure their is probably a more math-oriented way of doing this but for now this what I can think about. Proceed with caution as I only checked on a small no. of test cases.

My pseudo-algorithm: (Bit pattern= concatenation of all binary variables into a string.)

  1. get a unique list of binary(bit) patterns which is sorted in asceding order (this is what the PROC SQL step is doing
  2. Set up two arrays. 1 array to track the variables (var1 to var4) on the current row and another array to track lag(of var1 to var4).
  3. If a bit pattern is a subset of another bit pattern (lagged version) then "dot product vector multiplication" should give back the lagged bit pattern.
  4. If bit pattern = lagged_bit_pattern then flag that pattern to be excluded.

In the last data step you will get the list of bit patterns you need to exclude. NOTE: this approach does not take care of duplicate patterns such as the following: record1: 1 0 0 1 record2: 1 0 0 1 which can be easily excluded via PROC SORT & NODUPKEY.

/*sample data*/
data sample_data;
input id var1-var4;
bit_pattern = compress(catx('',var1,var2,var3,var4));
datalines;
 1 0 0 1 1
 2 1 0 0 1
 3 0 1 1 1
 4 0 0 0 1
 5 1 1 1 0
;
run;
/*in the above example, 0001 0011 need to be eliminated. These will be highlighted in the last datastep*/
/*get unique combination of patterns in the dataset*/
proc sql ;
create table all_poss_patterns as 
select 
var1,var2, var3,var4,
 count(*) as freq
from sample_data
group by var1,var2, var3,var4
order by var1,var2, var3,var4;
quit;

data patterns_to_exclude;
set all_poss_patterns;
by var1-var4;

length lagged1-lagged4 8;
array first_array{*} var1-var4;
array lagged{*}lagged1-lagged4;

length bit_pattern $32.;
length lagged_bit_pattern $32.;
bit_pattern = '';
lagged_bit_pattern='';


do i = 1 to dim(first_array);
lagged{i}=lag(first_array{i});
end;


do i = 1 to dim(first_array);
               bit_pattern=cats("", bit_pattern,lagged{i}*first_array{i});
               lagged_bit_pattern=cats("",lagged_bit_pattern,lagged{i});
end;
if bit_pattern=lagged_bit_pattern then exclude_pattern=1;
else exclude_pattern=0;
/*uncomment the following two lines to just keep the patterns that need to be excluded*/
/*if bit_pattern ne '....' and exclude_pattern=1;*/ /*note the bit_pattern ne '....' the no. of dots should equal no. of binary vars*/
/*keep bit_pattern;*/
run;