2
votes

In SAS, how can I create an identifier for each unique combination of a set of variables?

I have, for example, a several thousand observations with a dichotomous value for six variables. There are 2^6 unique combinations for the values of these variables for each observation. I would like to create an identifier for each unique combination, and eventually group my observations according to this value.

Have:

SubjectID     Var1     Var2     Var3     Var4     Var5     Var6
---------------------------------------------------------------
ID1           1        1        1        1        1        1        
ID2           1        0        1        1        1        1  
ID3           0        1        1        1        1        1  
ID4           0        0        1        1        1        0  
...           ...      ...      ...      ...      ...      ...
ID3000        1        1        0        1        0        0  

Want:

SubjectID     Var1     Var2     Var3     Var4     Var5     Var6     Identifier
------------------------------------------------------------------------------
ID1           1        1        1        1        1        1        A        
ID2           1        1        1        1        1        1        A
ID3           0        1        1        1        1        1        B  
ID4           0        0        1        1        1        0        C  
...           ...      ...      ...      ...      ...      ...
ID3000        1        1        0        1        0        0        Z

A would represent 1, 1, 1, 1, 1, 1 as a unique combination and B would represent 0, 1, 1, 1, 1, 1 etc.

I have thought about creating a dummy variable based on 64 Var1-Var6 conditional statements. I've also thought about concatenating the values from Var1-Var6 into a new row to create a unique identifier.

Is there a more straightforward way of going about this?

I prefer an approach that assigns a specific identifier to a specific combination of the values, rather than one that just generates some arbitrary unique string whenever a new combination comes up.

4

4 Answers

1
votes

Proc summary works well with the LEVELS option. This technique works for any values of the group variables numeric or character.

data have;
   input (v1-v6)(1.);
   cards;
111111
111110
111101
111011
110111
;;;;
proc print;
proc summary data=have nway;
   class v1-v6;
   output out=unique(drop=_type_) / levels;
   run;

enter image description here

0
votes

Why not just concatenate the values? So your combinations are:

111111
111110
111101
111011
110111
....

You can use PROC FREQ to check the number of each type.

proc freq data=have;
table var1*var2*var3*var4*var5*var6 / out=want list;
run;
0
votes

By using the unique values of the given variables' combinations and then creating an alphabetical List of Ids, you can create the result

data inp;
length combined $6.;
input subjectid $4. v1 1. v2 1. v3 1. v4 1. v5 1. v6 1.;
combined=compress(v1||v2||v3||v4||v5||v6);
datalines;
ID1 111111
ID2 011111
ID3 001111
ID4 111110
ID5 000111
ID6 111111
ID7 000111
;
run;

proc sql;
create table uniq
as
select distinct combined from inp order by combined desc;
quit;

data uniq1;
 set uniq;
 retain alphabet 65;
 Id=byte(alphabet) ;
 alphabet+1;

 drop alphabet;
run;

proc sql;
create table final_ds
as
select subjectid, v1, v2, v3, v4, v5, v6, Id
from inp a
left join uniq1 b
on a.combined=b.combined;
quit;
0
votes

Assuming the data is sorted by your grouping variables then just use BY group processing.

data want;
  set have;
  by var1-var6 ;
  groupid + first.var6 ;
run;

Or you could just convert the 6 binary variables into a single unique value.

group2 = input(cats(of var1-var6),binary6.);

This has the added value of not requiring that you sort the data, but it does need for none of the grouping variables to be missing.

Result

SubjectID  Var1 Var2 Var3 Var4 Var5 Var6 Identifier Want groupno group2
ID4  0  0  1  1  1  0 C  1  14
ID3  0  1  1  1  1  1 B  2  31
ID1  1  1  1  1  1  1 A  3  63
ID2  1  1  1  1  1  1 A  3  63