1
votes

I have a dataset analogous to the simplified table below (let's call it "DS_have"):

SurveyID    Participant FavoriteColor   FavoriteFood    SurveyMonth
S101        G92         Blue            Pizza           Jan
S102        B34         Blue            Cake            Feb
S103        Z28         Green           Cake            Feb
S104        V11         Red             Cake            Feb
S105        P03         Yellow          Pizza           Mar
S106        A71         Red             Pizza           Mar
S107        C48         Green           Cake            Mar
S108        G92         Blue            Cake            Apr
...

I'd like to create a set of numeric variables that identify the discrete categories/levels of each variable in the dataset above. The result should look like the following dataset ("DS_want"):

SurveyID    Participant FavoriteColor   FavoriteFood    SurveyMonth ColorLevels FoodLevels  ParticipantLevels   MonthLevels
S101        G92        Blue             Pizza           Jan                   1          1                  1             1
S102        B34        Blue             Cake            Feb                   1          2                  2             2
S103        Z28        Green            Cake            Feb                   2          2                  3             2
S104        V11        Red              Cake            Feb                   3          2                  4             2
S105        P03        Yellow           Pizza           Mar                   4          1                  5             3
S106        A71        Red              Pizza           Mar                   3          1                  6             3
S107        C48        Green            Cake            Mar                   2          2                  7             3
S108        G92        Blue             Cake            Apr                   1          1                  1             4
...

Essentially, I want to know what syntax I should use to generate unique numerical values for each "level" or category of variables in the DS_Have dataset. Note that I cannot use conditional if/then statements to create the values in the ":Levels" variables for each category, as the number of levels for some variables is in the thousands.

2
If you can/do use PROC IML edit with that tag as that may be easier than base SAS...Joe

2 Answers

2
votes

One straightforward solution is to use proc tabulate to generate a tabulated list, then iterate over that and create informats to convert the text to a number; then you just use input to code them.

*store variables you want to work with in a macro variable to make this easier;
%let vars=FavoriteColor FavoriteFood SurveyMonth;

*run a tabulate to get the unique values;
proc tabulate data=have out=freqs;
  class &vars.;
  tables (&vars.),n;
run;

*if you prefer to have this in a particular order, sort by that now - otherwise you may have odd results (as this will).  Sort by _TYPE_ then your desired order.;


*Now create a dataset to read in for informat.;
data for_fmt;
  if 0 then set freqs;
  array vars &vars.;
  retain type 'i';
  do label = 1 by 1 until (last._type_);  *for each _type_, start with 1 and increment by 1;
    set freqs;
    by _type_ notsorted;
    which_var = find(_type_,'1');  *parses the '100' value from TYPE to see which variable this row is doing something to.  May not work if many variables - need another solution to identify which (depends on your data what works);

    start = coalescec(vars[which_var]);
    fmtname = cats(vname(vars[which_var]),'I');
    output;
    if first._type_ then do; *set up what to do if you encounter a new value not coded - set it to missing;
      hlo='o';  *this means OTHER;
      start=' ';
      label=.;
      output;
      hlo=' ';
      label=1;
    end;
  end;
run;

proc format cntlin=for_fmt;  *import to format catalog via PROC FORMAT;
quit;

Then code them like this (you might create a macro to do this looping over the &vars macro variable).

data want;
  set have;
  color_code = input(FavoriteColor,FavoriteColorI.);
run;
0
votes

Another approach - create a hash object to keep track of the levels encountered for each variable, and read the dataset twice via a double DOW-loop, applying the level numbers on the second pass. It's perhaps not as elegant as Joe's solution, but it should use slightly less memory and I suspect it will scale to a somewhat larger number of variables.

%macro levels_rename(DATA,OUT,VARS,NEWVARS);
    %local i NUMVARS VARNAME;

    data &OUT;
    if 0 then set &DATA;
    length LEVEL 8;
    %let i = 1;
    %let VARNAME = %scan(&VARS,&i);
    %do %while(&VARNAME ne );
        declare hash h&i();
        rc = h&i..definekey("&VARNAME");
        rc = h&i..definedata("LEVEL");
        rc = h&i..definedone();
      %let i = %eval(&i + 1);
      %let VARNAME = %scan(&VARS,&i);
    %end;
    %let NUMVARS = %eval(&i - 1);
    do _n_ = 1 by 1 until(eof);
        set &DATA end = eof;
      %do i = 1 %to &NUMVARS;
        LEVEL = h&i..num_items + 1;
        rc = h&i..add();
      %end;
    end;
    do _n_ = 1 to _n_;
      set &DATA;
      %do i = 1 %to &NUMVARS;
        rc = h&i..find();
        %scan(&NEWVARS,&i) = LEVEL;
      %end;
      output;
    end;
    drop LEVEL;
    run;
%mend;

%levels_rename(sashelp.class,class_renamed,NAME SEX, NAME_L SEX_L);