0
votes

Looking for ways of counting distinct entries across multiple columns / variables with PROC SQL, all I am coming across is how to count combinations of values. However, I would like to search through 2 (character) columns (within rows that meet a certain condition) and count the number of distinct values that appear in any of the two.

Consider a dataset that looks like this:

DATA have;
INPUT       A_ID    C   C_ID1 $ C_ID2 $;
DATALINES;
            1       1   abc         .
            2       0   .           .
            3       1   efg         abc
            4       0   .           .
            5       1   abc         kli
            6       1   hij         .
;
RUN;

I now want to have a table containing the count of the nr. of unique values within C_ID1 and C_ID2 in rows where C = 1. The result should be 4 (abc, efg, hij, kli):

nr_distinct_C_IDs
4

So far, I only have been able to process one column (C_ID1):

PROC SQL;
    CREATE TABLE try AS
          SELECT 
        COUNT (DISTINCT 
            (CASE WHEN C=1 THEN C_ID1 ELSE ' ' END)) AS nr_distinct_C_IDs
                FROM have;
QUIT;

(Note that I use CASE processing instead of a WHERE clause since my actual PROC SQL also processes other cases within the same query).

This gives me:

nr_distinct_C_IDs
3

How can I extend this to two variables (C_ID1 and C_ID2 in my example)?

2

2 Answers

1
votes

It is hard to extend this to two or more variables with your method. Try to stack variables first, then count distinct value. Like this:

proc sql;
   create table want as
   select count(ID) as nr_distinct_C_IDs from
       (select C_ID1 as ID from have
       union 
       select C_ID2 as ID from have)
    where not missing(ID);
quit;
0
votes

I think in this case a data step may be a better fit if your priority is to come up with something that extends easily to a large number of variables. E.g.

data _null_;
    length ID $3;
    declare hash h();
    rc = h.definekey('ID');
    rc = h.definedone();
    array IDs $ C_ID1-C_ID2;
    do until(eof);
        set have(where = (C = 1)) end = eof;
        do i = 1 to dim(IDs);
            if not(missing(IDs[i])) then do;
                ID = IDs[i];
                rc = h.add();
                if rc = 0 then COUNT + 1;   
            end;
        end;
    end;
    put "Total distinct values found: " COUNT;
run;

All that needs to be done here to accommodate a further variable is to add it to the array.

N.B. as this uses a hash object, you will need sufficient memory to hold all of the distinct values you expect to find. On the other hand, it only reads the input dataset once, with no sorting required, so it might be faster than SQL approaches that require multiple internal reads and sorts.