PROC SQL - Counting distinct values across variables

Question

When I look for ways to count distinct entries across multiple columns / variables with PROC SQL, all I find is how to count combinations of values. However, I would like to search through two (character) columns (within rows that meet a certain condition) and count the number of distinct values that appear in either of them.

Consider a dataset that looks like this:

DATA have;
INPUT       A_ID    C   C_ID1 $ C_ID2 $;
DATALINES;
            1       1   abc         .
            2       0   .           .
            3       1   efg         abc
            4       0   .           .
            5       1   abc         kli
            6       1   hij         .
;
RUN;

I now want a table containing the number of unique values within C_ID1 and C_ID2 in rows where C = 1. The result should be 4 (abc, efg, hij, kli):

nr_distinct_C_IDs
4

So far, I have only been able to process one column (C_ID1):

PROC SQL;
    CREATE TABLE try AS
    SELECT
        COUNT(DISTINCT
            (CASE WHEN C=1 THEN C_ID1 ELSE ' ' END)) AS nr_distinct_C_IDs
    FROM have;
QUIT;

(Note that I use CASE processing instead of a WHERE clause since my actual PROC SQL also processes other cases within the same query).

This gives me:

nr_distinct_C_IDs
3

How can I extend this to two variables (C_ID1 and C_ID2 in my example)?

Shenglin Chen · Accepted Answer · 2015-05-10T13:40:09

It is hard to extend your method to two or more variables. Stack the variables first, then count the distinct values, like this:

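proc sql;
   create table want as
   select count(ID) as nr_distinct_C_IDs from
       /* UNION (without ALL) removes duplicate IDs while stacking the two columns */
       (select C_ID1 as ID from have where C = 1
       union 
       select C_ID2 as ID from have where C = 1)
    where not missing(ID);   /* drop the blank row so it is not counted */
quit;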
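
The question mentions that CASE processing is preferred over a WHERE clause because the real query handles other cases as well. A minimal sketch of the same stacking idea under that constraint follows; the table name want_case is only illustrative, and the condition C = 1 is taken from the question. It relies on COUNT(DISTINCT ...) ignoring missing (blank) character values, which is also why the single-column attempt in the question returns 3.

```sas
proc sql;
   create table want_case as
   /* COUNT(DISTINCT ...) skips blank values, so rows blanked by CASE are not counted */
   select count(distinct ID) as nr_distinct_C_IDs
   from (/* stack both ID columns, blanking out rows that fail the condition */
         select case when C = 1 then C_ID1 else ' ' end as ID from have
         union all
         select case when C = 1 then C_ID2 else ' ' end as ID from have);
quit;
```

With the sample data from the question, the remaining distinct values should be abc, efg, hij and kli. UNION ALL is used here because the deduplication is left to COUNT(DISTINCT ...); a plain UNION would work as well.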
