SAS Array to Proc SQL

Question

In a very large insurance claims dataset, I have to flag the observations as hiv_screen if any of the diagnostic codes ICD1 to ICD25 is either '79571' or 'R75' and likewise for other flags.

The data steps takes I wrote:

data inpatient_claims_2018 (keep=ID, ICD1-ICD25, hiv_screen, ...);
   set cl2018.ip_claims;

   array diag[25] ICD1-ICD25;
   do i to 25;
       if diag(i) in: ('79571', 'R75') then hiv_screen=1;
       if diag(i) in: ......
   end;

   *retain those that were flagged;
   if hiv_screen=1 or ...=1;
run;

takes a long time and I hope using proc SQL will be faster. How would I best convert the SAS array to a SQL statement?

Why do you think changing to SQL is going to have any impact on the speed? Do you mean that you moved the data into some large external database that can process it faster than the SAS server can? — Tom
How large is very large ? What is the LIBNAME CL2018 statement you are using ? What data base are you connecting to? Is it local network or a local/remote network data base ? What engine are you using ? — Richard
Why are you flagging records and not selecting them? Which percentage would you expect will be flagged? Do you need most of the fields on your dataset, or only some? — Dirk Horsten
I am analyzing medicare claims, so millions of claims. I'm not sure to be exact because there are so many files that I need to merge.Yes, I select the variable I needed in the dataset and retain observations that I flagged (thanks for catching that, I will edit that). — easybreezy
For using Proc SQL over SAS data steps, I read that Proc SQL supports explicit passthrough to the database and the DATA step sequentially reads each observation, so PROC SQL may be an advantage when dealing with large dateset — easybreezy

Dirk Horsten Dirk Horsten · Accepted Answer · 2020-04-20T14:01:49

Before we start

As you call your flag hiv_screen, I assume you actually only need the observations you will flag. If so, you better select them instead of flagging them.
If you will only use the results once, it is better to create a view than a table. This way you avoid a lot of disk access.
All logic in a data step is applied after building the variable vector, except that in the where statement and I assume the same is true for the where clause in a SQL query. Therefore, it is faster to select data in a where statement/clause than to flag them.
Data steps are not intrinsically slower than SQL queries. If they are, you don't use them well. For instance, with a keep statement, you select which variable you want and don't write others to disk anymore.

Macro's as an alternative for arrays

A macro is very comparable to pre-compiler code in, for instance C++. It adds statements to your program.

%macro any_ICD_in (string_list, ICD_count);
    ICD1 in (&string_list)
    %do icd_nr = 2 %to &ICD_count;
        or ICD%icd_nr in (&string_list)
    %end;
%mend;

%* test the macro;
%put %any_ICD_in('79571', 'R75');

Using the macro

in a data step

data inpatient_claims_2018;
   set cl2018.ip_claims;

   hiv_screen = %any_ICD_in('79571', 'R75');
   covid19_screen = %any_ICD_in('I', 'dont', 'know');
   ...
run;

Improving performance by only reading relevant data

in a data step

data inpatient_claims_2018;
   set cl2018.ip_claims;

   where %any_ICD_in('79571', 'R75')
      or %any_ICD_in('I', 'dont', 'know')
      or ...
      ;

   hiv_screen = %any_ICD_in('79571', 'R75');
   covid19_screen = %any_ICD_in('I', 'dont', 'know');
   ...

   keep hiv_screen covid19_screen ... <and all other variable you really need>;
run;

Alternatively avoid writing the data to disk at all

in a data step

data inpatient_claims_2018 / view=inpatient_claims_2018;
   set cl2018.ip_claims;

   hiv_screen = %any_ICD_in('79571', 'R75');
   covid19_screen = %any_ICD_in('I', 'dont', 'know');
   ...
run;

I will not take the time to add SQL solutions today, but I might tomorrow.

SAS Array to Proc SQL

1 Answers

Before we start

Macro's as an alternative for arrays

Using the macro

in a data step

Improving performance by only reading relevant data

in a data step

Alternatively avoid writing the data to disk at all

in a data step