0
votes

I'm pulling data into SAS from a SQL server database with a (SAS DI) extract transformation. The data in the table goes back several years and has a little over 16 million rows; I only need data for the last few years which should amount to roughly 2.6 million rows.

So my extract uses a datetime filter. The autogenerated proc sql looks as follows (after obfuscating the libref, table, and column names):

%let SYSLAST = %nrquote(LIBREF.SRC_TBL); 

proc sql;
   create table work.WFJVX4PU as
      select
         col1,
         col2,
         dt_col,
         col3
   from &SYSLAST
      where dt_col >= &start_dt
   ;
quit;

The code works and returns the desired rows, but it takes far too long to run when compared with executing a similar query directly on the SQL server. When inspecting the statistics of the execution I discovered that the entire table was brought into SAS before the where clause was applied.

I remember reading that this will occur when attempting to use filters that apply functions to source data (a la datepart(dt_col) >= input("&start_date",date9.)) which is why I tried to pass the datetime used by the filter directly to the SQL engine. I also tried the "datepart" approach and got the same result.

Is there something else I should be doing to apply the filter server side before bringing the data into SAS? This isn't the experience I've had in the past when working with other database tables (e.g. Teradata, MySQL, Oracle, etc).

Additional Details:

The macro variable start_date is defined using the SAS DI prompt creation tool; the auto-generated code for this is

%let start_date_label = 730 days ago (July 02, 2017);
%let start_date_rel = D-730D;
%let start_date = 02Jul2017;

I then create the datetime macro variable start_dt by executing the following in the precode of the job:

%let start_dt = %sysfunc(dhms("&start_date"d,0,0,0));

Below is the libname statement (after obfuscation):

LIBNAME MYLIB SQLSVR  CONNECTION=SHARED  PRESERVE_TAB_NAMES=YES  dbconinit="set ansi_warnings off;use MY_DB_NAME;set nocount on;"  Datasrc=MY_DATA_SRC  SCHEMA=dbo  AUTHDOMAIN="my_sql_authentication" ;

If I modify the where clause to use a literal of the form '2019-06-25' then SAS throws the error

ERROR: Expression using greater than or equal (>=) has components that are of different data types.

because the dt_col field is of type numeric (format datetime22.3) on the SAS instance of the SQL server table that is registered from the library. If I use a literal as follows:

dt_col >= '25JUN2019'D

then I do get the desired set of rows (despite comparison between a datetime field and a date literal), but the query still takes a long time to execute and the job statistics indicate SAS is still grabbing all 16 million rows to perform this task.

UPDATE

I'm having issues following Tom's advice below. If I execute the following code:

LIBNAME MYLIB SQLSVR  CONNECTION=SHARED  PRESERVE_TAB_NAMES=YES  dbconinit="set ansi_warnings off;use MYDB;set nocount on;"  Datasrc=MYSRC  SCHEMA=dbo  AUTHDOMAIN="my_sql_authentication" ; 

/*---- Map the columns  ----*/ 
proc datasets lib = work nolist nowarn memtype = (data view);
   delete sql_psthru2;
quit;

proc sql;
   create table work.sql_psthru2 as
      select
         col1,
         col2,
         dt_col,
         col3
   from MYLIB.MYTBL
      where dt_col>= '25JUN2019'd
   ;
quit;

then I get data from the database, but if I execute

LIBNAME MYLIB SQLSVR  CONNECTION=SHARED  PRESERVE_TAB_NAMES=YES  dbconinit="set ansi_warnings off;use MYDB;set nocount on;"  Datasrc=MYSRC  SCHEMA=dbo  AUTHDOMAIN="my_sql_authentication" ;
proc sql;

  connect using MYLIB; 
  create table work.sql_psthru as
    select * from connection to MYLIB
    (select
     col1
    ,col2
    ,dt_col
    from MYTBL
    where dt_col >= '2019-06-25'
    )
   ;
quit;

then I receive the error

ERROR: CLI error trying to establish connection: [DataDirect][ODBC lib] Data source name not found and no default driver specified

immediately after the connect using MYLIB; line. I've tried many variants of the explicit passthru which I've found all over the internet which I won't post here, but none worked.

An interesting side note is that I believe the reason the SAS DI statistics are indicating that all 16 million rows are returned is that the extract transformation auto-generates the following macro:

%macro etls_recordCheck; 
   %let etls_recCheckExist = %eval(%sysfunc(exist(MYLIB.MYTBL, DATA)) or 
         %sysfunc(exist(MYLIB.MYTBL, VIEW))); 

   %if (&etls_recCheckExist) %then
   %do;
      proc sql noprint;
         select count(*) into :etls_recnt from MYLIB.MYTBL;
      quit;
   %end;
%mend etls_recordCheck;
%etls_recordCheck;

So the culprit in the long execution time is not that the full dataset is being returned to SAS (I removed the macro and the code still takes far too long to run).

1
Is connection to an ODBC driver? Please show the libname (redacting privileged info). Please also show the definition to &start_dt. What happens if you pass a literal, '2019-01-01' (non-date)?Parfait
Hey @Parfait. Just updated the question with the info you were requesting here.Rookatu
Did you trying push the query into the database explicitly? You will need to know how that database wants datetime constants.specified.Tom
Much easier to just use "&start_date:00:00:00"dt to specify the datetime constant with midnight for time part, but I doubt it will help with your issue.Tom
Hey @Tom. I have executed the analogue of the query on the SQL server directly and had much better performance. In terms of pushing that query to the SQL server from SAS, I'm not sure how to do that other than with this kind of a PROC SQL approach. It is atypical in my workflow to get data other than the method used here. Could you quickly describe how to do that? Also I will change the datetime specification to follow your advice, much appreciated!Rookatu

1 Answers

0
votes

You could try explicitly writing the code in the remote database. So if you already have a libref named LIBREF defined you can use that in the CONNECT statement in PROC SQL.

proc sql;
 connect using libref ;
 create table WFJVX4PU as
   select * from connection to libef
   (select
      col1
     ,col2
     ,dt_col
     ,col3
    from SRC_TBL
    where dt_col >= &start_dt
  )
 ;
quit;

Just make sure everything inside the () is valid syntax for that database system. Included the values of the macro variable START_DT.