3 votes

I am a DBA / R user. I just took a job in an office full of SAS users and I am trying to understand better how SAS' proc sql works. I understand that SAS includes a relational database and it includes the ability to run proc sql against external servers like Oracle. I am trying to better understand when / how it decides to use the database server rather than its internal database system.

I have seen some really S. L. O. W. SAS code where my coworkers run a series of proc sql commands. These programs typically include 3-5 proc sql steps, and each step creates a local SAS table. They are not using passthrough SQL. The data sets are large (1 million+ rows) and these proc sql steps run slowly. Most of the data lives on the server. There is usually a small table, stored as a SAS data file, that defines the population we want to look at, but everything else lives on the server.

I have demonstrated dramatic improvements in speed simply by running all of the queries directly on the server. (Oracle in this case, but I don't think that matters.) Usually I first have to upload a table to my personal schema that defines the population of clients we want to examine; everything else is already on the server. Sometimes I collapse their queries because they can be done in a single step, but I do not believe that is why my version of their program is so much faster.

I think proc sql uploads the initial data set and then runs the first query on the server. It then downloads the output to the local computer, creating the local SAS data set. For the second proc sql step, it uploads the table created in step one back to the server and then runs the query on the server. To make this all even worse, the "local" SAS data sets are actually stored on a remote server, not the actual local machine. This is invisible to SAS, but it does mean we are copying data across the network yet again. I believe SAS is running slowly because of a large amount of unnecessary network traffic.

Question #1 - Is my understanding of what proc sql is doing correct? Are we really wasting as much time as I think, uploading and downloading large tables / data sets across our network?

Question #2 - Is there some way to control when proc sql runs against the server versus when it runs against the local database? In some cases, if we could prevent the upload / download steps, the query would run more efficiently.


1 Answer

9 votes

Short answer

Your understanding is not exactly correct, but it's in the right ballpark. PROC SQL is probably not sending the SAS dataset to the server; it is more likely downloading the server data to SAS, and probably the entire table rather than just the rows that satisfy the join criteria. Your solution is exactly what I would suggest doing; hopefully your colleagues will get on board.


Long answer

In terms of how the processing works, it depends on your code. PROC SQL will execute code locally (as in, on the SAS server/desktop) unless it decides to pass the query up to the database server and hasn't been told it's not allowed to. That's called implicit passthrough. You can't really control it except to turn it off entirely (with NOIPASSTHRU on the PROC SQL statement). You can sometimes inspect it using options msglevel=i; (a system option) and the _METHOD or _TREE options to see what SQL decided to do (similar to an explain plan).
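For example, here's a minimal sketch of inspecting or suppressing implicit passthrough (table names are hypothetical; oralib is an Oracle libname as in the example further down):

options msglevel=i;  /* log extra notes about what the SQL planner did */
/* options sastrace=',,,d' sastraceloc=saslog;  would also show the SQL that SAS/ACCESS sends to Oracle */

proc sql _method _tree;  /* print the chosen query plan, similar to explain plan */
select a.id, b.cost
  from oralib.tableA a inner join oralib.tableB b
  on a.id=b.id;
quit;

proc sql noipassthru;  /* forbid implicit passthrough; everything runs in SAS */
select a.id, b.cost
  from oralib.tableA a inner join oralib.tableB b
  on a.id=b.id;
quit;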

I've had cases where implicit passthrough caused harm: SQL Server runs character comparisons case-insensitively while SAS does not, and I had a particular query that was sometimes sent up to the server and sometimes not, depending on details of the data. I wasn't careful enough with checking case, so it appeared to work when it really wasn't correct (it was comparing a PROPCASE value to an UPCASE value).
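A tiny illustration of that trap (hypothetical values; the SQL Server side assumes its usual case-insensitive default collation):

/* In SAS, character comparison is case-sensitive: this puts NOT EQUAL */
data _null_;
  if 'Smith' = 'SMITH' then put 'EQUAL';
  else put 'NOT EQUAL';
run;

/* The same comparison passed through to a case-insensitive SQL Server
   collation evaluates as true, so one query can return different rows
   depending on where it happens to execute. */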

The general rule is that SAS will try to send the query to the server if:

  • All of the data in the query already resides on the server
  • The query is simple enough that SAS can easily figure out how to express it in the server's native SQL dialect

If you're running a query that involves a local SAS dataset (say, joining a server table to a SAS dataset), it won't (at least as far as I know) go to the server. It will always run locally, which means downloading all of the data in the contributing server tables (possibly filtered, if there is a logical filter in the query). For example (these aren't necessarily good SQL code, just illustrations of the concept):

libname oralib oracle [connection info];
proc sql;
*Likely will pass through;
select tableA.*, tableB.cost 
  from oralib.tableA inner join oralib.tableB 
  on tableA.id=tableB.id;
*Will probably not pass through;
select tableA.*, tableB.cost 
  from oralib.tableA inner join work.tableB
  on tableA.id=tableB.id;
*Might pass through, might not;
select tableA.*, tableB.cost, tableC.productID
  from oralib.tableA inner join oralib.tableB 
  on tableA.id=tableB.id
  left join oralib.tableC
  on tableA.id=tableC.id;
*This downloads the data but probably applies the where clause server side;
select tableA.*, tableB.cost 
  from oralib.tableA inner join work.tableB
  on tableA.id=tableB.id
  where tableA.date < '01JAN2010'd;
quit;

In the case of the second query, it probably pulls all of tableA down. In the fourth query, it will likely pass the where clause to the server (assuming the date doesn't cause a problem, and it shouldn't: SAS knows how to convert SAS date literals to Oracle dates).

Note that SAS procs can also generate passthrough. PROC MEANS and the like will send instructions to Oracle to compute the means/sums/etc. if they can easily do so.
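For instance, a sketch (column names hypothetical) of a summary that SAS can push into the database when in-database processing is available:

options sqlgeneration=dbms;  /* ask SAS to generate DBMS-side SQL for supported procs */

proc means data=oralib.tableB sum mean;
  var cost;  /* the sum and mean can be computed in Oracle; only the results come back */
run;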

Your best bet is to:

  1. Try to do everything in passthrough that you can (and that makes sense). The only way to be sure a query goes to the server is to use explicit passthrough (see the sketch after this list).
  2. If you have a large table on the server and a small table in SAS, upload the SAS table to the server. A passthrough session and a libname session can't see each other's session-specific temporary tables, so you'd have to use a global temporary table (GTT) or similar (something all users can see). Similarly, if you have a large table in SAS and a small table (or small query result) on the server, bring it down locally (through passthrough if necessary).
  3. When you do have to bring things down, limit as much as possible. When I worked in that kind of environment, I made huge time savings simply by joining to tables on the server to limit my result set before bringing anything down.
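Here is a rough sketch of points 1 and 2 together (connection details, schema, and table names are all placeholders, not a tested implementation):

proc sql;
connect to oracle (user=myuser password=mypass path=mydb);

/* Point 1: explicit passthrough - the join and aggregation run entirely
   in Oracle, and only the final result crosses the network */
create table work.results as
select * from connection to oracle
  ( select a.id, sum(b.cost) as total_cost
      from myschema.tableA a
           inner join myschema.tableB b on a.id = b.id
     group by a.id );

disconnect from oracle;
quit;

/* Point 2: upload the small SAS population table so the server can join to it */
data oralib.population;  /* assumes oralib points at your personal schema */
  set work.population;
run;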

At the end of the day, you will be constrained by network traffic no matter what you do; just try to optimize it as best you can. It sounds like you already understand how to do that, so just do what you would normally do in non-SAS environments.