I am a DBA / R user. I just took a job in an office full of SAS users and I am trying to understand better how SAS' proc sql works. I understand that SAS includes a relational database and it includes the ability to run proc sql against external servers like Oracle. I am trying to better understand when / how it decides to use the database server rather than its internal database system.
I have seen some really S. L. O. W. SAS code where my coworkers running a series of proc sql commands. These programs typically include 3 - 5 proc sql steps. Each proc sql command creates a local SAS table. They are not using passthrough sql. The data sets are large (1 million rows +) and these proc sql steps run slowly. Most of the data lives on the server. There is usually a small table that defines the population that we want to look at and it is in a SAS data file, but everything else lives on the server.
I have demonstrated dramatic improvements in speed by simply running all of the queries directly on the server. (Oracle in this case, but I don't think that is important.) Usually, I have to first upload a table to my personal schema that defines the population of clients we want to examine. Everything else is on the server. Sometimes I collapse their queries together because they can be done in a single step, but I do not believe that is why my version of their program is so much faster.
I think proc sql uploads the initial data set and then runs the first query on the server. It then downloads the output to the local computer, creating the local SAS data set. For the second proc sql step, it uploads the table created in step one back to the server and then runs the query on the server. To make this all even worse, the "local" SAS data sets are actually stored on a remote server, not the actual local machine. This is invisible to SAS, but it does mean we are copying data across the network yet again. I believe SAS is running slowly because of a large amount of unnecessary network traffic.
Question #1 - Is my understanding of what proc sql is doing correct? Are we really wasting as much time as I think we are uploading and downloading large tables / data sets across our network?
Qeustion #2 - Is there some way to control when proc sql runs against a server versus when it runs against the local database? In some cases, if we could prevent the upload / download step, the query would run more efficiently.