4 votes

The problem I am trying to solve:

I have a SAS dataset work.testData (in the work library) that contains 8 columns and around 1 million rows. All columns are text (i.e. no numeric data). The SAS dataset is around 100 MB in file size. My objective is to have a step that transfers this entire SAS dataset into Oracle, i.e. sort of like a "copy and paste" of the SAS dataset from the SAS platform to the Oracle platform. The rationale behind this is that, on a daily basis, the table in Oracle gets "replaced" by the one in SAS so that downstream Oracle processes can run against it.

My approach to solve the problem:

One-off initial setup in Oracle:

  1. In Oracle, I created a table called testData with a table structure pretty much identical to the SAS dataset testData (i.e. same table name, same number of columns, same column names, etc.).

On-going repeating process:

  1. In SAS, do a SQL pass-through to truncate ora.testData (i.e. remove all rows whilst keeping the table structure). This ensures ora.testData is empty before inserting from SAS.
  2. In SAS, a LIBNAME statement to assign the Oracle database as a SAS library (called ora), so I can "see" what's in Oracle and perform reads/updates from SAS.
  3. In SAS, a PROC SQL procedure to "insert" data from the SAS dataset work.testData into the Oracle table ora.testData.

Sample code

One-off initial setup in Oracle:

Step 1: Run this Oracle SQL script in Oracle SQL Developer (to create the table structure for table testData, with 0 rows of data to begin with).

DROP TABLE testData;
CREATE TABLE testData
  (
    NODENAME          VARCHAR2(64) NOT NULL,
    STORAGE_NAME      VARCHAR2(100) NOT NULL,
    TS                VARCHAR2(10) NOT NULL,
    STORAGE_TYPE      VARCHAR2(12) NOT NULL,
    CAPACITY_MB       VARCHAR2(11) NOT NULL,
    MAX_UTIL_PCT      VARCHAR2(12) NOT NULL,
    AVG_UTIL_PCT      VARCHAR2(12) NOT NULL,
    JOBRUN_START_TIME VARCHAR2(19) NOT NULL
  )
;
COMMIT;

On-going repeating process:

Steps 2, 3 and 4: Run this SAS code in SAS

******************************************************;
******* On-going repeatable process starts here ******;
******************************************************;

*** Step 2: Truncate the Oracle table testData (remove all rows, keep the structure);
proc sql;
  connect to oracle (user=XXX password=YYY path=ZZZ);
  execute (
    truncate table testData
  ) by oracle;
  execute (
    commit
  ) by oracle;
  disconnect from oracle;
quit;

*** Step 3: Assign Oracle DB as a libname;
LIBNAME ora Oracle user=XXX password=YYY path=ZZZ dbcommit=100000;

*** Step 4: Insert data from SAS to Oracle;
PROC SQL; 
  insert into ora.testData
  select NODENAME length=64,
         STORAGE_NAME length=100,
         TS length=10,
         STORAGE_TYPE length=12,
         CAPACITY_MB length=11,
         MAX_UTIL_PCT length=12,
         AVG_UTIL_PCT length=12,
         JOBRUN_START_TIME length=19
  from work.testData; 
QUIT;

******************************************************;
**** On-going repeatable process ends here       *****;
******************************************************;

The limitation / problem to my approach:

The PROC SQL step (which transfers 100 MB of data from SAS to Oracle) takes around 5 hours to complete - the job takes too long to run!

The Question:

Is there a more sensible way to perform data transfer from SAS to Oracle? (i.e. updating an Oracle table from SAS).


1 Answer

6 votes

First off, you can do the drop/recreate from SAS if that's a necessity. I wouldn't drop and recreate each time - a truncate seems an easier way to get the same result - but if you have other reasons then that's fine; either way you can use execute (truncate table xyz) by oracle, or execute (drop table xyz) by oracle, via a pass-through connection.
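For example, a minimal pass-through sketch of the drop-and-recreate variant (reusing the placeholder credentials and column definitions from the question) might look like this:

*** Sketch only: drop and recreate via pass-through;
proc sql;
  connect to oracle (user=XXX password=YYY path=ZZZ);
  /* the drop will error if the table does not exist yet */
  execute (drop table testData) by oracle;
  execute (
    create table testData
      (
        NODENAME          VARCHAR2(64) NOT NULL,
        STORAGE_NAME      VARCHAR2(100) NOT NULL,
        TS                VARCHAR2(10) NOT NULL,
        STORAGE_TYPE      VARCHAR2(12) NOT NULL,
        CAPACITY_MB       VARCHAR2(11) NOT NULL,
        MAX_UTIL_PCT      VARCHAR2(12) NOT NULL,
        AVG_UTIL_PCT      VARCHAR2(12) NOT NULL,
        JOBRUN_START_TIME VARCHAR2(19) NOT NULL
      )
  ) by oracle;
  disconnect from oracle;
quit;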

Second, assuming there are no constraints or indexes on the table - which seems likely given you are dropping and recreating it - you may not be able to improve this much, because it may come down to network latency. However, there is one area you should look at in the connection settings (which you don't provide): how often SAS commits the data.

There are two ways to control this, the DBCOMMIT setting and the BULKLOAD setting. The former controls how frequently commits are executed (so if DBCOMMIT=100 then a commit is executed every 100 rows). More frequent commits = less data is lost if a random failure occurs, but much slower execution. DBCOMMIT defaults to 0 for PROC SQL INSERT, which means just make one commit (the fastest option assuming no errors), so this is less likely to be helpful unless you're overriding it.
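Your LIBNAME does override this (dbcommit=100000), so one small thing to try - a sketch only, reusing the same placeholder credentials - is to let it commit just once at the end:

*** Sketch: DBCOMMIT=0 commits once, after the last row is inserted;
LIBNAME ora Oracle user=XXX password=YYY path=ZZZ dbcommit=0;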

Bulkload is probably my recommendation; that uses SQL*Loader (SQLLDR) to load your data, ie, it batches the whole lot over to Oracle and then says 'Load this please, thanks.' It only works with certain settings and certain kinds of queries, but it ought to work here (subject to other conditions - read the SAS/ACCESS documentation on the BULKLOAD option).
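A minimal sketch of that, assuming the SAS/ACCESS Interface to Oracle and that SQL*Loader is available to your SAS session, could be something like:

*** Sketch: bulk-load via SQL*Loader instead of row-by-row inserts;
*** BULKLOAD= and BL_OPTIONS= are SAS/ACCESS to Oracle data set options;
proc append base=ora.testData (bulkload=yes bl_options='ERRORS=0')
            data=work.testData;
run;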

If you're using BULKLOAD, then you may be up against network latency. 5 hours for 100 MB seems slow, but I've seen all sorts of things in my (relatively short) day. If BULKLOAD didn't work I would probably bring in the Oracle DBAs and have them troubleshoot this, starting from a .csv file and a SQL*Loader control file (which should be basically identical to what SAS is doing with BULKLOAD); they should know how to troubleshoot that and at least be able to monitor the performance of the database itself. If there are constraints on other tables that are problematic here (ie, other tables that too frequently recalculate themselves based on your inserts or whatever), they should be able to find that out and recommend solutions.
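If it comes to that, a hedged sketch for producing the test .csv from SAS (the DBAs would pair it with their own SQL*Loader control file) might be:

*** Sketch: dump the dataset to a .csv for a standalone SQL*Loader test;
*** The outfile path below is a placeholder - adjust for your environment;
proc export data=work.testData
            outfile='/tmp/testData.csv'
            dbms=csv
            replace;
run;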

You could look into PROC DBLOAD, which is sometimes faster than SQL inserts (though all in all it shouldn't really be, and it's an 'older' procedure not used much anymore). You could also look into whether you can avoid doing a complete flush and fill (ie, whether there's a way to transfer less data across the network), or even simply shrinking the column sizes.
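For completeness, a rough PROC DBLOAD sketch (using the same placeholder credentials as the question; note that DBLOAD creates the target table, so the existing one would need to be dropped first) might look like:

*** Sketch: PROC DBLOAD, the older SAS/ACCESS loading procedure;
*** Credentials are the same placeholders (XXX/YYY/ZZZ) as above;
proc dbload dbms=oracle data=work.testData;
  user='XXX';
  orapw='YYY';
  path='ZZZ';
  table=testData;
  limit=0;      /* load all observations rather than the default limit */
  load;
run;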