5 votes

I've also posted this question on runsubmit.com, a site outside the SE network for SAS-related questions.

At work there are two SAS servers I use. When I transfer a SAS dataset from one to the other via PROC UPLOAD, it runs at about 2.5MB/s. However, if I map a drive on one server as a network drive on the other and copy and paste the file across, it runs much faster, at around 80MB/s (over the same gigabit connection).

Could anyone suggest what might be causing this and what I can do either to fix it or as a workaround?

There is also a third server I use that cannot map network drives on the other two; SAS is the only available means of transferring files from that one, so I need a SAS-based solution. Although individual transfers from this server run at 2.5MB/s, I've found that it's possible to have several transfers all going in parallel, each at 2.5MB/s.

Would SAS FTP via filenames and a data step be any faster than using PROC UPLOAD? I might try that next, but I would prefer not to use it - we only have SAS 9.1.3, so SFTP isn't available and plain FTP isn't secure.

Update - Further details:

  • I'm connecting to a spawner, and I think it uses 'SAS proprietary encryption' (based on what I recall seeing in the logs).
  • The uploads are Windows client -> Windows remote in the first case and Unix client -> Windows remote in the second case.
  • The SAS datasets in question are compressed (i.e. by SAS, not some external compression utility).
  • The transfer rate is similar when using PROC UPLOAD to transfer external files (.bz2) in binary mode.
  • All the servers have very fast disk arrays handled by enterprise-grade controllers (minimum 8 drives in RAID 10).

Potential solutions

  • Parallel PROC UPLOAD - potentially fast enough, but extremely CPU-heavy
  • PROC COPY - much faster than PROC UPLOAD, much less CPU overhead
  • SAS FTP - not secure, unknown speed, unknown CPU overhead

Update - test results

  • Parallel PROC UPLOAD: involves quite a lot of setup* and a lot of CPU, but works reasonably well.
  • PROC COPY: exactly the same transfer rate per session as PROC UPLOAD, and far more CPU time used.
  • FTP: about 20x faster, minimal CPU (100MB/s vs. 2.5MB/s per parallel PROC UPLOAD).

*I initially tried the following:

local session -> remote session on source server -> n remote sessions on destination server -> Recombine n pieces on destination server

Although this resulted in n simultaneous transfers, they each ran at 1/n of the original rate, probably due to a CPU bottleneck on the source server. To get it to work with n times the bandwidth of a single transfer, I had to set it up as:

local session -> n remote sessions on source server -> 1 remote session each on destination server -> Recombine n pieces on destination server
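
For concreteness, here is a minimal sketch of how that second topology can be wired up, shown for one of the n pieces. This is an illustration under assumptions, not my actual code: the host names (srchost, dsthost), the spawner port 7551, the library and dataset names, the row split, and the include-file path are all hypothetical placeholders. The inner sign-on lives in a separate include file because RSUBMIT blocks cannot be nested literally (the first ENDRSUBMIT would close the outer block).

options comamid=tcp;

/* Contents of piece1.sas, staged where the source server can read it.  */
/* It signs on to the destination and pushes this piece across:         */
/*                                                                      */
/*   %let dst=dsthost 7551;                                             */
/*   signon dst user="myuser" password="mypass";                        */
/*   rsubmit dst;                                                       */
/*       proc upload data=work.piece1 out=work.piece1;                  */
/*       run;                                                           */
/*   endrsubmit;                                                        */
/*   signoff dst;                                                       */

/* One remote session on the source server per piece (piece 1 shown).   */
%let src1=srchost 7551;
signon src1 user="&username" password="&password";

rsubmit src1 wait=no;      /* wait=no lets the n pieces run in parallel */
    /* Carve out this session's share of the rows (split is arbitrary). */
    data work.piece1;
        set perm.bigdata(firstobs=1 obs=5000000);
    run;
    %include '/sas/programs/piece1.sas';       /* hypothetical location */
endrsubmit;

/* ...repeat for src2/piece2 and so on, then wait for every piece:      */
waitfor _all_ src1;

/* Finally, recombine the n pieces on the destination server with a     */
/* simple SET statement (not shown).                                    */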

SAS FTP code

filename source ftp '\dir1\dir2'
    host='servername'
    binary dir
    user="&username" pass="&password";

%let work=%sysfunc(pathname(work));
filename target "&work";

/* Read the remote file through the FTP fileref and write it out,      */
/* record by record, into the physical WORK directory.                 */
data _null_;
    infile source('dataset.sas7bdat') truncover;
    input;
    file target('dataset.sas7bdat');
    put _infile_;
run;
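
For what it's worth, in that FILENAME statement the binary option requests an image-mode (binary) transfer, and dir makes the fileref refer to a directory rather than a single file, which is what allows the member-style syntax source('dataset.sas7bdat') and target('dataset.sas7bdat') inside the data step.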
Please update your question with details on the SAS server environments and how you connect, especially if you are connecting to a CONNECT spawner or some other method. If using a spawner, find out if it is using encryption. – BellevueBob
Question updated - what other specific details would be useful? – user667489
Are the SAS data sets you are uploading compressed? And I'm guessing everything is Windows, correct? And when you say you are copying from one server to another, do you mean you are connecting to server B with a SAS/CONNECT session from server A? – BellevueBob
I've noticed this before as well when I was working at my previous company. We stopped using PROC UPLOAD and started using FTP to transfer datasets. I just figured there's a lot of overhead associated with it, much like using a PROC to duplicate a dataset. Doing it in SAS takes many times longer than just copying it via the OS. – Robert Penridge

2 Answers

5 votes

My understanding of PROC UPLOAD is that it performs a record-by-record upload of the file along with some conversions and checks, which is helpful in some ways, but not particularly fast. PROC COPY, on the other hand, will happily copy the file without being quite as careful to maintain things like indexes and constraints, but it will be much faster. You just have to define a libref for your server's files.

For example, I sign on to my server and assign it the 'unix' nickname. Then I define a library on it: libname uwork server=unix slibref=work;
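
For reference, the full setup looks something like this (a sketch only; the host name unixhost and the spawner port 7551 are placeholders for the real server details):

options comamid=tcp;

/* Sign on to the server; the session id 'unix' doubles as the nickname */
/* and as the macro variable holding the host and spawner port.         */
%let unix=unixhost 7551;
signon unix user="&username" password="&password";

/* Remote Library Services: a local libref pointing at the server's own */
/* WORK library (slibref= names the libref as the server knows it).     */
libname uwork server=unix slibref=work;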

Then I execute the following PROC COPY code, using a randomly generated 1e7-row datafile. Following that, I also RSUBMIT a PROC UPLOAD for comparison purposes.

48   proc copy in=work out=uwork;
NOTE: Writing HTML Body file: sashtml.htm
49   select test;
50   run;

NOTE: Copying WORK.TEST to UWORK.TEST (memtype=DATA).
NOTE: There were 10000000 observations read from the data set WORK.TEST.
NOTE: The data set UWORK.TEST has 10000000 observations and 1 variables.
NOTE: PROCEDURE COPY used (Total process time):
      real time           13.07 seconds
      cpu time            1.93 seconds


51   rsubmit;
NOTE: Remote submit to UNIX commencing.
3    proc upload data=test;
4    run;


NOTE: Upload in progress from data=WORK.TEST to out=WORK.TEST
NOTE: 80000000 bytes were transferred at 1445217 bytes/second.
NOTE: The data set WORK.TEST has 10000000 observations and 1 variables.
NOTE: Uploaded 10000000 observations of 1 variables.
NOTE: The data set WORK.TEST has 10000000 observations and 1 variables.
NOTE: PROCEDURE UPLOAD used:
      real time           55.46 seconds
      cpu time            42.09 seconds


NOTE: Remote submit to UNIX complete.

PROC COPY is still not quite as fast as an OS copy, but it's much closer in speed. PROC UPLOAD is actually quite a bit slower than even a regular data step because of the checking it does; in fact, here the data step is comparable to PROC COPY due to the simplicity of the dataset (and probably the fact that I have a 64k block size, meaning the data step is using the server's 16k block size while PROC COPY presumably is not).

52   data uwork.test;
53   set test;
54   run;

NOTE: There were 10000000 observations read from the data set WORK.TEST.
NOTE: The data set UWORK.TEST has 10000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           12.60 seconds
      cpu time            1.66 seconds

In general, in 'real world' situations, PROC COPY is faster than a data step, but both are faster than PROC UPLOAD - unless the complexities of your situation genuinely require PROC UPLOAD (I have never seen a reason to use it, but I know it is possible). I think PROC UPLOAD was more necessary in older versions of SAS but is largely unneeded now; that said, my experience with different hardware setups is fairly limited, so this may not apply to your situation.

0 votes

FTP, if available from the source server, is much faster than PROC UPLOAD or PROC COPY. Both of those operate on a record-by-record basis and can be CPU-bound over fast network connections, especially for very wide datasets. A single FTP transfer will attempt to use all available bandwidth, at negligible CPU cost.

This assumes that the destination server can use the transferred file unmodified - if not, the time required to make it usable might negate the increased transfer speed of FTP.
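
For example, a quick sanity check along these lines (a sketch, assuming the transfer was done with the question's code, which drops dataset.sas7bdat into the physical WORK directory):

/* The transferred file lands in the WORK directory, so it is visible   */
/* as WORK.DATASET; if it came from an incompatible host, this errors.  */
proc contents data=work.dataset;
run;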