2
votes

I am using SAS for a large dataset (>20gb). When I run a DATA step, I received the "BY variables are not properly sorted ......" although I sorted the dataset by the same variables. When I ran the PROC SORT again, SAS even said "Input dataset is already sorted, No sorting done" My code is:

proc sort data=output.TAQ;
    by market ric date miliseconds descending type order;
run;

options nomprint;

data markers (keep=market ric date miliseconds type order);
    set output.TAQ;
    by market ric date;

    if first.date;

    * ie do the following once per stock-day;
    * Make 1-second markers;

    /*Type="AMARK"; Order=0; * Set order to zero to ensure that markers get placed before trades and quotes that occur at the same milisecond;
    do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;*/
run;

And the error message was:

ERROR: BY variables are not properly sorted on data set OUTPUT.TAQ.
RIC=CXR.CCP Date=20160914 Time=13:47:18.125 Type=Quote Price=. Volume=. BidPrice=9.03 BidSize=400
AskPrice=9.04 AskSize=100 Qualifiers=  order=116458952 Miliseconds=49638125 exchange=CCP market=1
FIRST.market=0 LAST.market=0 FIRST.RIC=0 LAST.RIC=0 FIRST.Date=0 LAST.Date=1 i=. _ERROR_=1
_N_=43297873
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 43297874 observations read from the data set OUTPUT.TAQ.
WARNING: The data set WORK.MARKERS may be incomplete.  When this step was stopped there were
         56770826 observations and 6 variables.
WARNING: Data set WORK.MARKERS was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
      real time           1:14.21
      cpu time            26.71 seconds
3
Did you get an error message in the log when you ran proc sort?user667489
Definitely need to see more of the log. The observation counts are very strange - you have 'if first.date', so markers should be a subset of output.taq, however at the point where processing stops, ~43.3m obs have been read in from output.taq and ~56.8m have been written out to work.markers...keydemographic
@keydemographic There's an output statement inside a do-loop, so the obs counts may do all sorts of things.user667489
Try adding FORCE option to the PROC SORT. It's possible SAS only thinks it is sorted already. There are apparently situations where the SORTEDBY attribute of a dataset can be wrong.Quentin
Jump to observation #43297873 and check between +/- 5 observations from there to get an idea of why it would be saying it's not sorted in the right order. data check; set output.taq (firstobs=43297868 obs=43297878); run;Stu Sztukowski

3 Answers

1
votes

The error is occurring deep into your data step, at _N_=43297873. That suggests to me that the PROC SORT is working up to a point, but then fails. It is hard to know what the reason is without knowing your SAS environment or how OUTPUT.TAQ is stored.

Some people have reported resource problems or file system limitations when sorting large data sets.

From SAS FAQ: Sorting Very Large Datasets with SAS (not an official source):

  • When sorting in a WORK folder, you must have free storage equal to 4x the size of the data set (or 5x if under Unix)

  • You may be running out of RAM

  • You may be able to use options MSGLEVEL=i and FULLSTIMER to get a fuller picture

Also using options sastraceloc=saslog; can produce helpful messages.

Maybe instead of sorting it, you could break it up into a few steps, something like:

/* Get your market ~ ric ~ date pairs */
proc sql;
   create table market_ric_date as
   select distinct market, ric, date
   from output.TAQ
   /* Possibly an order by clause here on market, ric, date */
; quit;

data millisecond_stuff;
  set market_ric_date; 
  *Possibly add type/order in this step as well?;
  do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;
run;

/* Possibly a third step here to add type / order if you need to get from original data source */
0
votes

If your source dataset is in a database, it may be sorted in a different collation.

Try the following before your sort:

options sortpgm=sas;
0
votes

I had the same error, and the solution was to make a copy of the original table in the work directory, do the sort, and then the "by" was working.

In your case something like below:

data tmp_TAQ;
    set output.TAQ;
run;

proc sort data=tmp_TAQ;
    by market ric date miliseconds descending type order;
run;

data markers (keep=market ric date miliseconds type order);
    set tmp_TAQ;
    by market ric date;

    if first.date;

    * ie do the following once per stock-day;
    * Make 1-second markers;

    /*Type="AMARK"; Order=0; * Set order to zero to ensure that markers get placed before trades and quotes that occur at the same milisecond;
    do i=((9*60*60)+(30*60)) to (16*60*60); miliseconds=i*1000; output; end;*/
run;