2
votes

While I've read quite a bit about conceptualizing the Program Data Vector when using a SAS data step, I still don't understand how the PDV works when there is by group processing. For example if I have the dataset olddata

 GROUP   VAL
 A       10
 A        5
 B       20

And I call a datastep on it with a by statement, such as:

data newdata;
set olddata;
by group;
...
run;

then the compiler adds two temporary variables to the PDV: first.group and last.group. When you read any tutorial on the PDV it will tell you that on the first pass of the SET statement, the PDV will look like:

_N_    _ERROR_   FIRST.GROUP   LAST.GROUP   GROUP   VAL
  1          0             1            0       A    10

and that LAST.GROUP is zero because observation 1 is not the last observation in group A.

Herein lies my question: How does SAS know that this is not the last observation?

If SAS is processing olddata row-by-row, how is the PDV aware that the next row holds another group A observation instead of a new group? In other words, it seems like SAS must be using information from previous or future rows to update the FIRST and LAST variables, but I'm not sure how. Is there some trick in how the PDV retains values from row to row when the BY statement is called?

1

1 Answers

1
votes

SAS actually looks ahead to the next record to see if it should set LAST.(var) or not. I haven't been able to find an article explaining that in any detail, unfortunately. I was a bit disappointed to see that even papers like http://www.wuss.org/proceedings09/09WUSSProceedings/papers/ess/ESS-Li1.pdf just gloss over how LAST is detemined.

SAS also looks ahead to see if the END= variable should be set, when specified, and a few other things. It's not just using metadata to determine those; you can remove or modify records without modifying the metadata, and it will still work - and SQL tables that don't have the usual SAS metadata will still allow you to perform normal BY group processing and such.

The FIRST variable doesn't need a look-behind, of course; it remembers where it was after all.

Edit: I crossposted this to SAS-L, and got the same answer - there doesn't seem to be any documentation of the subject, but it must read ahead. See http://listserv.uga.edu/cgi-bin/wa?A1=ind1303a&L=sas-l#8 for example.

Edit2: From SAS-L, Dan Nordlund linked to a paper that confirms this. http://support.sas.com/resources/papers/proceedings12/222-2012.pdf

The paper's logic that confirms the lookahead - look at the number of observations read from the data set.

DATA DS_Sample1;          
Input Sum_Var 
Product;     
Cards;                  
100 3                   
100 2                   
100 1                   
;                 
*With BY statement - reads 3 observations even though it stops after 2.;
DATA DS_Sample2;    
  Set DS_Sample1;           
  by Sum_Var;               
  cnt+1; If CNT > 1 then stop;
Run;
*no BY statement - reads 2 observations as expected;
DATA DS_Sample2;    
  Set DS_Sample1;           
  cnt+1; If CNT > 1 then stop;
Run;
* END statement - again, a lookahead;
DATA DS_Sample2;    
  Set DS_Sample1 end=eof;
  cnt+1; If CNT > 1 then stop;
Run;