How does one or more data step OUTPUT statements work and can it be implicit?

Question

When running a data step in SAS, why does the output statement seem to 'stop' the iterating of the set statement?

I need to conditionally output duplicate observations. While I can use a plethora of output statements, I'd like SAS to do its normal iterating and have output simply create an additional observation.

1) Does the run statement in SAS have a built-in output statement? (The way sum statements have a built-in retain)

2) What is happening when I ask SAS to output certain observations, in particular after a set statement? Will it set all the values until a condition and then only keep the values I request? Or does it have some similarity with other features, such as the point= option?

3) Is there a similar statement to output that will continue to set the values from a previous data step and then output an additional observation when requested?

For example:

data test;
  do i = 1 to 100;
  output;
  end;
run;

data test2;
  set test;
  if _N_ in (4 8 11) then output;
run;

data test3;
  set test;
  if _N_ in (4 8 11) then output;
  output;
run;

test has 100 observations, test2 has 3 observations, and test3 has 103 observations. This makes me think that there is some kind of built-in output statement for either the run statement or the data step itself.

Yes, it's fundamental to understanding data step programming that any data step that does not have an explicit output statement gets an implicit output statement right at the end. Whenever you add a conditional output, like in your tests, the implicit output is removed and you have to consider whether you need to put it back. See v8doc.sas.com/sashtml/lgref/z0194540.htm for more. - Chris Long

Joe · Accepted Answer · 2017-01-30T15:58:19

output in SAS is an explicit instruction to write out a row to the output dataset(s) (all of the dataset(s) named in the data statement, unless you specify a single dataset in output).
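As a minimal sketch of the difference between a bare output and an output that names one dataset, the step below writes to two datasets at once. The dataset names all_rows and flagged are hypothetical, chosen only for illustration; test is rebuilt here so the example runs on its own.

```sas
/* Rebuild the 100-row input from the question. */
data test;
  do i = 1 to 100;
    output;
  end;
run;

/* Two output datasets; each OUTPUT names the one it writes to. */
data all_rows flagged;      /* hypothetical dataset names */
  set test;
  if _N_ in (4 8 11) then output flagged;  /* written to flagged only */
  output all_rows;                          /* written to all_rows only */
run;
```

Under these assumptions all_rows should end up with 100 observations and flagged with 3. A bare output; in the same step would instead write the row to both datasets.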

run, in addition to ending the step (meaning no statements after run are processed until that data step has completed; it is roughly equivalent to the closing } of a block in a C-style programming language), contains an implicit return statement.

Unless you are using link or goto, return tells SAS to return to the beginning of the data step loop. In addition, return contains an implicit output statement that writes a row to every dataset named in the data statement. That implicit output exists only when the data step code contains no explicit output statement; as soon as one appears anywhere in the step, the implicit output is gone.
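A small sketch of the implicit output carried by return, assuming the same 100-row test dataset as in the question (rebuilt here so the example runs on its own). The dataset names implicit_out and early_return and the variables doubled and flag are hypothetical.

```sas
/* Rebuild the 100-row input from the question. */
data test;
  do i = 1 to 100;
    output;
  end;
run;

/* No explicit OUTPUT anywhere: the implicit RETURN at the end of the step
   writes one row per iteration. */
data implicit_out;          /* hypothetical dataset name */
  set test;
  doubled = i * 2;
run;

/* An explicit RETURN also writes the row, because this step still contains
   no OUTPUT statement. The statements after it are skipped for that iteration. */
data early_return;          /* hypothetical dataset name */
  set test;
  if i > 50 then return;    /* row is written, then the next iteration starts */
  flag = 1;                 /* only reached when i <= 50 */
run;
```

Both datasets should contain 100 observations; in early_return, flag would be 1 for the first 50 rows and missing for the rest. Adding a single output statement anywhere in either step would switch the implicit output off for the whole step.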

It is return that causes SAS to actually stop processing things after it - not the output. In fact, SAS happily does things after the output statement; they just may not be output anywhere. For example:

data x;
 do row = 1 to 100;
  output;
  row_prev+1;
 end;
run;

That row_prev+1 statement is executed even though it comes after the output statement; its effect can be seen on the next row. In your example where you told it to output just three rows, it still processed the other 97; nothing was output from them. Any side effects of that processing still occur. In fact, the incrementing of _n_ is one of those effects (_n_ is not the row number, but the count of data step iterations).
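One way to check that all 100 iterations really run in a step like test2 is to count them and write the count to the log. This is only a sketch: the dataset name test2_check and the variables processed and last are hypothetical, and test is rebuilt so the example runs on its own.

```sas
/* Rebuild the 100-row input from the question. */
data test;
  do i = 1 to 100;
    output;
  end;
run;

data test2_check;           /* hypothetical dataset name */
  set test end=last;        /* last is 1 while the final observation is processed */
  if _N_ in (4 8 11) then output;
  processed + 1;            /* sum statement: runs on every iteration, after any OUTPUT */
  if last then put 'Iterations processed: ' processed;
run;
```

The log line should report 100 iterations even though only 3 rows are written. Because the increment happens after the output, the three stored values of processed would be 3, 7 and 10, that is, the number of iterations completed before each written row.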

You should probably read up on the data step itself. SAS documentation includes a lot of information on that, or you could read papers like The Essence of Data Step Programming. This sort of thing is quite common in SGF papers, in part because SAS certification requires understanding this fairly well.
