1
votes

When running a data step in SAS, why does the output statement seem to 'stop' the iterating of the set statement?

I need to conditionally output duplicate observations. While I can use a plethora of output statements, I'd like it if SAS did its normal iterating and output just created an additional observation.

1) Does the run statement in SAS have a built in output statement? (The way sum statements have a built in retain)

2) What is happening when I ask SAS to output certain observations, in particular after a set statement? Will it set all the values until a condition and then only keep the values I request? Or does it have some similarity to other features, such as the point= option?

3) Is there a similar statement to output that will continue to set the values from a previous data step and then output an additional observation when requested?

For example:

data test;
  do i = 1 to 100;
  output;
  end;
run;

data test2;
  set test;
  if _N_ in (4 8 11) then output;
run;

data test3;
  set test;
  if _N_ in (4 8 11) then output;
  output;
run;

test has 100 observations, test2 has 3 observations, and test3 has 103 observations. This makes me think that there is some kind of built-in output statement for either the run statement or the data step itself.

4
Yes, it's fundamental to understanding data step programming that any data step that does not have an explicit output statement gets an implicit output statement right at the end. Whenever you add a conditional output, like in your tests, the implicit output is removed and you have to consider whether you need to put it back. See v8doc.sas.com/sashtml/lgref/z0194540.htm for more. - Chris Long

4 Answers

3
votes

output in SAS is an explicit instruction to write out a row to the output dataset(s) (all of the dataset(s) named in the data statement, unless you specify a single dataset in output).
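As a minimal sketch of naming a single dataset in output, the step below splits the question's test dataset in two. The dataset names evens and odds are made up for illustration; only test and its variable i come from the question.

```sas
/* Assumes the dataset test (variable i = 1..100) from the question. */
/* evens and odds are hypothetical names chosen for this sketch.     */
data evens odds;
  set test;
  if mod(i, 2) = 0 then output evens;  /* written to evens only */
  else output odds;                    /* written to odds only  */
run;
```

A bare output; in the same step would instead write the row to both evens and odds.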

run, in addition to ending the step (meaning no statements after run are processed until that data step is completed; it is roughly equivalent to the closing } in a C-style language), contains an implicit return statement.

Unless you are using link or goto, return tells SAS to return to the beginning of the data step loop. In addition, return contains an implicit output statement that outputs rows to all datasets named in the data statement, unless there is an explicit output statement anywhere in the data step code, in which case the implicit output is not present.
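A small sketch of that behavior, assuming the test dataset from the question (the dataset name flagged and the variable flag are invented here): because the step has no explicit output, an early return still writes the row before the next iteration starts.

```sas
/* Assumes test (i = 1..100) from the question; flagged and flag are hypothetical. */
data flagged;
  set test;
  if i <= 50 then return;  /* implicit output happens here, then the next iteration begins */
  flag = 1;                /* only reached when i > 50 */
run;
```

Under these assumptions flagged should still contain all 100 observations, with flag missing on the first 50 and equal to 1 on the rest.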

It is return that causes SAS to actually stop processing things after it - not the output. In fact, SAS happily does things after the output statement; they just may not be output anywhere. For example:

data x;
 do row = 1 to 100;
  output;
  row_prev+1;
 end;
run;

That row_prev+1 statement is executed even though it comes after the output statement; its effect can be seen on the next row. In your example where you told it to output just three rows, it still processed the other 97; nothing was output from them. Any side effects of that processing still occur. In fact, the incrementing of _n_ is one of those effects (_n_ is not the row number, but the count of data step iterations).
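One way to check this yourself is to count iterations in a step like test2 and write the total to the log. This is only a sketch: the dataset name test2_check and the variables n_processed and last are illustrative, and test is the dataset from the question.

```sas
/* Assumes test (100 observations) from the question. */
data test2_check;
  set test end=last;                   /* last = 1 on the final observation */
  if _N_ in (4 8 11) then output;      /* only three rows are written */
  n_processed + 1;                     /* sum statement: runs on every iteration */
  if last then put "Iterations processed: " n_processed;
run;
```

The log should report 100 iterations even though the output dataset holds only three observations.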

You should probably read up on the data step itself. SAS documentation includes a lot of information on that, or you could read papers like The Essence of Data Step Programming. This sort of thing is quite common in SGF papers, in part because SAS certification requires understanding this fairly well.

1
votes

The best way to understand everything is by reading about the Program Data Vector (PDV). The short answer to your questions:

  • The output statement is implied at the run boundary of every SAS data step that contains no explicit output statement, whether the step uses set, merge, update, or none of these.

  • The set statement takes the contents of the current row and reads them into the PDV, if you have a single set statement.

  • The output statement simply outputs the contents of the PDV at that moment into your output dataset.

  • SAS only goes to a new row in the source dataset defined by your set statement when it reaches a run boundary, a delete statement, a return statement, or a subsetting if (an if without then) whose condition fails.

  • point= forces SAS to go directly to an observation number defined by a variable; otherwise, it will read every row sequentially, one by one.
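To illustrate the last bullet, here is a hedged sketch of direct access with point=, assuming the test dataset from the question. The names picked and obs_num are made up for the example. Because point= never reaches end of file, the step needs an explicit stop.

```sas
/* Assumes test from the question; picked and obs_num are hypothetical names. */
data picked;
  do obs_num = 4, 8, 11;
    set test point=obs_num;  /* jump straight to that observation number */
    output;                  /* write it out */
  end;
  stop;                      /* required: point= gives no end-of-file to end the step */
run;
```

Unlike test2 in the question, this version reads only the three requested observations instead of iterating over all 100.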

0
votes

It's implicit at the end, unless it's used in one or more places in that data step.

Each time the execution encounters an OUTPUT statement, or the implicit one if it exists, it will output a new row.

0
votes

You are very close.

1) There is an implied OUTPUT at the end of the data step, unless your data step includes an explicit OUTPUT statement. That is why your second step wrote only three observations: its explicit conditional OUTPUT removed the implied one.

2) The OUTPUT statement tells SAS to write the current record to the output dataset.

3) There is no direct way to duplicate records as you describe without using OUTPUT statements, but for some similar problems you can create the duplication on the input side instead of the output side.

For example, if you felt your class didn't have enough eleven-year-olds, you could make two copies of all eleven-year-olds by reading them twice.

data want;
  set sashelp.class 
      sashelp.class(where=(age=11))
  ;
  by name;
run;
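
For comparison, the output-side approach from the question can be sketched as follows, assuming the test dataset defined there. The dataset name test_dup and the flag variable dup are illustrative additions so the extra copies can be told apart.

```sas
/* Assumes test (100 observations) from the question; test_dup and dup are hypothetical. */
data test_dup;
  set test;
  dup = 0;
  output;                        /* every observation is written once */
  if _N_ in (4 8 11) then do;
    dup = 1;
    output;                      /* selected observations are written a second time */
  end;
run;
```

As with test3 in the question, this should yield 103 observations, with the three extra rows marked dup = 1.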