6
votes

Consider the following example:

/* Create two not too interesting datasets: */
Data ones (keep = A);
Do i = 1 to 3;
A = 1;
output;
End;
run;

Data numbers;
Do B = 1 to 5;
output;
End;
Run;

/* The interesting step: */
Data together;
Set ones numbers;
if B = 2 then A = 2;
run;

So dataset ones contains one variable A with 3 observations, all ones and dataset numbers contains one variable (B) with 5 observations: the numbers 1 to 5. I expect the resulting dataset together to have two columns (A and B) and the A column to read (vertically) 1, 1, 1, . , 2, . , . , .

However, when executing the code I find that column A reads 1, 1, 1, . , 2, 2, 2 , 2

Apparently the 2 created in the fifth observation is retained all the way down for no apparent reason. What is going on here?

(For the sake of completeness: when I split the last data step into two as below:

Data together;
set ones numbers;
run;
Data together;
set together;
if B = 2 then A = 2;
run;

it does do what I expect.)

1

1 Answers

7
votes

Yes, any variable that is defined in a SET, MERGE, or UPDATE statement is automatically retained (not set to missing at the top of the data step loop). You can effectively ignore that with

output;
call missing(of <list of variables to clear out>);
run;

at the end of your data step.

This is how MERGE works for many-to-one merges, by the way, and the reason that many-to-many merges don't usually work the way you want them to.


The difference between the 'together' and the 'separate' cases is that in the separate case, you have two data sets with different variables. If you are running this in interactive mode, ie SAS Program Editor or Enhanced Editor (not EG or batch mode), you can use the data step debugger to see this a little more clearly. You would see the following:

At the end of the last row of the ones dataset:

i A B
3 1 .

Notice B exists, but is missing. Then it goes back to the top of the data step loop. All three variables are left alone since they're all from the data sets. Then it attempts to read from ones one more time, which generates:

i A B
. . .

Then it realizes it cannot read from ones, and starts to read from numbers. At the end of the first row of the numbers dataset:

i A B
. . 1

Then it goes to the top, again changes nothing; then it reads in a 2 for B.

i A B
. . 2

Then it sets A to 2, per your program:

i A B
. 2 2

Then it returns to the start of the data step loop again.

i A B
. 2 2

Then it reads in B=3:

i A B
. 2 3

Then it continues looping, for B=4, 5.

Now, compare that to the single dataset. It will be nearly the same (with a small difference at the switch between datasets that does not yield a different result). Now we go to the step where A=2 B=2:

i A B
. 2 2

Now when the data step reads in the next row, it has all three variables on it. So it yields:

i A B
. . 3

Since it read in A=. from the row, it is setting it to missing. In the one-data-step version, it didn't have a value for A to read in, so it didn't replace the 2 with missing.