2 votes

I could not find any information about this problem; perhaps I was not able to phrase the question correctly.

Let me ask the question with code:
Is this operation

data work.tmp;
    set work.tmp;
    * some changes to data here;
run;

or especially

proc sort data = work.tmp out = work.tmp;
    by x;
run;

dangerous in any way, or considered bad practice in SAS? Note that the input and output dataset names are the same, which is my main point. Does SAS handle this situation correctly, so that there is no risk of ambiguous results when running this kind of data step/procedure?


2 Answers

3 votes

The latter, sorting a dataset into itself, is done fairly frequently. Since sort just re-arranges the dataset, it does no permanent harm (unless you were depending on the previous order, or you use a where clause or rename/keep/drop options that alter the data), so it's not considered bad practice, as long as tmp is in work (or in a libname intended to be used as a working area). SAS creates a temporary file to do the sort and, when it succeeds, deletes the old dataset and renames the temporary file; there is no substantial risk of corruption.
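
For example (an illustrative sketch, not taken from the question), adding a where statement to an in-place sort is one of the cases that does permanent harm, because the filtered-out rows are gone for good:

proc sort data = work.tmp out = work.tmp;
    by x;
    where x > 0;    /* observations failing the filter are permanently removed from work.tmp */
run;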

The former, setting a dataset to itself in a data step, is usually not considered good practice. A data step often does something irreversible - i.e., running it once has a different result than running it twice - so you risk not knowing what state your dataset is in. With sort you can usually rely on knowing, because most of the time you get an obvious error if the data are not properly sorted; with a data step you might never know. As such, each data step should generally produce a new dataset (at least, new to that thread). There are times when overwriting is necessary, or at least when it would be substantially wasteful not to - perhaps a macro that sometimes does a long data step and sometimes doesn't - but usually you can program around it.
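
A minimal sketch of the usual safer pattern, reusing the code from the question (tmp2 is just an illustrative name):

data work.tmp2;
    set work.tmp;
    * some changes to data here;
run;

Re-running this step always starts from the unchanged work.tmp, so the result is the same no matter how many times it runs.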

It's not dangerous in the sense that the file system will get confused, though; as with sort, SAS simply creates a temporary file, fills the new dataset, then deletes the old one and renames the temporary file into its place.

(I leave aside things like modify, which must set a dataset to itself, as that has an obvious answer...)
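
For completeness, a rough sketch of that modify case, which updates the dataset in place by design (the variable x is assumed from the question):

data work.tmp;
    modify work.tmp;
    x = x + 1;    /* an implicit replace writes each updated row back in place */
run;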

2 votes

Some examples of why this is not considered good practice. Say you're working interactively and you create the following dataset named tmp:

data tmp;
  set sashelp.class;
run;

If you were to run the code below twice, it would run fine the first time, but on the second run you would receive a warning, since the variable age no longer exists in that dataset:

data tmp;
  set tmp;
  drop age;
run;

In this case it's a pretty harmless example, and you are lucky enough that SAS only issues a warning. Depending on what the data step is doing, though, it could just as easily be something that generates an error, e.g.:

data tmp;
  set tmp (rename=(age=blah));
run;

Or, even worse, it may generate no ERROR or WARNING at all and simply change the expected results, like the code below:

data tmp;
  set tmp;
  weight = log(weight);
run;

Our intention is to apply a simple log transformation to the weight variable in preparation for modeling, but if we accidentally run the step a second time, we end up calculating log(log(weight)). No warnings or errors are given, and looking at the dataset it will not be immediately obvious that anything is wrong.

IMO, you are much better off creating iterative datasets, i.e. tmp1, tmp2, tmp3, and so on, for every process that updates the dataset in some way. Space is much cheaper than time spent debugging.
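
For example, a sketch of that approach applied to the log transformation above; because the step writes to a new name, running it twice cannot compound the transformation:

data tmp1;
  set tmp;
  weight = log(weight);  /* tmp itself is untouched, so weight is only logged once */
run;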