4
votes

I am working with sas to manipulate some dataset. I am using the data step to apply some condition to keep columns and filter on some values. The problem is that I would like to filter on columns that in the end I will not need, so I would like to apply, first, the where clause, and then the keep clause. The problem is that sas executes, first the keep clause and then where, so when it is trying to apply the where instruction it doesn't find the columns on which it should be applied on. This is my code:

data newtable;
set mytable(where=(var1<value)
keep var2);
end;

In this case the error is the var1 cannot be found since I decided to keep only var2. I know I can do two data steps, but I would like to do everything in one step only. How can I achieve this?

2
You stated "apply some condition to keep columns". Does this mean your manipulations will also be examining the values in the data set in order to determine which variable(s) to keep ?Richard
+1, this always bugs me. The DROP/KEEP option is used to define which variables are read into the PDV. The WHERE option is used to subset data before a record is read into the PDV (right?). Definitely feels like you should be able to reference variables in the WHERE option that will not be read into the PDV. A hack workaround is to make a view that implements the WHERE option, then read the view with a KEEP option, but ugh.Quentin

2 Answers

3
votes

This can be achieved by using the keep data set option on the output data set, e.g.(untested):

data newtable(keep=var2);
  set mytable(where=(var1<value));
end;

Alternatively a keep statement can be used, e.g. (untested):

data newtable;
  set mytable(where=(var1<value));
  keep var2;
end;
1
votes

@Amir has the right of it. @Quentin is worried about the efficiency of it all. Unless your where clause is highly involved, then this will be your most efficient method.

data newtable;
  set mytable(where=(var1<value) keep=var1 var2);
  keep var2;
end;

Yes, var1 is read into the PDV, but you have limited the PDV to only the variables wanted in the output and needed in the where clause.

If you just want to run a where and do no other data step logic, then PROC SQL is just as efficient as the method above.

proc sql noprint;
create table newtable as
select var2
   from mytable
   where var1<value;
quit;