1
votes

I have a question in creating a count variable using SAS.

 Q R   
 ---- 
 1 a    
 1 a  
 1 b  
 1 b  
 1 b  
 2 a   
 3 a   
 3 c  
 4 c  
 4 c  
 4 c

I need to create a variable S that counts the rows that has same combination of Q and R. The following will be the output.

 Q       R       S
 -------------------  
 1       a       1  
 1       a       2  
 1       b       1  
 1       b       2  
 1       b       3*  
 2       a       1  
 3       a       1  
 3       c       1   
 4       b       1  
 4       b       2  
 4       b       3  

I tried using following program:

data two;
set one;
S + 1;
by Q R;
if first.Q and first.R then S = 1;
run;

But, this did not run correctly. For example, * will come out as 1 instead of 3. I would appreciate any tips on how to make this counting variable work correctly.

2

2 Answers

3
votes

Very close, your if statement is should be first.R (or change the and to OR but that isn't efficient). I usually prefer to have the increment after the set to 1.

data two;
set one;  
by Q R;
*Retain S; *implicitly retained by using the +1 notation;
if first.R then S = 1;
else S+1;
run;
1
votes

Reese's example is certainly sufficient in this case, but given this was a simple question with a mostly uninteresting answer, I'll instead present a very small variant simply from a programming style standpoint.

data two;
  set one;
  by Q R;
  if first.R then s=0;
  s+1;
run;

This will likely function exactly the same as Reese's code and the original question's code once the first.Q is removed. However, it has two slight differences.

First off, I like to group if first. variable-resetting code (that isn't otherwise dependent on location) in one place, as early as possible in the code (after by statements, where, "early" subsetting if, array, format, length, and retain). This is useful in an organizational standpoint because this code (usually) is something that roughly parallels what SAS does in between data step iterations, but for BY groups - so it's nice to have it at the start of the data step.

Second, I like initializing to zero. That avoids the need for the else conditional, which makes the code clearer; you're not doing anything different on the first row per BY group other than the re-initialization, so it makes sense to not have the increment be conditional.

These are both helpful in a more complex version of this data step; obviously in this particular step it doesn't matter much, but in a larger data step it can be helpful to organize your code more effectively. Imagine this data step:

data two;
  set one;
  by Q R;
  retain Z;
  format Z $12.;
  array CS[15];
  if first.R then do;  *re-init block;
    S=0;
    Z=' ';
  end;

  S+1;  *increment counter;

  do _t = 1 to dim(CS);
    Z = cats(Z,ifc(CS[_t]>0,put(CS[_t],8.),''));
  end;
  keep Q R S Z;
run;

That data step is nicely organized, and has a logical flow, even though almost every step could be moved anywhere else in the data step. Grouping the initializations together makes it more readable, particularly if you always group things that way.