0
votes

I have a data set :

data data;
   input group $ count;
   datalines;
A 4
A 3
A 2
A 1
B 1
C 1
D 2
D 1
E 1
F 1
G 2
G 1
H 1
;
run;

The first variable is a group identifier, the second count the number of observations by group, the dataset is sorted by group and then by count in descending order.

I want a new variable that take the max of "count" variable for each group such as :

data data;
   input group $ count max_count;
   datalines;
A 4 4
A 3 4
A 2 4
A 1 4
B 1 1
C 1 1
D 2 2
D 1 2
E 1 1
F 1 1
G 2 2
G 1 2
H 1 1
;
run;

The closest I've managed to get is by doing :

data data;
    set data;
    by group;
    if first.group then max_count=count;
run;

But the result is :

data data;
   input group $ count max_count;
   datalines;
A 4 4
A 3 .
A 2 .
A 1 .
B 1 1
C 1 1
D 2 2
D 1 .
E 1 1
F 1 1
G 2 2
G 1 .
H 1 1
;
run;

Any idea how to perform this please?

2

2 Answers

4
votes

You can use sequential DOW loops for this type of processing. A DOW loop is one wherein a SET statement is within the loop -- an arrangement that is not taught in introductory DATA step programming. The first loop will 'measure' the group to compute some group level metric, and the second loop will apply the metric to each member in the group as the rows are output.

data want;
  do _n_ = 1 by 1 until (last.group);
    set have;
    by group;
    if count > maxcount then maxcount = count;
  end;
  do _n_ = 1 to _n_;
    set have;
    output;
  end;
run;

Edit - Simpler

I didn't read the question carefully -- it states

the dataset is sorted by group and then by count in descending order.

data want;
  set data;
  by group descending count;
  if first.group then max_count = count;
  retain max_count;
run;

Explanation

Accepting the sort condition prima facie means the maximum is already known at the first record in the group. Thus, there is no need for a DOW loop to 'measure' the group.

Highlights:

  • descending count in the BY by statement enforces the presumptive condition
    • If the counts are not sorted as expected there will be an ERROR in the Log
  • first.group detects the first row, in a group of rows, where the group value has changed. This is the point at which the count will be maximum if the BY conditions are met.

    • the tracker variable max_count is assigned only at this point
  • retain max_count is a non-executable statement. RETAIN tells the DATA Step compiler to not reset to missing the specified non-dataset variables (at the top of the implicit DATA step loop). Thus, the max_count is unchanged, after being assigned, for all rows in the group.

1
votes

Another way in data step you can also use hash object.

data want;
 if _n_ = 1 then do;
 dcl hash h(multidata:"Y", dataset:"data");
 h.definekey("group");
 h.definedone();
 end;
 set data ;
 by group;
  if  h.find() = 0 then do until(h.find_next() ne 0);
 max_count=sum(max_count,1);
  end;
 run;