0
votes

This post follow this one: SAS sum observations not in a group, by group Where my minimal example was a bit too minimal sadly,I wasn't able to use it on my data.

Here is a complete case example, what I have is :

data have;
   input group1 group2 group3 $ value;
   datalines;
1 A X 2
1 A X 4
1 A Y 1
1 A Y 3
1 B Z 2
1 B Z 1
1 C Y 1
1 C Y 6
1 C Z 7
2 A Z 3
2 A Z 9
2 A Y 2
2 B X 8
2 B X 5
2 B X 5
2 B Z 7
2 C Y 2
2 C X 1
;
run;

For each group, I want a new variable "sum" with the sum of all values in the column for the same sub groups (group1 and group2), exept for the group (group3) the observation is in.

data want;
   input group1 group2 group3 $ value $ sum;
   datalines;
1 A X 2 8
1 A X 4 6
1 A Y 1 9
1 A Y 3 7
1 B Z 2 1
1 B Z 1 2
1 C Y 1 13
1 C Y 6 8
1 C Z 7 7
2 A Z 3 11
2 A Z 9 5
2 A Y 2 12
2 B X 8 17
2 B X 5 20
2 B X 5 20
2 B Z 7 18
2 C Y 2 1
2 C X 1 2
;
run;

My goal is to use either datasteps or proc sql (doing it on around 30 millions observations and proc means and such in SAS seems slower than those on previous similar computations).

My issue with solutions provided in the linked post is that is uses the total value of the column and I don't know how to change this by using the total in the sub group. Any idea please?

1

1 Answers

2
votes

A SQL solution will join all data to an aggregating select:

proc sql;
  create table want as 
  select have.group1, have.group2, have.group3, have.value
    , aggregate.sum - value as sum 
  from 
    have
  join 
    (select group1, group2, sum(value) as sum
     from have
     group by group1, group2
    ) aggregate
  on
    aggregate.group1 = have.group1
  & aggregate.group2 = have.group2
;

SQL can be slower than hash solution, but SQL code is understood by more people than those that understand SAS DATA Step involving hashes ( which can be faster the SQL. )


data want2;
  if 0 then set have; * prep pdv;

  declare hash sums (suminc:'value');
  sums.defineKey('group1', 'group2');
  sums.defineDone();

  do while (not hash_loaded);
    set have end=hash_loaded;
    sums.ref();                * adds value to internal sum of hash data record;
  end;

  do while (not last_have);
    set have end=last_have;
    sums.sum(sum:sum);         * retrieve group sum.;
    sum = sum - value;         * subtract from group sum;
    output;
  end;

  stop;
run;

SAS documentation touches on SUMINC and has some examples

The question does not address this concept:

  • For each row compute the tier 2 sum that excludes the tier 3 this row is in

A hash based solution would require tracking each two level and three level sums:


data want2;
  if 0 then set have; * prep pdv;

  declare hash T2 (suminc:'value');   * hash for two (T)iers;
  T2.defineKey('group1', 'group2');   * one hash record per combination of group1, group2;
  T2.defineDone();

  declare hash T3 (suminc:'value');             * hash for three (T)iers;
  T3.defineKey('group1', 'group2', 'group3');   * one hash record per combination of group1, group2, group3;
  T3.defineDone();

  do while (not hash_loaded);
    set have end=hash_loaded;
    T2.ref();                * adds value to internal sum of hash data record;
    T3.ref();
  end;

  T2_cardinality = T2.num_items;
  T3_cardinality = T3.num_items;    

  put 'NOTE: |T2| = ' T2_cardinality;
  put 'NOTE: |T3| = ' T3_cardinality;

  do while (not last_have);
    set have end=last_have;
    T2.sum(sum:t2_sum);         
    T3.sum(sum:t3_sum);
    sum = t2_sum - t3_sum;
    output;
  end;

  stop;

  drop t2_: t3:;
run;