I have a dataset with a hierarchical codelist variable. The logics of hierarchy is determined by the LEVEL variable and the prefix structure of the CODE character variable. There are 6 (code length from 1 to 6) "aggregate" levels and the terminal level (code length of 10 characters).
I need to update the nodes variable (count of terminal nodes - the aggregate levels do not count in the "higher" aggregates, only the terminal nodes) - so the sum of counts in one level, for example every level 5's total count is the same as every level 6's. And I need to calculate (sum up) the weight to "higher" level nodes.
NOTE: I offset the output table's NODES and WEIGHT variable so you can see better what I am talking about (just add up the numbers in each offset and you get the same value).
EDIT1: there can be multiple observations with the same code. A unique observations is a combination of 3 variables code + var1 + var2.
Input table:
ID level code var1 var2 nodes weight myIndex
1 1 1 . . 999 999 999
2 2 11 . . 999 999 999
3 3 111 . . 999 999 999
4 4 1111 . . 999 999 999
5 5 11111 . . 999 999 999
6 6 111111 . . 999 999 999
7 10 1111119999 01 1 1 0.1 105,5
8 10 1111119999 01 2 1 0.1 109,1
9 6 111112 . . 999 999 999
10 10 1111120000 01 1 1 0.5 95,0
11 5 11119 . . 999 999 999
12 6 111190 . . 999 999 999
13 10 1111901000 01 1 1 0.1 80,7
14 10 1111901000 02 1 1 0.2 105,5
Desired output table:
ID level code var1 var2 nodes weight myIndex
1 1 1 . . 5 1.0 98,1
2 2 11 . . 5 1.0 98,1
3 3 111 . . 5 1.0 98,1
4 4 1111 . . 5 1.0 98,1
5 5 11111 . . 3 0.7 98,5
6 6 111111 . . 2 0.2 107,3
7 10 1111119999 01 1 1 0.1 105,5
8 10 1111119999 01 2 1 0.1 109,1
9 6 111112 . . 1 0.5 95,0
10 10 1111120000 01 1 1 0.5 95,0
11 5 11119 . . 2 0.3 97,2
12 6 111190 . . 2 0.3 97,2
13 10 1111901000 01 1 1 0.1 80,7
14 10 1111901000 02 1 1 0.2 105,5
And here's the code I came up with. It works just like I wanted, but man, it is really slow. I need something way faster, because this is a part of a webservice which has to run "instantly" on request. Any suggestions on speeding up the code, or any other solutions are welcome.
%macro doit;
data temporary;
set have;
run;
%do i=6 %to 2 %by -1;
%if &i = 6 %then %let x = 10;
%else %let x = (&i+1);
proc sql noprint;
select count(code)
into :cc trimmed
from have
where level = &i;
select code
into :id1 - :id&cc
from have
where level = &i;
quit;
%do j=1 %to &cc.;
%let idd = &&id&j;
proc sql;
update have t1
set nodes = (
select sum(nodes)
from temporary t2
where t2.level = &x and t2.code like ("&idd" || "%")),
set weight = (
select sum(weight)
from temporary t2
where t2.level = &x and t2.code like ("&idd" || "%"))
where (t1.level = &i and t1.code like "&idd");
quit;
%end;
%end;
%mend doit;
Current code based on @Quentin's solution:
data have;
input ID level code : $10. nodes weight myIndex;
cards;
1 1 1 . . .
2 2 11 . . .
3 3 111 . . .
4 4 1111 . . .
5 5 11111 . . .
6 6 111111 . . .
7 10 1111110000 1 0.1 105.5
8 10 1111119999 1 0.1 109.1
9 6 111112 . . .
10 10 1111129999 1 0.5 95.0
11 5 11119 . . .
12 6 111190 . . .
13 10 1111900000 1 0.1 80.7
14 10 1111901000 1 0.2 105.5
;
data want (drop=_:);
*hash table of terminal nodes;
if (_n_ = 1) then do;
if (0) then set have (rename=(code=_code weight=_weight));
declare hash h(dataset:'have(where=(level=10) rename=(code=_code weight=_weight myIndex=_myIndex))');
declare hiter iter('h');
h.definekey('ID');
h.definedata('_code','_weight','_myIndex');
h.definedone();
end;
set have;
*for each non-terminal node, iterate through;
*hash table of all terminal nodes, looking for children;
if level ne 10 then do;
call missing(weight, nodes, myIndex);
do _n_ = iter.first() by 0 while (_n_ = 0);
if trim(code) =: _code then do;
weight=sum(weight,_weight);
nodes=sum(nodes,1);
myIndex=sum(myIndex,_myIndex*_weight);
end;
_n_ = iter.next();
end;
myIndex=round(myIndex/weight,.1);
end;
output;
run;