2 votes

I have a dataset with a hierarchical codelist variable. The logic of the hierarchy is determined by the LEVEL variable and the prefix structure of the CODE character variable. There are 6 "aggregate" levels (code lengths 1 to 6) and a terminal level (code length of 10 characters).

I need to update the NODES variable (the count of terminal nodes; aggregate levels do not count towards the "higher" aggregates, only terminal nodes do), so the sum of counts is the same on every level - for example, the total count across all level-5 nodes equals the total count across all level-6 nodes. And I need to sum the WEIGHT up to the "higher" level nodes in the same way.

NOTE: I offset the output table's NODES and WEIGHT values so you can see more clearly what I am talking about: the numbers in each offset column add up to the same value. For example, level-5 code 11111 gets nodes=3 because its level-6 children 111111 and 111112 contribute 2 and 1.

EDIT1: There can be multiple observations with the same code. A unique observation is a combination of three variables: code + var1 + var2.

Input table:

ID   level code         var1  var2  nodes  weight  myIndex
1    1     1            .     .     999    999     999
2    2     11           .     .     999    999     999
3    3     111          .     .     999    999     999
4    4     1111         .     .     999    999     999
5    5     11111        .     .     999    999     999
6    6     111111       .     .     999    999     999
7   10     1111119999   01    1     1      0.1     105,5
8   10     1111119999   01    2     1      0.1     109,1
9    6     111112       .     .     999    999     999
10  10     1111120000   01    1     1      0.5      95,0
11   5     11119        .     .     999    999     999
12   6     111190       .     .     999    999     999
13  10     1111901000   01    1     1      0.1      80,7
14  10     1111901000   02    1     1      0.2     105,5

Desired output table:

ID   level code         var1  var2  nodes    weight              myIndex
1    1     1            .     .     5        1.0                  98,1
2    2     11           .     .     5        1.0                  98,1
3    3     111          .     .     5        1.0                  98,1
4    4     1111         .     .     5        1.0                  98,1
5    5     11111        .     .       3          0.7              98,5
6    6     111111       .     .         2            0.2         107,3
7   10     1111119999   01    1           1               0.1    105,5  
8   10     1111119999   01    2           1               0.1    109,1
9    6     111112       .     .         1            0.5          95,0
10  10     1111120000   01    1           1               0.5     95,0
11   5     11119        .     .       2          0.3              97,2
12   6     111190       .     .         2            0.3          97,2
13  10     1111901000   01    1           1               0.1     80,7
14  10     1111901000   02    1           1               0.2    105,5

And here's the code I came up with. It works just like I wanted, but man, it is really slow. I need something way faster, because this is part of a web service that has to run "instantly" on request. Any suggestions on speeding up the code, or any other solutions, are welcome.

%macro doit;

%do i=6 %to 1 %by -1;
    /* level 6 is summed from the terminal nodes (level 10);
       every other level is summed from the level directly below it */
    %if &i = 6 %then %let x = 10;
    %else %let x = %eval(&i+1);

    /* refresh the snapshot so this level sums the values just written
       to the level below, and the UPDATE never reads the table it modifies */
    data temporary;
        set have;
    run;

    proc sql noprint;
        select count(code)
        into :cc trimmed
        from have
        where level = &i;

        select code
        into :id1 - :id&cc
        from have
        where level = &i;
    quit;

    %do j=1 %to &cc.;

        %let idd = &&id&j;

        proc sql;
        update have t1
            set nodes = (
                       select sum(nodes)
                       from temporary t2
                       where t2.level = &x and t2.code like ("&idd" || "%")),
                weight = (
                       select sum(weight)
                       from temporary t2
                       where t2.level = &x and t2.code like ("&idd" || "%"))
            where (t1.level = &i and t1.code = "&idd");
        quit;
    %end;
%end;
%mend doit;

%doit

Current code based on @Quentin's solution:

data have;
input ID level code : $10. nodes weight myIndex;
cards;
1    1  1            .   .    .
2    2  11           .   .    .
3    3  111          .   .    .
4    4  1111         .   .    .
5    5  11111        .   .    .
6    6  111111       .   .    .
7   10  1111110000   1   0.1  105.5
8   10  1111119999   1   0.1  109.1
9    6  111112       .   .    .
10  10  1111129999   1   0.5  95.0
11   5  11119        .   .    .
12   6  111190       .   .    .
13  10  1111900000   1   0.1  80.7
14  10  1111901000   1   0.2  105.5
;

data want (drop=_:);

    *hash table of terminal nodes;
    if (_n_ = 1) then do;
        if (0) then set have (rename=(code=_code weight=_weight myIndex=_myIndex)); *defines the _: host variables in the PDV;
        declare hash h(dataset:'have(where=(level=10) rename=(code=_code weight=_weight myIndex=_myIndex))');
        declare hiter iter('h');
        h.definekey('ID');
        h.definedata('_code','_weight','_myIndex');
        h.definedone();
    end;

    set have;

    *for each non-terminal node, iterate through;
    *hash table of all terminal nodes, looking for children;
    if level ne 10 then do;
        call missing(weight, nodes, myIndex);

        do _n_ = iter.first() by 0 while (_n_ = 0);
            if trim(code) =: _code then do;  
                weight=sum(weight,_weight);
                nodes=sum(nodes,1);
                myIndex=sum(myIndex,_myIndex*_weight);
            end;
            _n_ = iter.next();
        end;
        myIndex=round(myIndex/weight,.1);
    end;
    output;
run;
How big is the real data? I'm thinking about some ugly brute force methods. – Quentin
What do you mean by 'add the numbers in each offset'? Could you use a multilevel format here? – Reeza
@Quentin the real data is dynamically changing, but always between 4,000 and 5,000 observations, so it is not really big. – Martin
@Reeza I was just trying to better explain/visualize what I am trying to achieve with that offset thing. The misaligned/offset values are from the same hierarchy level, so you can see better that they always add up to the same number (weight or node count). I am not aware of how multilevel formats work, but I will check it out. – Martin
@Quentin, user667489: the average CPU time of Quentin's solution is 6.84 sec compared to 0.82 sec for user667489's approach. That is an 88% CPU time reduction, and the absolute real times show an almost identical 88% relative improvement as well. Tom's approach took more than half a minute. I am quite impressed by the speed of hash table lookups. – Martin

6 Answers

2 votes

Below is a brute-force hash approach that does a Cartesian product similar to the SQL one. Load a hash table with the terminal nodes. Then read through the dataset of nodes and, for each node that is not a terminal node, iterate through the hash table, identifying all of its child terminal nodes.

I think the approach @joop describes may be more efficient, as this approach doesn't take advantage of the tree structure, so there is a lot of re-calculating. With 5000 records and 3000 terminal nodes, this would do 2000*3000 comparisons. But it might not be that slow, since the hash table is in memory and you're not going to have excessive I/O...

data want (drop=_:);

   *hash table of terminal nodes;
   if (_n_ = 1) then do;
      if (0) then set have (rename=(code=_code weight=_weight));
      declare hash h(dataset:'have(where=(level=10) rename=(code=_code weight=_weight))');
      declare hiter iter('h');
      h.definekey('ID');
      h.definedata('_code','_weight');
      h.definedone();
   end;

   set have;

   *for each non-terminal node, iterate through;
   *hash table of all terminal nodes, looking for children;
   if level ne 10 then do;
      call missing(weight, nodes);

      do _n_ = iter.first() by 0 while (_n_ = 0);
         if trim(code) =: _code then do;  
           weight=sum(weight,_weight);
           nodes=sum(nodes,1);
         end;
         _n_ = iter.next();
      end;
   end;
   output;
run;
2 votes

Here's an alternative hash approach.

Rather than using a hash object to do a Cartesian join, this adds the nodes and weight from each level-10 node to each of the 6 applicable parent nodes as it goes along. This may be marginally faster than Quentin's approach, as there are no redundant hash lookups.

It takes a bit longer than Quentin's approach when constructing the hash object, and uses a bit more memory, as each terminal node is added 6 times with different keys and existing entries often have to be updated, but afterwards each parent node only has to look up its own individual stats, rather than looping through all the terminal nodes, which is a substantial saving.

Weighted stats are possible as well, but you have to update both loops, not just the second one.

data want;
if 0 then set have;  * defines the host variables in the PDV;
dcl hash h();
h.definekey('code');
h.definedata('nodes','weight','myIndex');
h.definedone();
length t_code $10;
* pass 1: add the stats of each terminal node to all 6 of its prefix (parent) keys;
do until(eof);
  set have(where = (level = 10)) end = eof;
  t_nodes = nodes;
  t_weight = weight;
  t_myindex = weight * myIndex;  * weighted contribution - divided by total weight in pass 2;
  do _n_ = 1 to 6;
    t_code = substr(code,1,_n_);
    if h.find(key:t_code) ne 0 then h.add(key:t_code,data:t_nodes,data:t_weight,data:t_myIndex);
    else do;
      * find() filled nodes/weight/myIndex with the running totals - add this node and store back;
      nodes + t_nodes;
      weight + t_weight;
      myIndex + t_myIndex;
      h.replace(key:t_code,data:nodes,data:weight,data:MyIndex);
    end;
  end;
end;
* pass 2: each parent node just looks up its own accumulated stats;
do until(eof2);
  set have end = eof2;
  if level ne 10 then do;
    h.find();
    myIndex = round(MyIndex / Weight,0.1);
  end;
  output;
end;
drop t_:;
run;
1 vote

It seems pretty simple. Just join back with itself and count/sum.

proc sql ;
create table want as
 select a.id, a.level, a.code , a.var1, a.var2
      , count(b.id) as nodes
      , sum(b.weight) as weight
 from have a
 left join have b
 on a.code eqt b.code
 and b.level=10
 group by 1,2,3,4,5
 order by 1
;
quit;

If you don't want to use the EQT operator then you can use the SUBSTR() function instead.

 on a.code = substr(b.code,1,a.level)
 and b.level=10
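
If the weighted myIndex from the desired output is needed as well, the same join can carry it as a weighted average. Here is a minimal sketch along those lines (an extension of the query above, not part of the original answer, and it leaves aside the duplicate-code caveat from EDIT1):

proc sql;
create table want as
 select a.id, a.level, a.code, a.var1, a.var2
      , count(b.id) as nodes
      , sum(b.weight) as weight
      /* weighted mean of the terminal nodes */
      , round(sum(b.weight * b.myIndex) / sum(b.weight), 0.1) as myIndex
 from have a
 left join have b
   on a.code eqt b.code
  and b.level = 10
 group by 1,2,3,4,5
 order by 1;
quit;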
1 vote

Since you're using SAS, how about using proc summary to do the heavy lifting here? No Cartesian joins required!

One advantage of this option over some of the others is that it's a bit easier to generalise if you want to calculate lots of more complex statistics for multiple variables.

data have;
input ID level code : $10. nodes weight myIndex;
format myIndex 5.1;
cards;
1    1  1            .   .    .
2    2  11           .   .    .
3    3  111          .   .    .
4    4  1111         .   .    .
5    5  11111        .   .    .
6    6  111111       .   .    .
7   10  1111110000   1   0.1  105.5
8   10  1111119999   1   0.1  109.1
9    6  111112       .   .    .
10  10  1111129999   1   0.5  95.0
11   5  11119        .   .    .
12   6  111190       .   .    .
13  10  1111900000   1   0.1  80.7
14  10  1111901000   1   0.2  105.5
;
run;


data v_have /view = v_have;
  set have(where = (level = 10));
  array lvl[6] $6;
  do i = 1 to 6;
    lvl[i]=substr(code,1,i);
  end;
  drop i;
run;

proc summary data = v_have;
  class lvl1-lvl6;
  var nodes weight;
  var myIndex /weight = weight;  * weighted mean of myIndex;
  ways 1;                        * one-way summaries: one output row per value of each lvl variable;
  output out = summary(drop = _:) sum(nodes weight)= mean(myIndex)=;
run;

data v_summary /view = v_summary;
  set summary;
  length code $10;
  code = cats(of lvl:);
  drop lvl:;
run;

* update the aggregate rows of HAVE in place, matching the summary rows by code;
data have;
  modify have v_summary;
  by code;
  replace;
run;

In theory a hash of hashes might also be an appropriate data structure, but that would be extremely complicated for a relatively small benefit. I might have a go anyway just as a learning exercise...

0 votes

One approach (I think) would be to make the Cartesian product, and find all of the terminal nodes that are a "match" to each of the nodes, then sum the weights.

Something like:

data have;
  input ID level code : $10. nodes weight ;
  cards;
1    1  1            .   .
2    2  11           .   .
3    3  111          .   .
4    4  1111         .   .
5    5  11111        .   .
6    6  111111       .   .
7   10  1111110000   1   0.1
8   10  1111119999   1   0.1
9    6  111112       .   .
10  10  1111129999   1   0.5
11   5  11119        .   .
12   6  111190       .   .
13  10  1111900000   1   0.1
14  10  1111901000   1   0.2
;


proc sql;
  select min(id) as id
       , min(level) as level 
       , a.code
       , count(b.weight) as nodes   /*count of terminal nodes*/
       , sum(b.weight) as weight    /*sum of weights of terminal nodes*/
    from 
      have as a 
     ,(select code , weight
       from have
       where level=10   /*selects terminal nodes*/
       ) as b
    where a.code eqt b.code        /*EQT is equivalent to =: */
    group by a.code
  ;
quit;

I'm not sure that is correct, but it gives the desired results for the sample data.

0 votes

This is the SQL needed to estimate the parent record for every record. It only uses string functions (POSITION and LENGTH), so it should be adaptable to any dialect of SQL, maybe even SAS (the CTE might need to be rewritten as a subquery or a view). The idea is to:

  • add a parent_id field to the dataset
  • find the record whose code is the longest proper prefix of our code
  • and use its id as the value for our parent_id
  • (after that) update the records with the sum(nodes) and sum(weight) of their direct children (the ones with child.parent_id = this.id)

BTW: I could have used LEVEL instead of LENGTH(code); the data is a bit redundant in this respect.


WITH sub AS (
        SELECT id, length(code) AS len
        , code
        FROM tree)
UPDATE tree t
SET parent_id = s.id
FROM sub s
WHERE length(t.code) > s.len AND POSITION (s.code IN t.code) = 1
AND NOT EXISTS (
        -- no candidate prefix nx lies between s and t, so s is the longest prefix
        SELECT *
        FROM sub nx
        WHERE nx.len > s.len AND nx.len < length(t.code)
        AND POSITION (nx.code IN t.code) = 1
        )
        ;

SELECT * FROM tree
ORDER BY parent_id DESC NULLS LAST
        , id
        ;

After finding the parents, the whole table should be updated (repeatedly, level by level from the bottom up) from itself, like:


-- PREPARE omg( integer) AS
UPDATE tree  t
SET nodes = s.nodes ,  weight = s.weight
FROM ( SELECT parent_id , SUM(nodes) AS nodes , SUM(weight) AS weight
        FROM tree GROUP BY parent_id) s
WHERE s.parent_id = t.id
        ;

In SAS, this could probably be done by sorting on {0-parent_id, id} and doing some retain+summation magic (my SAS is a bit rusty in this area).
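
For what it's worth, here is a minimal SAS sketch of that bottom-up roll-up, using a hash accumulator rather than retain logic. It assumes a tree dataset in which parent_id has already been populated (e.g. by the SQL above) and in which only the terminal nodes carry nodes/weight values; sorting by descending level guarantees every child is processed before its parent:

proc sort data=tree;
   by descending level id;
run;

data want(drop=_:);
   if _n_ = 1 then do;
      * running totals keyed by node id;
      dcl hash agg();
      agg.definekey('_id');
      agg.definedata('_nodes','_weight');
      agg.definedone();
      call missing(_id, _nodes, _weight);
   end;
   set tree;
   * pick up whatever the children have already rolled up into this node;
   if agg.find(key:id) = 0 then do;
      nodes  = sum(nodes, _nodes);
      weight = sum(weight, _weight);
   end;
   * pass the totals of this node up to its parent;
   if not missing(parent_id) then do;
      if agg.find(key:parent_id) = 0 then
         agg.replace(key:parent_id, data:sum(_nodes,nodes), data:sum(_weight,weight));
      else
         agg.add(key:parent_id, data:nodes, data:weight);
   end;
   output;
run;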


UPDATE: if only the leaf nodes have non-NULL (non-missing) values for {nodes, weight}, the aggregation can be done in one sweep for the entire tree, without first computing the parent_ids:

UPDATE tree  t
SET nodes = s.nodes ,  weight = s.weight
FROM ( SELECT p.id , SUM(c.nodes) AS nodes , SUM(c.weight) AS weight
        FROM tree p
        JOIN tree c ON c.lev > p.lev AND POSITION (p.code IN c.code ) = 1
        GROUP BY p.id
        ) s
WHERE s.id = t.id
        ;

An index on {lev,code} will probably speed things up (assuming an index on id already exists).
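
In SAS terms that would be a composite index. A sketch, assuming the table lives in a SAS dataset named tree with variables lev and code:

proc sql;
   /* composite index to support the level + prefix lookups */
   create index levcode on tree(lev, code);
quit;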