3
votes

Suppose I have a directed graph represented in a dataset named links, which has two variables: from_id and to_id. I want to use SAS Data Step to do two things: (1) count the number of nodes, and (2) count the number of edges.

Suppose the links dataset is as shown below.

from_id    to_id
----------------
   1         2
   2         3
   3         1
   3         2

In this example, there are 3 nodes and 4 edges. (We can assume there are no duplicate edges in links). The nodes are 1, 2, and 3. The edges are 1->2, 2->3, 3->1, and 3->2.

Below is a SAS Macro that uses SAS Data Step in conjunction with proc sql in order to count the nodes and edges. It works perfectly, but I wish to use SAS Data Step so that counting the nodes and edges may (potentially) be done faster.

/* display number of nodes and edges for graph */
%macro graph_info(links);
data nodes;
    set &links;
    node_id = from_id;
    output;
    node_id = to_id;
    output;
    keep node_id;
run;

proc sql noprint;
    select count(distinct node_id) into :numNodes from nodes;
quit;
proc datasets lib=work nolist;
    delete nodes;
quit;

proc sql noprint;
    select count(*) into :numEdges from &links;
quit;

%put Nodes: &numNodes;
%put Edges: &numEdges;
%mend;
1
How many nodes are you likely to have? Will they all fit in memory?Simon Nickerson
Up to a few hundred million nodes/edges. The datasets can easily be about 10 GB. Although I have access to machines with more memory, my desktop client has 8 GB of memory.synaptik
If it doesn't fit in memory, you can obviously do it with a single proc sort followed by a single data step.Simon Nickerson

1 Answers

5
votes

If you have enough memory, you may be able to do this with a hash object.

Be warned: this code is untested, as I don't have a SAS installation to hand. However the basic idea should work. You iterate through the data step, adding each node to the hash object, and on the last object you set macro variables to the size of the hash object.

data _null_;
  set links end=lastrec;
  format node_id 8.;
  if _N_ eq 1 then do;
    declare hash h();
    h.defineKey("node_id");
    h.defineDone();
  end;
  node_id = from_id;
  rc=h.find();
  if rc ne 0 then h.add();
  node_id = to_id;
  rc=h.find();
  if rc ne 0 then h.add();
  if lastrec then do;
    call symput('numLinks', put(h.num_items, 8. -L));
    call symput('numEdges', put(_N_, 8. -L));
  end;
run;