I have the following mock-up table:

#n a b group
1  1 1  1
2  1 2  1
3  2 2  1
4  2 3  1
5  3 4  2
6  3 5  2
7  4 5  2   

I am using SAS for this problem. The group column groups together the rows that are interconnected through a and b. Here is why these rows are in the same group:

  • rows 1 and 2 are in group 1 since they both have a = 1
  • row 3 is in group 1 since b = 2 in both row 2 and row 3, and row 2 is in group 1
  • rows 3 and 4 are in group 1 since a = 2 in both rows and row 3 is in group 1

The overall logic is that if row x contains the same value of a or b as row y, then row x belongs to the same group as row y. Following the same logic, rows 5, 6 and 7 are in group 2.

Is there any way to make an algorithm to find these groups?

What do you want to happen if there is another observation with a=4 and b=2? Would that mean that there is only one group? Or do you only want to process the rows in order A,B so that it will come between rows 6 and 7 and cause there to be four groups? - Tom

Are a and b always increasing for each successive row? If yes, then Richard's answer will work, but if not then this is a much trickier problem that will involve making multiple passes through your data to identify connected components. - user667489

1 Answer

Case I:

Grouping is defined by item linkage within contiguous rows.

Use the LAG function to examine the prior values of both variables. Increase the group value only if both have changed. For example:

group + ( a ne lag(a) and b ne lag(b) );
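
For context, here is a minimal sketch of that sum statement inside a complete DATA step, using the mock-up table from the question. The dataset names have and want and the helper variables prev_a and prev_b are assumptions for illustration. The LAG calls are made unconditionally on their own lines so that each queue is updated exactly once per row.

```sas
/* Sample data from the question; the dataset name HAVE is an assumption */
data have;
  input n a b;
  datalines;
1 1 1
2 1 2
3 2 2
4 2 3
5 3 4
6 3 5
7 4 5
;
run;

data want;
  set have;
  /* LAG returns missing on the first row, so both comparisons are true
     there and the group counter starts at 1 */
  prev_a = lag(a);
  prev_b = lag(b);
  /* Sum statement: group is retained automatically and starts at 0 */
  group + (a ne prev_a and b ne prev_b);
  drop prev_a prev_b;
run;
```

On the seven sample rows this should give group 1 for rows 1 to 4 and group 2 for rows 5 to 7. It only links each row to the one immediately before it, so it relies on the linked rows being contiguous.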

Case II:

Grouping is determined by linking pairs that share a value in the same slot (a or b), across all of the data.

From grouping pairs by either key

General statement of problem:
-----------------------------
Given: P = p{i} = (p{i,1},p{i,2}), a set of pairs (key1, key2).

Find: The distinct groups, G = g{x}, of P,
      such that each pair p in a group g has this property:

      key1 matches key1 of any other pair in g.
      -or-
      key2 matches key2 of any other pair in g.

Demonstrates

… an iterative way using hashes. Two hashes maintain the groupId assigned to each key value. Two additional hashes are used to maintain group mapping paths. When the data can be passed without causing a mapping, then the groups have been fully determined. A final pass is done, at which point the groupIds are assigned to each pair and the data is output to a table.
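
As a rough illustration of the Case II idea, here is a sketch that uses a different technique from the four-hash iterative method described above: a union-find (disjoint-set) structure kept in a single hash, with a second hash used only to number the groups. All names here (have, want, uf, gid, find_root, root_a, root_b) are assumptions, and the code is not taken from the linked demonstration. Each a value and each b value is treated as a node, prefixed with 'a' or 'b' so that a = 2 and b = 2 stay distinct, and every row merges the sets containing its two nodes.

```sas
/* Sample data from the question; the dataset name HAVE is an assumption */
data have;
  input n a b;
  datalines;
1 1 1
2 1 2
3 2 2
4 2 3
5 3 4
6 3 5
7 4 5
;
run;

data want(keep=n a b group);
  length node parent root_a root_b $20 group 8;

  if _n_ = 1 then do;
    /* uf: node -> parent; a root is a node whose parent is itself */
    declare hash uf();
    uf.defineKey('node');
    uf.defineData('parent');
    uf.defineDone();

    /* gid: root node -> sequential group number */
    declare hash gid();
    gid.defineKey('node');
    gid.defineData('group');
    gid.defineDone();

    /* Pass 1: for every row, merge the sets holding its a and b values */
    do until (last_row);
      set have end=last_row;
      node = cats('a', a); link find_root; root_a = node;
      node = cats('b', b); link find_root; root_b = node;
      if root_a ne root_b then do;
        node = root_b;
        parent = root_a;
        rc = uf.replace();
      end;
    end;
  end;

  /* Pass 2: re-read the rows and number groups in order of first appearance */
  set have;
  node = cats('a', a);
  link find_root;
  if gid.find() ne 0 then do;
    group = gid.num_items + 1;
    rc = gid.add();
  end;
  return;

find_root:
  /* On return, node holds the root of the set containing the incoming node */
  if uf.find() ne 0 then do;
    parent = node;
    rc = uf.add();
  end;
  do while (parent ne node);
    node = parent;
    rc = uf.find();
  end;
return;
run;
```

On the sample rows this should reproduce the group column from the question. Because the linkage is resolved over all rows before any group number is assigned, row order does not matter: adding the observation a=4, b=2 mentioned in the comments would connect both groups and leave a single group.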