Ultra Fuzzy gaps and islands grouping problem

Question

I have a bunch of test data. Each test was done several dozen times and the average and margin of error for each test calculated in a CTE. In the next step I want to dense_rank each sub-group of tests. Here's an example of a sub-group of data and the rank value I'm looking for:

AvgScore  StdErr    DesiredRank 
65550     2109      1
67188     2050      1
67407     2146      1
67414     1973      1
67486     1889      2
67581     2320      2
67858     1993      2
68509     2029      2
68645     2039      2
68868     2051      2
68902     1943      2
69305     1564      3
69430     2037      3
69509     1594      3
387223    12521     4
389709    12975     4
392200    11344     4
398916    11755     4
399018    11480     5
401144    11021     5
401640    10973     5
403442    10688     5

Notice the margin of error for each score makes many scores ostensibly equivalent. Yes, this causes some rows to technically belong to more than one group but making it part of the nearest group gives the most accurate results.

I looked at Grouping data into fuzzy gaps and islands but this version seems significantly more complex since the switch from one group to another not only requires both rows to be within the margin of error of each other but a switch may occur between equivalent rows.

Here is the most complex case appearing in the example: Row 1 has rows 2-6 within its range and despite row 6 also being within row 1's range, row 5 doesn't have row 1 within its range so a new rank must be started at row 5.

There are only a couple hundred groups in the result set so performance shouldn't be an issue. I'm just struggling with finding logic that can not only look in both directions across the ordered range but recognize that some intermediate row has forced the starting of a new group. Obviously this is simple using a cursor but I have additional processing to do after the ranking and so I'm looking for a SET based solution if any is possible.

I'm on 2017 but if there is a set based non-recursive answer that requires 2019 I'm OK with that.

It isn't what you need exactly but I hope that it could be a starting point: demo It's naive approach - taking first element of group and add stderr, changing group when actualValue > newRange. — Lukasz Szozda
Thanks. Recursive CTE's are not my first choice but they are certainly useful in specific instances. — bielawski

bielawski bielawski · Accepted Answer · 2018-12-16T17:47:00

I don't really like it when the depth of recursion depends on the number of rows in the data as opposed to an actual depth to the data. This solution works OK for me because I have so few rows to rank. All the same, for future readers, if someone has a non-recursive solution I'm happy to mark it as an answer rather than my own.

To demonstrate this IS set based I've added a GROUP BY column. Recursion depth depends on the number of items to be ranked not the number of groups. All groups are processed simultaneously. This code was tested on my production dataset and compared to answers generated by a sequential loop thru the data so I know it works on bigger, more complex data sets.

WITH T AS (
  SELECT * 
    FROM(VALUES ('Type1', 65550  ,2109  ,1),('Type2', 65550  ,2109  ,1),
                ('Type1', 67188  ,2050  ,1),('Type2', 67188  ,2050  ,1),
                ('Type1', 67407  ,2146  ,1),('Type2', 67407  ,2146  ,1),
                ('Type1', 67414  ,1973  ,1),('Type2', 67414  ,1973  ,1),
                ('Type1', 67486  ,1889  ,2),('Type2', 67486  ,1889  ,2),
                ('Type1', 67581  ,2320  ,2),('Type2', 67581  ,2320  ,2),
                ('Type1', 67858  ,1993  ,2),('Type2', 67858  ,1993  ,2),
                ('Type1', 68509  ,2029  ,2),('Type2', 68509  ,2029  ,2),
                ('Type1', 68645  ,2039  ,2),('Type2', 68645  ,2039  ,2),
                ('Type1', 68868  ,2051  ,2),('Type2', 68868  ,2051  ,2),
                ('Type1', 68902  ,1943  ,2),('Type2', 68902  ,1943  ,2),
                ('Type1', 69305  ,1564  ,3),('Type2', 69305  ,1564  ,3),
                ('Type1', 69430  ,2037  ,3),('Type2', 69430  ,2037  ,3),
                ('Type1', 69509  ,1594  ,3),('Type2', 69509  ,1594  ,3)) X(TestType,AvgScore,StdErr,DesiredRank)
), X AS (
  SELECT *,ROW_NUMBER() OVER(PARTITION BY TestType ORDER BY AvgScore) GRow,1 Rnk,AvgScore RAvg, AvgScore+StdErr RMax
  FROM T 
), Y AS (
  SELECT TestType,AvgScore,StdErr,DesiredRank,GRow,Rnk,RAvg,RMax,0 NewRank,0 pravg,0 prmin
    FROM X
   WHERE GRow = 1
   UNION ALL
  SELECT Z.TestType,Z.AvgScore,Z.StdErr,Z.DesiredRank,Z.GRow
        ,CASE WHEN W.NewRank = 1 THEN Y.Rnk+1 ELSE Y.Rnk END Rnk
        ,CASE WHEN W.NewRank = 1 THEN Z.RAvg  ELSE Y.RAvg END RAvg
        ,CASE WHEN W.NewRank = 1 THEN Z.RMax  ELSE Y.RMax END RMin
        ,W.NewRank,Y.RAvg pravg,y.RMax prmin
    FROM Y
   CROSS APPLY (SELECT * FROM X WHERE X.TestType=Y.TestType and X.GRow = Y.GRow+1) Z
   CROSS APPLY (VALUES (CASE WHEN Z.AvgScore <= Y.RMax and Z.AvgScore - Z.StdErr <= Y.RAvg THEN 0 ELSE 1 END)) W(NewRank)
)
SELECT * FROM Y
 ORDER BY TestType,AvgScore;

Ultra Fuzzy gaps and islands grouping problem

2 Answers