0
votes

I have a bunch of test data. Each test was done several dozen times and the average and margin of error for each test calculated in a CTE. In the next step I want to dense_rank each sub-group of tests. Here's an example of a sub-group of data and the rank value I'm looking for:

AvgScore  StdErr    DesiredRank 
65550     2109      1
67188     2050      1
67407     2146      1
67414     1973      1
67486     1889      2
67581     2320      2
67858     1993      2
68509     2029      2
68645     2039      2
68868     2051      2
68902     1943      2
69305     1564      3
69430     2037      3
69509     1594      3
387223    12521     4
389709    12975     4
392200    11344     4
398916    11755     4
399018    11480     5
401144    11021     5
401640    10973     5
403442    10688     5

Notice the margin of error for each score makes many scores ostensibly equivalent. Yes, this causes some rows to technically belong to more than one group but making it part of the nearest group gives the most accurate results.

I looked at Grouping data into fuzzy gaps and islands but this version seems significantly more complex since the switch from one group to another not only requires both rows to be within the margin of error of each other but a switch may occur between equivalent rows.

Here is the most complex case appearing in the example: Row 1 has rows 2-6 within its range and despite row 6 also being within row 1's range, row 5 doesn't have row 1 within its range so a new rank must be started at row 5.

There are only a couple hundred groups in the result set so performance shouldn't be an issue. I'm just struggling with finding logic that can not only look in both directions across the ordered range but recognize that some intermediate row has forced the starting of a new group. Obviously this is simple using a cursor but I have additional processing to do after the ranking and so I'm looking for a SET based solution if any is possible.

I'm on 2017 but if there is a set based non-recursive answer that requires 2019 I'm OK with that.

2
Do you have SQL Server 2012 or higher?Dávid Laczkó
It isn't what you need exactly but I hope that it could be a starting point: demo It's naive approach - taking first element of group and add stderr, changing group when actualValue > newRange.Lukasz Szozda
Thanks. Recursive CTE's are not my first choice but they are certainly useful in specific instances.bielawski

2 Answers

0
votes

I don't really like it when the depth of recursion depends on the number of rows in the data as opposed to an actual depth to the data. This solution works OK for me because I have so few rows to rank. All the same, for future readers, if someone has a non-recursive solution I'm happy to mark it as an answer rather than my own.

To demonstrate this IS set based I've added a GROUP BY column. Recursion depth depends on the number of items to be ranked not the number of groups. All groups are processed simultaneously. This code was tested on my production dataset and compared to answers generated by a sequential loop thru the data so I know it works on bigger, more complex data sets.

WITH T AS (
  SELECT * 
    FROM(VALUES ('Type1', 65550  ,2109  ,1),('Type2', 65550  ,2109  ,1),
                ('Type1', 67188  ,2050  ,1),('Type2', 67188  ,2050  ,1),
                ('Type1', 67407  ,2146  ,1),('Type2', 67407  ,2146  ,1),
                ('Type1', 67414  ,1973  ,1),('Type2', 67414  ,1973  ,1),
                ('Type1', 67486  ,1889  ,2),('Type2', 67486  ,1889  ,2),
                ('Type1', 67581  ,2320  ,2),('Type2', 67581  ,2320  ,2),
                ('Type1', 67858  ,1993  ,2),('Type2', 67858  ,1993  ,2),
                ('Type1', 68509  ,2029  ,2),('Type2', 68509  ,2029  ,2),
                ('Type1', 68645  ,2039  ,2),('Type2', 68645  ,2039  ,2),
                ('Type1', 68868  ,2051  ,2),('Type2', 68868  ,2051  ,2),
                ('Type1', 68902  ,1943  ,2),('Type2', 68902  ,1943  ,2),
                ('Type1', 69305  ,1564  ,3),('Type2', 69305  ,1564  ,3),
                ('Type1', 69430  ,2037  ,3),('Type2', 69430  ,2037  ,3),
                ('Type1', 69509  ,1594  ,3),('Type2', 69509  ,1594  ,3)) X(TestType,AvgScore,StdErr,DesiredRank)
), X AS (
  SELECT *,ROW_NUMBER() OVER(PARTITION BY TestType ORDER BY AvgScore) GRow,1 Rnk,AvgScore RAvg, AvgScore+StdErr RMax
  FROM T 
), Y AS (
  SELECT TestType,AvgScore,StdErr,DesiredRank,GRow,Rnk,RAvg,RMax,0 NewRank,0 pravg,0 prmin
    FROM X
   WHERE GRow = 1
   UNION ALL
  SELECT Z.TestType,Z.AvgScore,Z.StdErr,Z.DesiredRank,Z.GRow
        ,CASE WHEN W.NewRank = 1 THEN Y.Rnk+1 ELSE Y.Rnk END Rnk
        ,CASE WHEN W.NewRank = 1 THEN Z.RAvg  ELSE Y.RAvg END RAvg
        ,CASE WHEN W.NewRank = 1 THEN Z.RMax  ELSE Y.RMax END RMin
        ,W.NewRank,Y.RAvg pravg,y.RMax prmin
    FROM Y
   CROSS APPLY (SELECT * FROM X WHERE X.TestType=Y.TestType and X.GRow = Y.GRow+1) Z
   CROSS APPLY (VALUES (CASE WHEN Z.AvgScore <= Y.RMax and Z.AvgScore - Z.StdErr <= Y.RAvg THEN 0 ELSE 1 END)) W(NewRank)
)
SELECT * FROM Y
 ORDER BY TestType,AvgScore;
0
votes

It is really a tough one: first I thought I could just recursively increase the Rank if there is a certain missing overlap, by examining the highest Rank at a step so that lower AvgScore's will have less Rank increments. But I recognised that a recursive CTE's recursive element can not have
- aggregation + GROUP BY
- multiple references to the recursive CTE
- a nested CTE defined
so I gave up this direction. It seems data should be "prepared" in a way so that it can be fed to a simple recursion (can not think of any other solution than recursion).
So, my solution is to find the lowest AvgScore belonging to the first AvgScore that is out of range and mark it as a new Rank's first element, and "jump" to that element and repeat so at the end have all rows in a set that are the first row where a new Rank should be assigned ("first" meant by sorting by AvgScore). After that putting together all rows and Rank them.
So if your set is called @UltraFuzzy you can send it through a couple of CTE's:

;WITH UltraFuzzyCTE AS (
    SELECT AvgScore, StdErr, AvgScore - StdErr as RangeMIN, AvgScore + StdErr as RangeMAX
    FROM @UltraFuzzy
)
-- SELECT * FROM UltraFuzzyCTE ORDER BY AvgScore
,FirstOutOfRangeCTE AS (
SELECT
    Original.*
    ,MIN (Helper.AvgScore) as FirstOutOfRange
FROM UltraFuzzyCTE as Original
    LEFT OUTER JOIN UltraFuzzyCTE as Helper
        ON Original.RangeMAX < Helper.AvgScore OR Original.AvgScore < Helper.RangeMIN
GROUP BY Original.AvgScore, Original.StdErr, Original.RangeMIN, Original.RangeMAX
)
-- SELECT * FROM FirstOutOfRangeCTE ORDER BY AvgScore
,NewRankFirstMemberCTE AS (
    SELECT * FROM FirstOutOfRangeCTE WHERE AvgScore = (SELECT MIN (AvgScore) FROM FirstOutOfRangeCTE)
    UNION ALL
    SELECT f.*
    FROM NewRankFirstMemberCTE as n
        INNER JOIN FirstOutOfRangeCTE as f ON n.FirstOutOfRange = f.AvgScore
)
-- SELECT * FROM NewRankFirstMemberCTE ORDER BY AvgScore
,RankCTE AS (
SELECT *, 1 as NewRankFirstMember FROM NewRankFirstMemberCTE
UNION ALL
SELECT *, 0 as NewRankFirstMember FROM FirstOutOfRangeCTE WHERE AvgScore NOT IN (SELECT AvgScore FROM NewRankFirstMemberCTE)
)
-- SELECT * FROM RankCTE ORDER BY AvgScore
SELECT *, SUM (NewRankFirstMember) OVER (ORDER BY AvgScore) as Rank
FROM RankCTE
ORDER BY AvgScore

Definitely it can be simplyfied, for debugging I used SELECT * but unneccessary fields could be thrown away - and less CTE's used. The commented stuff is for step-by-step analysis.