0
votes

Similar Questions

Similar questions have been asked before, but always have specific qualities to the data which allow a more targeted "split it up and just sort by this part" approach, which does not work when you don't know the structure of the data in the column - or even the column, frankly. In other words, not a generic, "Natural" sort order - something roughly equivalent to SELECT * FROM [parts] ORDER BY [part_category] DESC, [part_number] NATURAL DESC

My Situation

I have a DataView in C# that has a Sort parameter for specifying the ORDER BY that would be used by ADO, and a requirement to sort by a column using a 'natural' sort algorithm. I could in theory do just about anything from creating a different column to sort by (based on the column I'd like to have 'sorted naturally') to not sorting in SQL, but rather sorting the result set in code afterwards. I'm looking for the best balance of flexibility, efficiency, preparation effort, and maintainability. I would benefit somewhat from being able to sort such data after retrieval (in C#) or completely within a stored procedure.

In my mind, and according to customer statements so far, 'Natural' sort order will mean treating upper and lower case letters equivalently, and considering the magnitude of a number, rather than the ASCII value of its digits (that is x90 comes before x100). Jeff Atwood had a pretty decent discussion of this, but it didn't address SQL sorting. That said, these are my thoughts:

  • Incorporating the magnitude awareness while also retaining the ability to sort alpha characters ASCII-betically may also come in handy
  • Non-alphanumeric characters would probably have to be sorted ASCII-betically regardless
  • Decimal point awareness might be more effort than it's worth, since most of the time periods and commas in alphanumeric fields are treated as merely punctuation/separators, and only denote fractional portions when they're representing a float field

My Question

What is a reasonably flexible, reasonably generic, reasonably efficient, approach to implementing a natural sort algorithm for SQL? Weighing the pros and cons, which is the best approach? Is there another option?

  • Is there a native SQL way to ORDER BY [field] NATURAL DESC or something?
  • PURE SQL function to create a 'sort equivalent' - Could be used to create some sort of second, possibly indexed, 'sort value' column, or called from a stored procedure, or specified in an 'ORDER BY' clause - but how to write it efficiently? (loops? is there a set based solution at all??)
  • CLR SQL Function - usability benefits of pure SQL function, but using procedural language, like C# (algorithm should be no problem, but can it be made to go faster than a pure SQL sort [set based??] implementation?) Also, could be referenced and utilized in C# if efficient enough.
  • Avoid SQL Server - since parsing an arbitrary number of numbers amid all sorts of other characters is really best suited for looping or recursion, and T-SQL is not well suited for looping or recursion (though TECHNICALLY supported, All I see is 'DON'T USE LOOPS!!!' and 'CTE's are even worse!!!')
  • Some sort of comparator in SQL(??) - doesn't seem like SQL lends itself to that sort of sorting and I don't see a way to specify a comparator to use - so I guess this won't work...

I have values at least as varied as the following:

100s455t
200s400
d399487
S0000005.2
d400400
d99222
cg9876
D550-9-1
CL2009-3-27
f2g099
f2g100
f2g1000
f2g999
cg 8837
99s1000f

These should be sorted as follows:

99s1000f
100s455t
200s400
cg9876
cg 8837
CL2009-3-27
D550-9-1
d99222
d399487
d400400
f2g099
f2g100
f2g999
f2g1000
S0000005.2
3
@TabAlleman - I checked this alphabetical order tool and it agrees with my original ordering (space after numbers) when doing the closest to what I consider to be a "natural sort" that it performs - but again, punctuation is arguable in far more venues than the magnitude of numbers vs the ASCII value of the characters that make them up (which has more consensus) - I think I'll leave it, butI appreciate you bringing it upCode Jockey
All I see is 'DON'T USE LOOPS!!!' and 'CTE's are even worse!!!' CTEs are not worse if used correctly, but don't use loops.Hogan
@CodeJockey - I would not parse a string with a CTE. I think the most efficient would be a CLR user defined function in C# msdn.microsoft.com/en-us/library/w2kae45k%28v=vs.90%29.aspxHogan
The reason you are having issues here is because you have multiple values stored in a single column. This violates 1NF and causes all sorts of issues, sorting being one of them.Sean Lange
A CLR will be the best performance wise for sure. check out this article for a number of different splitters and performance tests of each. sqlperformance.com/2012/07/t-sql-queries/split-stringsSean Lange

3 Answers

1
votes

Create a sort column. That way you can keep all the usual mechanisms in place that you use today to sort. You can index that column for example.

Split the string into parts. You need to pad number parts with zeroes to the maximum possible number length.

For example CL2009-3 would become CL|000002009|-|000000003.

This way the usual case-insensitive SQL Server collation sort behavior will create the right order.

Doing a natural sort dynamically prevents indexing, requires the entire data set to move into the app for each query and is resource intensive.

Instead, simply update the sort column whenever you update the base column.

0
votes

OK. Here is something that is almost what you are looking for. The only piece it can't deal with is when there are some characters then a space and then numbers (cg 8837 and cg9876). Would be good if in the future you could post the ddl and sample data so we can work with it.

with Something (SomeValue) as(
    select '100s455t' union all
    select '200s400' union all
    select 'd399487' union all
    select 'S0000005.2' union all
    select 'd400400' union all
    select 'd99222' union all
    select 'cg9876' union all
    select 'D550-9-1' union all
    select 'CL2009-3-27' union all
    select 'f2g099' union all
    select 'f2g100' union all
    select 'f2g1000' union all
    select 'f2g999' union all
    select 'cg 8837' union all
    select '99s1000f'
)

select * 
from Something
order by
    cast(
    case when patindex('%[A_Za-z]%', SomeValue) = 1 then '99999999999'
         when patindex('%[A_Za-z]%', SomeValue) = 0 then SomeValue
         else substring(SomeValue, 1, patindex('%[A_Za-z]%', SomeValue) - 1)
    end as bigint),
    SomeValue
-1
votes

I would recommend to "stay away from the SQL Server". While technically you can implement everything using t-sql or clr function, SQL server remains a single non scalable unit of the infrastructure. Using its CPU resources to do a heavy computing is going to inevitably impact performance of the system in general. And in the end, SQL server will be performing sort using almost exactly the same algorithm that you will use to sort your array on the application side, i.e. looking at each item in array and comparing it to the others until it finds an appropriate position.

Of course I am assuming that, if you try to implement this type of sort on the SQL server side, you will be copying the data into a temporary table before performing the sort, to avoid data locks etc.