What is the significance of 0x8FFF in nvarchar comparisons?

Question

In SQL Server, nvarchar values represent strings of Unicode code-points - I understand, by default, using UTF-16, values beyond 0xFFFF are represented as surrogate-pairs.

I wanted to set a default string value for an nvarchar UDF parameter which would contain special characters. T-SQL does not let you use hex-escape sequences in string literals, you must use the CHAR() OR NCHAR() functions to specify characters by their codepoint values, however you must use a literal for parameter default values: you cannot use NCHAR(). However I remembered that SQL Server also performs implicit conversion from varbinary to nvarchar, so:

CREATE FUNCTION DoSomething(
    @foo nvarchar(50) = '\x0008', -- not supported by T-SQL syntax
    @bar nvarchar(50) = NCHAR(8), -- forbidden: defaults must be a literal
    @baz nvarchar(50) = 0x008     -- success!
)

I wanted to change the parameters to represent a comparison range, and I wanted the default values to represent the widest-possible range of values, thus letting me use Static SQL for a search function without needing OPTION(RECOMPILE) or the now-discredited (@foo IS NULL OR Table.Foo = @foo) pattern.

So I changed my function to this:

CREATE FUNCTION DoSomething(
    @fooMin nvarchar(50) = 0x0000,
    @fooMax nvarchar(50) = 0xFFFF
)
/* SELECT goes here */
WHERE
    Foo BETWEEN @fooMin AND @fooMax

I reasoned 0xFFFF would be high enough to accommodate any (practical) unicode text thrown at the system I was building.

However, to my surprise, the BETWEEN operator always returned false. I wondered if something might be up with the upper-bound operand so I changed it to 0x7FFF and it worked fine.

I tried 0x8FFF next and that also worked.

But 0x9FFF and then 0x9000 failed.

To my knowledge, there is nothing special about the 0x8FFF - 0x9000 boundary in Unicode. Wikipedia reports the Basic Multilingual Plane occupies 0x0000 - 0xFFFF and that 0x900 is simply another block in the CJK area: https://en.wikipedia.org/wiki/Plane_(Unicode)#/media/File:Roadmap_to_Unicode_BMP.svg and the UTF-16 surrogates start at 0xD800 and 0xDC00 - far away from 0x900.

Here's my test-case:

SELECT N'HELLO', 0xFF, ( CASE WHEN N'HELLO' BETWEEN 0x0000 AND 0xFF THEN 'yup' ELSE 'no' END )
UNION ALL
SELECT N'HELLO', 0x0FFF, ( CASE WHEN N'HELLO' BETWEEN 0x0000 AND 0x0FFF THEN 'yup' ELSE 'no' END )
UNION ALL
SELECT N'HELLO', 0x1000, ( CASE WHEN N'HELLO' BETWEEN 0x0000 AND 0x1000 THEN 'yup' ELSE 'no' END )
UNION ALL
SELECT N'HELLO', 0x6000, ( CASE WHEN N'HELLO' BETWEEN 0x0000 AND 0x6000 THEN 'yup' ELSE 'no' END )
UNION ALL
SELECT N'HELLO', 0x6FFF, ( CASE WHEN N'HELLO' BETWEEN 0x0000 AND 0x6FFF THEN 'yup' ELSE 'no' END )
UNION ALL
SELECT N'HELLO', 0x7000, ( CASE WHEN N'HELLO' BETWEEN 0x0000 AND 0x7000 THEN 'yup' ELSE 'no' END )
UNION ALL
SELECT N'HELLO', 0x7FFF, ( CASE WHEN N'HELLO' BETWEEN 0x0000 AND 0x7FFF THEN 'yup' ELSE 'no' END )
UNION ALL
SELECT N'HELLO', 0x8000, ( CASE WHEN N'HELLO' BETWEEN 0x0000 AND 0x8000 THEN 'yup' ELSE 'no' END )
UNION ALL
SELECT N'HELLO', 0x8FFF, ( CASE WHEN N'HELLO' BETWEEN 0x0000 AND 0x8FFF THEN 'yup' ELSE 'no' END )
UNION ALL
SELECT N'HELLO', 0x9000, ( CASE WHEN N'HELLO' BETWEEN 0x0000 AND 0x9000 THEN 'yup' ELSE 'no' END )
UNION ALL
SELECT N'HELLO', 0x9FFF, ( CASE WHEN N'HELLO' BETWEEN 0x0000 AND 0x9FFF THEN 'yup' ELSE 'no' END )
UNION ALL
SELECT N'HELLO', 0xFFFF, ( CASE WHEN N'HELLO' BETWEEN 0x0000 AND 0xFFFF THEN 'yup' ELSE 'no' END )

And my results:

HELLO   0xFF    yup
HELLO   0x0FFF  no
HELLO   0x1000  no
HELLO   0x6000  no
HELLO   0x6FFF  yup
HELLO   0x7000  yup
HELLO   0x7FFF  yup
HELLO   0x8000  no
HELLO   0x8FFF  yup
HELLO   0x9000  no
HELLO   0x9FFF  no
HELLO   0xFFFF  no

So it seems that it isn't just the 0x7FFF - 0x8000 boundary, but other boundaries too.

I wondered if it might be because it's interpreting the binary literals as little-endian instead of big-endian, but then all of the literals ending in **FF would return true because they're greater than N'H'.

You're ignoring collations which define sort ordering in SQL. If you force a binary collation, you should find that most of your range comparisons work — Damien_The_Unbeliever
"I reasoned 0xFFFF would be high enough to accommodate any (practical) unicode text thrown at the system I was building" - only if you don't have to deal with East Asian text, emojis, symbols, etc. Things you need surrogate pairs for in UTF-16. — Remy Lebeau
What about simply defaulting the parameters to NULL and have your function check for that internally? Then you don't need actual string literals in your parameter declarations, and can change your defaults over time without changing your declaration. — Remy Lebeau
@RemyLebeau that would result in a suboptimal SQL Server exectution plan: sommarskog.se/dyn-search.html — Dai

jim31415 jim31415 · Accepted Answer · 2016-08-25T13:01:13

Convert your fields to the same type before doing the comparison test:

select CASE WHEN convert(varbinary(82), N'HELLO') BETWEEN 0x0000 AND 0xffff THEN 'yup' ELSE 'no' END
select CASE WHEN convert(varbinary(82), N'HELLO') BETWEEN 0x4800 AND 0xffff THEN 'yup' ELSE 'no' END
select CASE WHEN convert(varbinary(82), N'HELLO') BETWEEN 0x4800 AND 0x4801 THEN 'yup' ELSE 'no' END

or

declare @x1 nvarchar(2) = 0x4800, @x2 nvarchar(2) = 0xFFFF;
declare @l1 nvarchar(2) = reverse(convert(varbinary(2), @x1));
declare @l2 nvarchar(2) = reverse(convert(varbinary(2), @x2));
select CASE WHEN N'HELLO' BETWEEN @l1 AND @l2 THEN 'yup' ELSE 'no' END

What is the significance of 0x8FFF in nvarchar comparisons?

1 Answers