7
votes

According to this, SQL Server 2K5 uses UCS-2 internally. It can store UTF-16 data in UCS-2 (with appropriate data types, nchar etc), however if there is a supplementary character this is stored as 2 UCS-2 characters.

This brings the obvious issues with the string functions, namely that what is one character is treated as 2 by SQL Server.

I am somewhat surprised that SQL Server is basically only able to handle UCS-2, and even more so that this is not fixed in SQL 2K8. I do appreciate that some of these characters may not be all that common.

Aside from the functions suggested in the article, any suggestions on best approach for dealing with the (broken) string functions and UTF-16 data in SQL Server 2K5.

3
What string functions are broken please?gbn
LEN will return the number of UCS-2 characters in the string, not the number of UTF-16 characters. SUBSTRING will split UTF-16 characters in half. Same goes for LEFT and RIGHT. UPPER and LOWER would also probably break. REVERSE would definitely break. CHARINDEX and PATINDEX also. Not sure about DIFFERENCE and STUFF. So a lot of them....David Cameron
Thanks for pointing this out. The fact that it doesn't support ALL Unicode characters means that some UTF-16 string values (e.g. from Windows or .NET) are not valid to dump into SQL Server without verification. In order for any application to be bug-free and technically correct (how RARE bug-causing characters are doesn't make a bit of difference as far as correctness goes), ALL strings must be validated to contain UCS-2-compatible characters before being stored in SQL Server. Wonderful! Way to make my job that much harder Microsoft.Triynko
@Triynko stripping out all but values 0 - 65535 is unnecessary, inappropriate, and probably foolish. Supplementary Characters are not the only characters to act in non-intuitive ways. Combining characters that are perfectly valid in UCS-2, and sort and compare correctly, also have issues with many / most of the built-in string functions.Solomon Rutzky

3 Answers

9
votes

SQL Server 2012 now supports UTF-16 including surrogate pairs. See http://msdn.microsoft.com/en-us/library/ms143726(v=sql.110).aspx, especially the section "Supplementary characters".

So one fix for the original problem is to adopt SQL Server 2012.

3
votes

The string functions work fine with unicode character strings; the ones that care about the number of characters treat a two-byte character as a single character, not two characters. The only ones to watch for are len() and datalength(), which return different values when using unicode. They return the correct values of course - len() returns the length in characters, and datalength() returns the length in bytes. They just happen to be different because of the two-byte characters.

So, as long as you use the proper functions in your code, everything should work transparently.

EDIT: Just double-checked Books Online, unicode data has worked seemlessly with string functions since SQL Server 2000.

EDIT 2: As pointed out in the comments, SQL Server's string functions do not support the full Unicode character set due to lack of support for parsing surrogates outside of plane 0 (or, in other words, SQL Server's string functions only recognize up to 2 bytes per character.) SQL Server will store and return the data correctly, however any string function that relies on character counts will not return the expected values. The most common way to bypass this seems to be either processing the string outside SQL Server, or else using the CLR integration to add Unicode aware string processing functions.

-2
votes

something to add, that I just learned the hard way:

if you use an "n" field in oracle (im running 9i), and access it via the .net oracleclient, it seems that only parameterized sql will work... the N'string' unicode prefix doesnt seem to do the trick if you have some inline sql.

and by "work", I mean: it will lose any characters not supported by the base charset. So in my instances, english chars work fine, cyrillic turns into question marks/garbage.

this is a fuller discussion on the subject: http://forums.oracle.com/forums/thread.jspa?threadID=376847

Wonder if the ORA_NCHAR_LITERAL_REPLACE variable can be set in the connection string or something.