12
votes

Almost 5 years ago Joel Spolsky wrote this article, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

Like many, I read it carefully, realizing it was high time I got to grips with this "replacement for ASCII". Unfortunately, 5 years later, I feel I have slipped back into a few bad habits in this area. Have you?

I don't write many specifically international applications, but I have helped build many internet-facing ASP.NET websites, so I guess that's not an excuse.

So for my benefit (and, I believe, that of many others), can I get some input from people on the following:

  • How to "get over" ASCII once and for all
  • Fundamental guidance when working with Unicode.
  • Recommended (recent) books and websites on Unicode (for developers).
  • Current state of Unicode (5 years after Joel's article)
  • Future directions.

I must admit I have a .NET background and so would also be happy for information on Unicode in the .NET framework. Of course this shouldn't stop anyone with a differing background from commenting though.

Update: See this related question also asked on StackOverflow previously.

4 Answers

9
votes

Since I read Joel's article and some other I18n articles, I have always kept a close eye on my character encoding, and it actually works if you do it consistently. If you work in a company where using UTF-8 is the standard, and everybody knows this and does it, it will work.

Here are some interesting articles (besides Joel's article) on the subject:

A quote from the first article, with tips for using Unicode:

  • Embrace Unicode, don't fight it; it's probably the right thing to do, and if it weren't you'd probably have to anyhow.
  • Inside your software, store text as UTF-8 or UTF-16; that is to say, pick one of the two and stick with it.
  • Interchange data with the outside world using XML whenever possible; this makes a whole bunch of potential problems go away.
  • Try to make your application browser-based rather than write your own client; the browsers are getting really quite good at dealing with the texts of the world.
  • If you're using someone else's library code (and of course you are), assume its Unicode handling is broken until proved to be correct.
  • If you're doing search, try to hand the linguistic and character-handling problems off to someone who understands them.
  • Go off to Amazon or somewhere and buy the latest revision of the printed Unicode standard; it contains pretty well everything you need to know.
  • Spend some time poking around the Unicode web site and learning how the code charts work.
  • If you're going to have to do any serious work with Asian languages, go buy the O'Reilly book on the subject by Ken Lunde.
  • If you have a Macintosh, run out and grab Lord Pixel's Unicode Font Inspection tool. Totally cool.
  • If you're really going to have to get down and dirty with the data, go attend one of the twice-a-year Unicode conferences. All the experts go and if you don't know what you need to know, you'll be able to find someone there who knows.
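
The "pick one of the two and stick with it" tip is easier to appreciate once you see that the same text has different byte lengths in each encoding. The following is only a minimal Python sketch with a made-up sample string; it is not taken from the quoted article.

```python
# Hypothetical sample: two ASCII letters, e-acute, a CJK character, an emoji.
text = "ab\u00e9\u65e5\U0001F600"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no BOM

code_points = len(text)          # number of Unicode code points
utf8_length = len(utf8_bytes)    # 1 to 4 bytes per code point
utf16_length = len(utf16_bytes)  # 2 or 4 bytes per code point

# Both encodings round-trip losslessly, provided the reader uses the same
# encoding as the writer. That is the whole point of sticking with one.
assert utf8_bytes.decode("utf-8") == text
assert utf16_bytes.decode("utf-16-le") == text

print(code_points, utf8_length, utf16_length)
```

Neither encoding is "one unit per character", so any code that counts bytes or code units to measure text is making an assumption worth double-checking.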

4
votes

I spent a while working with search engine software. You wouldn't believe how many websites serve up content with HTTP headers or meta tags that lie about the encoding of the pages. Often, you'll even get a document that contains both ISO-8859 characters and UTF-8 characters.

Once you've battled through a few of those sorts of issues, you start taking the proper character encoding of data you produce really seriously.
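
To give a flavour of the defensive code this leads to, here is a rough Python sketch of decoding a page whose declared encoding may be wrong. The helper `decode_page` and its fallback order are my own assumptions for illustration, not what any particular search engine does.

```python
from typing import Tuple


def decode_page(raw: bytes, declared: str = "utf-8") -> Tuple[str, str]:
    """Return (text, encoding actually used). Hypothetical helper."""
    # Try the declared encoding first, then UTF-8, both strictly, so that
    # invalid byte sequences raise instead of silently producing garbage.
    for candidate in (declared, "utf-8"):
        try:
            return raw.decode(candidate, errors="strict"), candidate
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding name: try the next one
    # ISO-8859-1 maps every byte to a code point, so this never fails,
    # but the text will be mojibake if the guess is wrong.
    return raw.decode("iso-8859-1"), "iso-8859-1"


# The header claims UTF-8, but the body is really ISO-8859-1.
lying_page = "caf\u00e9".encode("iso-8859-1")
text, used = decode_page(lying_page, declared="utf-8")
```

A document that mixes both encodings cannot be repaired by a single decode call like this; that case needs per-fragment heuristics or a dedicated detection library.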

3
votes

The .NET Framework uses the Windows default encoding for storing strings, which turns out to be UTF-16. If you don't specify an encoding when you use most text I/O classes, you will write UTF-8 with no BOM, and read by first checking for a BOM and then assuming UTF-8 (I know for sure that StreamReader and StreamWriter behave this way). This is pretty safe for "dumb" text editors that won't understand a BOM, but kind of cruddy for smarter ones that could display UTF-8 if they knew to expect it, or for the situation where you're actually writing characters outside the standard ASCII range.

Normally this is invisible, but it can rear its head in interesting ways. Yesterday I was working with someone who was using XML serialization to serialize an object to a string using a StringWriter, and he couldn't figure out why the encoding was always UTF-16. Since a string in memory is going to be UTF-16 and that is enforced by .NET, that's the only thing the XML serialization framework could do.

So, when I'm writing something that isn't just a throwaway tool, I specify a UTF-8 encoding with a BOM. Technically, in .NET you will always be accidentally Unicode-aware, but only if your user knows to detect your encoding as UTF-8.
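
For readers outside .NET, the effect of the BOM can be sketched in a few lines of Python. This only illustrates the byte-level idea and is not the StreamWriter API; `utf-8-sig` is Python's name for "UTF-8 with a BOM", and the sample text is made up.

```python
import codecs

text = "na\u00efve"  # contains one non-ASCII character

with_bom = text.encode("utf-8-sig")  # UTF-8 preceded by the bytes EF BB BF
without_bom = text.encode("utf-8")   # same payload, no marker

# A reader can detect the marker cheaply before decoding anything.
has_bom = with_bom.startswith(codecs.BOM_UTF8)

# "utf-8-sig" strips a leading BOM if present and otherwise acts like "utf-8".
decoded_a = with_bom.decode("utf-8-sig")
decoded_b = without_bom.decode("utf-8-sig")

# A plain "utf-8" decode keeps the BOM as a U+FEFF character in the text.
leaked = with_bom.decode("utf-8")
```

The BOM tells a reader which encoding was used, at the cost of three extra bytes that a BOM-unaware reader will treat as content.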

It makes me cry a little every time I see someone ask, "How do I get the bytes of a string?" and the suggested solution uses Encoding.ASCII.GetBytes() :(
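
The reason is that ASCII simply cannot represent most characters, so anything outside its range gets thrown away. As far as I know, Encoding.ASCII.GetBytes() substitutes "?" for such characters; the Python sketch below mimics that with `errors="replace"` and is meant as an analogy, not as the .NET behaviour verbatim.

```python
text = "Zo\u00eb \u2192 caf\u00e9"  # "Zoë → café"

# Every character ASCII cannot represent becomes "?" and is gone for good.
lossy = text.encode("ascii", errors="replace")
recovered = lossy.decode("ascii")

# UTF-8 can represent every Unicode character, so the round trip is exact.
roundtrip = text.encode("utf-8").decode("utf-8")

print(recovered)
print(roundtrip == text)
```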

2
votes

Rule of thumb: if you never munge or look inside a string and instead treat it strictly as a blob of data, you'll be much better off.

Even doing something as simple as splitting words or lowercasing strings becomes tough if you want to do it "the Unicode way".

And if you want to do it "the Unicode way", you'll need an awfully good library. This stuff is incredibly complex.
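
As a small taste of why, here is a Python sketch of two traps in "simple" comparisons: case mapping and normalization. The helper `caseless_key` is hypothetical and deliberately simplified; it ignores locale-specific rules (Turkish dotless i, for example), which is exactly where a proper library earns its keep.

```python
import unicodedata


def caseless_key(s: str) -> str:
    """Simplified comparison key: normalize to NFC, then case-fold."""
    return unicodedata.normalize("NFC", s).casefold()


# Trap 1: lowercasing is not enough. German sharp s only matches "ss"
# after case folding, not after lower().
lower_equal = "Stra\u00dfe".lower() == "STRASSE".lower()
folded_equal = caseless_key("Stra\u00dfe") == caseless_key("STRASSE")

# Trap 2: the same visible character can be one code point or two.
composed = "\u00e9"     # e-acute as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent
raw_equal = composed == decomposed
normalized_equal = caseless_key(composed) == caseless_key(decomposed)
```

Here `lower_equal` and `raw_equal` come out false while the normalized comparisons come out true, even though a human reader would call both pairs "the same text".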