I have been looking into this myself and reading the Flex mailing list to see if anyone has thought about it. Getting Flex to read Unicode is a complex affair ...
UTF-8 can be handled, but most other encodings (the 16-bit ones, such as UTF-16) lead to massive tables driving the automaton.
A common method so far is:
What I did was simply write patterns that match single UTF-8
characters. They look something like
the following, but you might want to
re-read the UTF-8 specification
because I wrote this so long ago.
You will of course need to combine
these since you want unicode strings,
not just single characters.
UB [\200-\277]
%%
[\300-\337]{UB} { do something }
[\340-\357]{UB}{2} { do something }
[\360-\367]{UB}{3} { do something }
[\370-\373]{UB}{4} { do something }
[\374-\375]{UB}{5} { do something }
Taken from the mailing list.
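The quoted rules pick the number of continuation bytes from the range of the first byte. They follow the older definition of UTF-8 that allowed up to six bytes; RFC 3629 limits a character to four bytes, and the bytes 0xC0, 0xC1 and 0xF5-0xFF can never start a well-formed sequence. As a rough illustration of that lead-byte logic outside Flex, here is a minimal Python sketch. The function name `utf8_sequence_length` is made up for this example, and it only classifies the first byte; it does not check the continuation bytes.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the total length in bytes of a UTF-8 sequence that starts
    with the byte value `lead`, or 0 if `lead` cannot start a sequence
    under RFC 3629 (hypothetical helper, for illustration only)."""
    if 0x00 <= lead <= 0x7F:   # plain ASCII, no continuation bytes
        return 1
    if 0xC2 <= lead <= 0xDF:   # 0xC0 and 0xC1 would only give overlong forms
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:   # 0xF5 and above would exceed U+10FFFF
        return 4
    return 0                   # continuation byte (0x80-0xBF) or invalid lead
```

For example, `utf8_sequence_length("é".encode("utf-8")[0])` gives 2, while the old five- and six-byte lead bytes (octal \370-\375 in the rules above) give 0.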
I may look at creating a proper patch for UTF-8 support after looking into it further. The above solution seems unmaintainable for large .l files, and it is really ugly! You could use similar ranges to create a substitute for the '.' rule that matches all ASCII and UTF-8 characters, but that is still rather ugly.
Hope this helps!
ANY [\x00-\xff] in place of . (dot) is a terrible idea: 1) it is not safe, because it accepts invalid UTF-8 (overruns, non-Unicode planes); 2) it matches only one byte instead of a multibyte UTF-8 sequence; and 3) you need to enable 8-bit mode, which not all lex/flex tools support. To match one valid UTF-8 char, you need:
[\x00-\x7f]|[\xc2-\xdf][\x80-\xbf]|\xe0[\xa0-\xbf][\x80-\xbf]|[\xe1-\xec][\x80-\xbf][\x80-\xbf]|\xed[\x80-\x9f][\x80-\xbf]|[\xee\xef][\x80-\xbf][\x80-\xbf]|\xf0[\x90-\xbf][\x80-\xbf][\x80-\xbf]|[\xf1-\xf3][\x80-\xbf][\x80-\xbf][\x80-\xbf]|\xf4[\x80-\x8f][\x80-\xbf][\x80-\xbf]
– Dr. Alex RE
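If you want to sanity-check the pattern from the comment before putting it into a .l file, one option is to try it on raw bytes outside Flex. The following is only a sketch using Python's `re` module on byte strings; the names `UTF8_CHAR`, `is_one_utf8_char` and `split_utf8` are invented for this example and have nothing to do with Flex itself.

```python
import re

# One well-formed UTF-8 character, transcribed alternative by alternative
# from the pattern in the comment above.
UTF8_CHAR = re.compile(
    rb"[\x00-\x7f]"
    rb"|[\xc2-\xdf][\x80-\xbf]"
    rb"|\xe0[\xa0-\xbf][\x80-\xbf]"
    rb"|[\xe1-\xec][\x80-\xbf][\x80-\xbf]"
    rb"|\xed[\x80-\x9f][\x80-\xbf]"
    rb"|[\xee\xef][\x80-\xbf][\x80-\xbf]"
    rb"|\xf0[\x90-\xbf][\x80-\xbf][\x80-\xbf]"
    rb"|[\xf1-\xf3][\x80-\xbf][\x80-\xbf][\x80-\xbf]"
    rb"|\xf4[\x80-\x8f][\x80-\xbf][\x80-\xbf]"
)


def is_one_utf8_char(data: bytes) -> bool:
    """True if `data` is exactly one well-formed UTF-8 character."""
    return UTF8_CHAR.fullmatch(data) is not None


def split_utf8(data: bytes) -> list:
    """Split `data` into per-character byte strings, much like a scanner
    applying the rule repeatedly; raise ValueError at the first byte
    where no alternative matches."""
    chars = []
    pos = 0
    while pos < len(data):
        match = UTF8_CHAR.match(data, pos)
        if match is None:
            raise ValueError(f"invalid UTF-8 at byte offset {pos}")
        chars.append(match.group())
        pos = match.end()
    return chars
```

With this, `is_one_utf8_char(b"\xc0\x80")` (an overlong encoding) and `is_one_utf8_char(b"\xed\xa0\x80")` (a surrogate) are both rejected, which is exactly what the plain [\x00-\xff] rule fails to do.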