4
votes

It seems that flex doesn't support UTF-8 input. Whenever the scanner encounters a non-ASCII character, it stops scanning as if it had hit an EOF.

Is there a way to force flex to eat my UTF-8 chars? I don't want it to actually match UTF-8 chars, just eat them when using the '.' pattern.

Any suggestion?

EDIT

The simplest solution would be:

ANY [\x00-\xff]

and use 'ANY' instead of '.' in my rules.

2
If it works, great :) Proper Unicode support would be nice, though. – Aiden Bell
Agreed. I'm running into a different issue now: flex checks for "if( yychar <= YYEOF ) { /* scanning ended */ }", but my UTF-8 chars are negative :( – Martin Cote
You will have tons of problems. Looking at the internals, it would be a mission to rewrite the ecs code, the table generator, and so on. Might be better to start from scratch :P Wanna help? – Aiden Bell
Arrrgh. This is awful. I posted a question on the flex mailing list; we'll see what those guys have to say. – Martin Cote
Just my 2 cents: using ANY [\x00-\xff] in place of . (dot) is a terrible idea: 1) it is not safe, since it accepts invalid UTF-8 (overlong sequences, non-Unicode planes); 2) it matches only one byte instead of a multibyte UTF-8 sequence; and 3) you need to enable 8-bit scanning, which not all lex/flex tools support. To match one valid UTF-8 character, you need [\x00-\x7f]|[\xc2-\xdf][\x80-\xbf]|\xe0[\xa0-\xbf][\x80-\xbf]|[\xe1-\xec][\x80-\xbf][\x80-\xbf]|\xed[\x80-\x9f][\x80-\xbf]|[\xee\xef][\x80-\xbf][\x80-\xbf]|\xf0[\x90-\xbf][\x80-\xbf][\x80-\xbf]|[\xf1-\xf3][\x80-\xbf][\x80-\xbf][\x80-\xbf]|\xf4[\x80-\x8f][\x80-\xbf][\x80-\xbf] – Dr. Alex RE

2 Answers

7
votes

I have been looking into this myself and reading the Flex mailing list to see if anyone has thought about it. Getting flex to read Unicode is a complex affair ...

UTF-8 encoding can be handled, but most other encodings (the 16-bit ones) will lead to massive tables driving the automata.

A common method so far is:

What I did was simply write patterns that match single UTF-8 characters. They look something like the following, but you might want to re-read the UTF-8 specification, because I wrote this a long time ago.
You will of course need to combine these, since you want Unicode strings, not just single characters.

UB [\200-\277]
%%
[\300-\337]{UB}          { /* do something */ }
[\340-\357]{UB}{2}       { /* do something */ }
[\360-\367]{UB}{3}       { /* do something */ }
[\370-\373]{UB}{4}       { /* do something */ }
[\374-\375]{UB}{5}       { /* do something */ }

Taken from the mailing list.

I may look at creating a proper patch for UTF-8 support after investigating further. The above approach seems unmaintainable for large .l files, and it is really ugly! You could use similar ranges to build a '.'-substitute rule that matches all ASCII and UTF-8 characters, but that is still rather ugly.
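Such a '.'-substitute might look like the following definitions-section sketch (same octal ranges as above, and like them it accepts some byte sequences that are not valid UTF-8; unlike '.', it also matches newline):

```lex
UB        [\200-\277]
UTF8CHAR  [\000-\177]|[\300-\337]{UB}|[\340-\357]{UB}{2}|[\360-\367]{UB}{3}|[\370-\373]{UB}{4}|[\374-\375]{UB}{5}
%%
{UTF8CHAR}    { /* eat one character, ASCII or multibyte */ }
```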

hope this helps!

1
votes

Writing a negated character class might also help:

[\n \t]     return WHITESPACE;
[^\n \t]    return NON_WHITESPACE;
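For context, a complete (if minimal) scanner file built on that idea might look like the sketch below; the token codes are assumptions here, since they would normally come from a parser-generated header:

```lex
%option noyywrap
%{
/* hypothetical token codes; usually supplied by the parser */
enum { WHITESPACE = 256, NON_WHITESPACE };
%}
%%
[\n \t]     { return WHITESPACE; }
[^\n \t]    { return NON_WHITESPACE; }
%%
```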