Stop Flex when first matching a TOKEN

Question

I'm writing a flex/bison parser, and need to identify the following pattern using Flex:

begin
/*some code*/
end

The above pattern may appear several times in the code. For example:

begin
/*some code #1*/
end
/*some code #2*/
begin 
/*some code #3*/
end

It is important for me to identify the pattern in the lexer, but when using the following regex:

block "begin"[.\n]*"end"
{block} {return ID_BLOCK}

it catches the first begin and the LAST end. I would like to catch the first end. (Please note #1: flex does not support all regex features, so I cannot use a zero-length lookahead assertion. Please note #2: I think the best way is to stop at the first match of "block" and not continue filling the buffer; I just don't know how to do it.)

****EDIT**** The words begin and end are a simple example of unique words which will look like:

//BEGIN_SPECIAL_CODE
/*relevant code*/
//END_SPECIAL CODE

[.] is a literal ., so that is all it will match. . matches any character except a newline. So neither of those will match your begin ... end block. Please include real code in your question. — rici
Also: how do you know that a particular instance of the three letters end marks the end of a block? What if the block contains the comment /* This comment extends the block */? — rici
[.\n]* recognises dots and newlines; any number of them, but only those two characters. Regex operators are not special inside character classes (and that's not a flex quirk; you'll find it to be true in pretty well all regex libraries). But that's just a detail; I'm sticking with my answer. — rici

rici · Accepted Answer · 2018-04-17T17:39:17

Normally, detecting complex syntactic structures like the blocks shown in the examples is done by the parser, not the lexer. The lexer should just recognise simple lexemes, including the keywords begin and end (as well as comments, identifiers, literals and whatever other lexemes might be present in "code").

If you follow that model, finding the end of the block will be straightforward. Otherwise, you are likely to be confounded by instances of the three letters end occurring within comments, string literals, or even as part of keywords or literals. (`friend class Extender;', to provide a simple C++ example.)
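
As a rough illustration of that model, here is a minimal flex sketch. It is only an assumption about what the surrounding language looks like: the token names (BEGIN_KW, END_KW, IDENT, OTHER) and their numeric codes are hypothetical and would normally come from the bison-generated header, and the comment and string rules assume C-style syntax. The keyword tokens avoid the name BEGIN because flex already defines a macro with that name.

```lex
%{
/* Hypothetical token codes for illustration only; in a real project
   they would come from the bison-generated header instead. */
enum { BEGIN_KW = 258, END_KW, IDENT, OTHER };
%}

%option noyywrap

%%
"/*"([^*]|"*"+[^*/])*"*"+"/"  { /* skip a C-style comment; an "end" inside it is never seen as a keyword */ }
\"([^"\\\n]|\\.)*\"           { return OTHER;    /* string literal, consumed as one lexeme */ }
"begin"                       { return BEGIN_KW; }
"end"                         { return END_KW;   }
[A-Za-z_][A-Za-z0-9_]*        { return IDENT;    /* longest match: "friend" and "Extender" land here */ }
[ \t\r\n]+                    { /* skip whitespace */ }
.                             { return OTHER;    /* any other single character */ }
%%
```

Because flex always prefers the longest match, and the earlier rule on a tie, the letters end inside friend, inside a comment, or inside a string literal never produce END_KW. A grammar rule along the lines of block: BEGIN_KW items END_KW ; (again, hypothetical names) can then delimit each block in the parser, so the lexer never needs a single regex spanning the whole block.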
