Return value of Lex input function

Question

According to POSIX Lex, the function input shall return zero when end-of-file is reached:

int input(void)
    Returns the next character from the input, or zero on end-of-file. It shall obtain input from the stream pointer yyin, although possibly via an intermediate buffer. Thus, once scanning has begun, the effect of altering the value of yyin is undefined. The character read shall be removed from the input stream of the scanner without any processing by the scanner.

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/lex.html

However, at least in Flex it seems like input sometimes returns -1 (EOF) instead of 0. Also, some examples I have seen rely on EOF instead of 0, for instance in the book "Lex and Yacc":

https://books.google.se/books?id=fMPxfWfe67EC&pg=PA152&lpg=PA152&dq=flex+input+returns+eof&source=bl&ots=RdLSgm5LEO&sig=sXajxhnlydQLz_GcZZuIaUONYlk&hl=sv&sa=X&ved=0ahUKEwjE58OwidDZAhWLiSwKHVdSD8kQ6AEIYDAF#v=onepage&q=flex%20input%20returns%20eof&f=false

Do I really need to test for 0 and EOF after using the function input?

rici · Accepted Answer · 2018-03-03T19:44:47

I'm afraid you do need to check for both values.

As far as I know, Posix has always required input() to return 0 on end of input, which was based on the behaviour of the original AT&T lex. While this specification made it easy to redefine input() to accept input from a string rather than an external file, it also makes it essentially impossible to distinguish between a NUL byte in the input stream and an EOF. This wasn't really a problem for the original lex implementation, which did not attempt to handle input streams with NUL bytes. (Posix does not require a text file to be able to include NUL bytes, so it's not a problem for Posix either.)
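The NUL-versus-EOF ambiguity can be shown with a small sketch. This is not lex or flex code: `struct source`, `next_posix`, `next_flex` and the two counting helpers are hypothetical names for an in-memory reader, written once with each end-of-input convention.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical in-memory source; none of these names come from lex or flex. */
struct source {
    const unsigned char *data;
    size_t len;
    size_t pos;
};

/* POSIX-style convention: 0 signals end of input. */
static int next_posix(struct source *s)
{
    return s->pos < s->len ? s->data[s->pos++] : 0;
}

/* Historic flex-style convention: EOF (negative) signals end of input. */
static int next_flex(struct source *s)
{
    return s->pos < s->len ? s->data[s->pos++] : EOF;
}

/* Count the bytes a caller sees before it believes the input has ended.
 * The source is passed by value, so each count starts from the beginning. */
static size_t count_until_zero(struct source s)
{
    size_t n = 0;
    while (next_posix(&s) != 0)
        n++;
    return n;
}

static size_t count_until_eof(struct source s)
{
    size_t n = 0;
    while (next_flex(&s) != EOF)
        n++;
    return n;
}
```

Given the five bytes `'a', 'b', 0, 'c', 'd'`, the zero-terminated loop stops after two bytes because it takes the embedded NUL for end of input, while the EOF-terminated loop sees all five.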

Flex, which aspired to handle arbitrary 8-bit input, redefined the input() API to return EOF (a negative number, usually -1) to indicate end of input. That was its behaviour until version 2.6.1, released on March 1, 2016, which changed the interface to conform with Posix. At least, I assume that is why the interface was changed. I can't find any documentation explaining the change, and the commit doesn't provide any information.

That change was not reflected in the documentation, which continues to include example code written to the old specification. (This code is very similar to the sample code in John Levine's books.) A badly-titled bug report complaining about the change was closed without commentary. The change does not appear in the Change Log.

In any event, Posix is unlikely to change at this point, so other implementations of the lex tool may implement either the historic flex convention or the Posix requirement. And the value returned by flex-generated analysers will depend on the version of flex used to build the analyser. So portable code will have to allow for both conventions.

Posix does not explicitly require that the value returned by input() be positive, but it seems reasonable to assume that the intention was that the value would be the same as the value returned by fgetc() ("the next byte as an unsigned char converted to an int"). That's certainly what flex does. If you decide to count on that interpretation, you could simply test whether the return value from input() is less than or equal to 0.
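A minimal sketch of that test, assuming ordinary characters always come back as positive values. The names `is_end_of_input`, `skip_comment`, `fake_input`, `src` and `end_marker` are hypothetical; `fake_input()` only stands in for the scanner's `input()` so that the logic can be tried outside a generated scanner.

```c
#include <stdio.h>

/* Stand-in for the scanner's input(): reads from a string and returns
 * end_marker at the end, so both conventions can be exercised. */
static const char *src = "";
static int end_marker = 0;   /* 0 (POSIX) or EOF (flex before 2.6.1) */

static int fake_input(void)
{
    return *src ? (unsigned char)*src++ : end_marker;
}

/* Hypothetical helper: accepts either end-of-input convention.
 * Assumption: real characters are returned as positive values, as with
 * fgetc(). A NUL byte in the input is therefore also treated as the end. */
static int is_end_of_input(int c)
{
    return c <= 0;
}

/* Skip the rest of a C-style comment whose opening delimiter has already
 * been consumed. Returns 1 if the closing delimiter was found, 0 if the
 * input ended first. */
static int skip_comment(void)
{
    int c = fake_input();
    while (!is_end_of_input(c)) {
        if (c == '*') {
            c = fake_input();
            if (c == '/')
                return 1;
            /* Not a '/': fall through and re-examine c, which may be
             * another '*' or the end-of-input value. */
        } else {
            c = fake_input();
        }
    }
    return 0;
}
```

In a real scanner action the loop would call `input()` directly. The single `<= 0` comparison is what lets the same source behave the same way whichever version of flex generated the scanner.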

As an editorial aside, I have never used input() without eventually regretting it. There is almost always a better solution, usually involving start conditions. Aside from the details referenced by this question, input() does not integrate well with the flex infrastructure. Characters read with input() cannot be added to the current token, nor can they be reprocessed with yyless(). Automatic maintenance of yylineno will fail if a newline character is read with input(), and that is likely to affect user-supplied maintenance of column position. And so on.

With AT&T lex, it made a certain amount of sense to use input() to skip the text of multiline comments. In the 1970s, RAM was a much more precious resource than it is now, and lex was not very good at handling large tokens. So reading (and building) a comment token was an unnecessary and potentially dangerous step, since multiline comments could be quite large (relative to, say, an identifier token). AT&T lex read file input one character at a time using input() (which was normally aliased to fgetc), so the use of input() did not incur any overhead.

These days, none of that holds. RAM is relatively cheap, and flex is not burdened by having to use an internal buffer big enough to hold a multiline comment. On the other hand, since flex maintains its own internal buffer, it needs to emulate input() on top of its own buffer management, which does incur a certain overhead. So it should be uncommon to see snippets like the one I referred to from the flex manual; the start-condition-based comment detector is more efficient, shorter, and arguably more readable:

"/*"                 BEGIN(COMMENT);
<COMMENT>[*]+"/"     BEGIN(INITIAL);
<COMMENT>[^*]+|[*]+  ;
