1
votes

I am building a compiler and have one doubt. For a STRING we have no idea how long the literal will be, but a CHARACTER will be either 3 or 4 symbols long, including the opening and closing single quotes (e.g. 'a', '\t'). Can we take advantage of this knowledge and flag an error in the lexer if the input is longer than 4 symbols, or is this the parser's job?

Ex. 'aaaaaaaaaaa'

Is this a lexical error or a syntactic error? Where is the best place for this check?

1
That depends entirely on the intended semantics of the language you are parsing. For example, in Python, 'some string' is (almost) entirely equivalent to "some string". Not so in C or C++, where single quotes are used for character literals, including multi-byte character literals, and double quotes denote a string literal. - twalberg

1 Answer

2
votes

In C, 'aaaaaaaaaaa' is neither a lexical nor a syntactic error, although its semantics are implementation-defined:

The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined. (C standard, section 6.4.4.4, paragraph 10.)

It would have been easy to restrict character constants to a single character or escape sequence, but not by counting the display length of the character constant. (For example, 'ab' (length 4) would be illegal while '\x2C' (length 6) is legal, and '\u00C3' (length 8) depends on encoding.)
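To illustrate the difference, here is a minimal sketch of a check that counts characters and escape sequences between the quotes instead of measuring the display length. The function name count_c_chars is hypothetical, and the sketch assumes the literal is a NUL-terminated string of single-byte source characters and that only the standard C escape forms need to be recognized.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

static int is_octal(char c) { return c >= '0' && c <= '7'; }

/* Hypothetical helper: returns the number of characters or escape
   sequences between the quotes of a character literal, or -1 if the
   literal is malformed. The caller decides which counts are legal. */
int count_c_chars(const char *lit)
{
    size_t i = 0;
    int count = 0;

    if (lit[i++] != '\'')
        return -1;                          /* must open with a quote */

    while (lit[i] != '\'') {
        if (lit[i] == '\0' || lit[i] == '\n')
            return -1;                      /* unterminated literal */
        if (lit[i] != '\\') {
            i++;                            /* ordinary character */
        } else {
            i++;                            /* skip the backslash */
            if (lit[i] == 'x') {            /* \x followed by hex digits */
                i++;
                if (!isxdigit((unsigned char)lit[i]))
                    return -1;
                while (isxdigit((unsigned char)lit[i]))
                    i++;
            } else if (lit[i] == 'u' || lit[i] == 'U') {
                int need = (lit[i] == 'u') ? 4 : 8;
                i++;                        /* \uXXXX or \UXXXXXXXX */
                for (int k = 0; k < need; k++, i++)
                    if (!isxdigit((unsigned char)lit[i]))
                        return -1;
            } else if (is_octal(lit[i])) {  /* up to three octal digits */
                int k = 0;
                while (k < 3 && is_octal(lit[i])) {
                    i++;
                    k++;
                }
            } else if (lit[i] != '\0' && strchr("'\"?\\abfnrtv", lit[i])) {
                i++;                        /* simple escape such as \t */
            } else {
                return -1;                  /* unknown escape */
            }
        }
        count++;
    }
    return lit[i + 1] == '\0' ? count : -1; /* nothing may follow the quote */
}
```

With this sketch, count_c_chars("'ab'") yields 2 while count_c_chars("'\\x2C'") yields 1, so a language that wanted single-character constants could reject any count other than 1 regardless of how many symbols the literal occupies in the source.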

In any case, the frontier between "lexical" and "syntactic" errors is not particularly well-defined, least of all for C, in which 23skidoo is a valid preprocessing token (a pp-number) but not a valid token.

If your question is "should I detect and react to this error in the scanner or the parser", I would answer, "whichever seems most convenient to you". My preference is to centralize all error handling in a single place, though, which means in the parser, and that requires the scanner to pass a special "bad token" token to the parser in order to trigger error detection (and possibly recovery) in the parser.
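As a rough sketch of the "bad token" approach, the scanner below only classifies the input and never prints anything, while a parser-side function is the single place that reports the error. All names here (TOK_BAD, next_token, parse_char_constant) are hypothetical, and the escape handling is deliberately simplified: a backslash plus the following character counts as one unit, so multi-digit escapes such as hex sequences are not handled.

```c
#include <stddef.h>
#include <stdio.h>

enum token_kind { TOK_CHAR, TOK_BAD, TOK_EOF };

struct token {
    enum token_kind kind;
    const char *text;    /* start of the lexeme in the source buffer */
    size_t length;       /* number of source characters consumed */
};

/* Scanner: classifies the next lexeme but reports no errors itself. */
struct token next_token(const char *src)
{
    struct token t = { TOK_EOF, src, 0 };
    size_t i, units = 0;

    if (src[0] == '\0')
        return t;                    /* end of input */
    if (src[0] != '\'') {
        t.kind = TOK_BAD;            /* this sketch only knows char literals */
        t.length = 1;
        return t;
    }

    i = 1;
    while (src[i] != '\0' && src[i] != '\n' && src[i] != '\'') {
        if (src[i] == '\\' && src[i + 1] != '\0')
            i++;                     /* simplification: escape = 2 symbols */
        i++;
        units++;
    }

    if (src[i] == '\'') {
        i++;                         /* consume the closing quote */
        t.kind = (units == 1) ? TOK_CHAR : TOK_BAD;
    } else {
        t.kind = TOK_BAD;            /* unterminated literal */
    }
    t.length = i;
    return t;
}

/* Parser side: the only place where the error is reported.
   Returns 1 if a valid character constant was parsed, 0 otherwise. */
int parse_char_constant(const char *src, FILE *err)
{
    struct token t = next_token(src);

    if (t.kind == TOK_BAD) {
        fprintf(err, "error: invalid character constant: %.*s\n",
                (int)t.length, t.text);
        return 0;
    }
    return t.kind == TOK_CHAR;
}
```

Because the bad token carries its lexeme and length, the parser can produce the diagnostic and then continue after the offending text if it wants to attempt recovery.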