I'm working on parsing PDF content streams. Strings are delimited by parentheses but can contain nested unescaped parentheses. From the PDF Reference:

A literal string shall be written as an arbitrary number of characters enclosed in parentheses. Any characters may appear in a string except unbalanced parentheses (LEFT PARENTHESIS (28h) and RIGHT PARENTHESIS (29h)) and the backslash (REVERSE SOLIDUS (5Ch)), which shall be treated specially as described in this sub-clause. Balanced pairs of parentheses within a string require no special treatment.

EXAMPLE 1:

The following are valid literal strings: 
(This is a string)
(Strings may contain newlines
and such.)
(Strings may contain balanced parentheses ( ) and special characters (*!&}^% and so on).)

It seems like pushing lexer modes onto a stack would be the thing to handle this. Here's a stripped-down version of my lexer and parser.

lexer grammar PdfStringLexer;

Tj: 'Tj' ;
TJ: 'TJ' ;

NULL: 'null' ;

BOOLEAN: ('true'|'false') ;

LBRACKET: '[' ;
RBRACKET: ']' ;
LDOUBLEANGLE: '<<' ;
RDOUBLEANGLE: '>>' ;

NUMBER: ('+' | '-')? (INT | FLOAT) ;

NAME: '/' ID ;

// A sequence of literal characters enclosed in parentheses.
OPEN_PAREN: '(' -> more, pushMode(STR) ; 

// Hexadecimal data enclosed in angle brackets
HEX_STRING: '<' [0-9A-Fa-f]+ '>' ;

fragment INT: DIGIT+ ; // match 1 or more digits

fragment FLOAT:  DIGIT+ '.' DIGIT*  // match 1. 39. 3.14159 etc...
     |         '.' DIGIT+  // match .1 .14159
     ;

fragment DIGIT:   [0-9] ;        // match single digit

// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;

WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters

mode STR;

LITERAL_STRING : ')' -> popMode ;
STRING_OPEN_PAREN: '(' -> more, pushMode(STR) ; 
TEXT : . -> more ;

parser grammar PdfStringParser;

options { tokenVocab=PdfStringLexer; } 

array: LBRACKET object* RBRACKET ;
dictionary: LDOUBLEANGLE (NAME object)* RDOUBLEANGLE ;
string: (LITERAL_STRING | HEX_STRING) ;
object
    : NULL
    | array
    | dictionary
    | BOOLEAN
    | NUMBER
    | string
    | NAME
    ;

content : stat* ;

stat
    : tj
    ;

tj: ((string Tj) | (array TJ)) ; // Show text

When I process this file:

(Oliver’s Army) Tj
((What’s So Funny ’Bout) Peace, Love, and Understanding) Tj

I get this error and parse tree:

line 2:24 extraneous input ' Peace, Love, and Understanding)' expecting 'Tj'

[Image: parse tree]

So maybe pushMode doesn't push duplicate modes onto the stack. If that's the case, what is the right way to handle nested parentheses?


Edit

I left out the instructions regarding escape sequences within the string:

Within a literal string, the REVERSE SOLIDUS is used as an escape character. The character immediately following the REVERSE SOLIDUS determines its precise interpretation as shown in Table 3. If the character following the REVERSE SOLIDUS is not one of those shown in Table 3, the REVERSE SOLIDUS shall be ignored.

Table 3 lists \n, \r, \t, \b (backspace, 08h), \f (form feed, FF), \(, \), \\, and \ddd (character code ddd, octal).

An end-of-line marker appearing within a literal string without a preceding REVERSE SOLIDUS shall be treated as a byte value of (0Ah), irrespective of whether the end-of-line marker was a CARRIAGE RETURN (0Dh), a LINE FEED (0Ah), or both.

EXAMPLE 2:

(These \
two strings \
are the same.)
(These two strings are the same.)

EXAMPLE 3:

(This string has an end-of-line at the end of it. 
)
(So does this one.\n)

Should I use this STRING definition:

STRING
 : '(' ( ~[()]+ | STRING )* ')'
 ;

without modes and deal with escape sequences in my code, or create a lexer mode for strings and deal with escape sequences in the grammar?

1 Answer

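pushMode does push the same mode more than once, so that is not the problem. The problem is that the ')' rule in STR mode always emits a token: the inner ')' pops one level and ends LITERAL_STRING right there, and the rest of the line becomes a second LITERAL_STRING, which is the extraneous input in your error. You could fix this with lexical modes, but in this case they're not really needed. ANTLR 4 lexer rules can be recursive, so you could simply define a lexer rule like this: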

STRING
 : '(' ( ~[()]+ | STRING )* ')'
 ;
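
To see what that recursive rule accepts, here is a rough Python equivalent. It is only a sketch for illustration, not how ANTLR implements it, and the function name match_string is made up. Like the rule above, it ignores escape sequences.

```python
def match_string(text, pos=0):
    """Return the index just past the balanced string starting at text[pos], or -1.

    Mirrors  STRING : '(' ( ~[()]+ | STRING )* ')' ;  with no escape handling.
    """
    if pos >= len(text) or text[pos] != "(":
        return -1
    pos += 1
    while pos < len(text):
        ch = text[pos]
        if ch == ")":
            return pos + 1                 # closing parenthesis of this level
        if ch == "(":
            pos = match_string(text, pos)  # nested STRING
            if pos == -1:
                return -1
        else:
            pos += 1                       # any other character: ~[()]
    return -1                              # input ended first: unbalanced


line = "((What’s So Funny ’Bout) Peace, Love, and Understanding) Tj"
end = match_string(line)
print(line[:end])  # the whole string, up to and including the last ')'
```

The nested pair is consumed by the recursive call, so the outer string only ends at its own closing parenthesis and the Tj operator is left for the next token.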

And with escape sequences, you could try:

STRING
 : '(' ( ~[()\\]+ |  ESCAPE_SEQUENCE | STRING )* ')'
 ;

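// Octal codes may have one to three digits. The final '.' accepts line
// continuations (backslash before an end-of-line) and unknown escapes.
fragment ESCAPE_SEQUENCE
 : '\\' ( [nrtbf()\\] | [0-7] [0-7]? [0-7]? | . )
 ;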
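
Either way, the lexer only finds where the string ends; turning the escapes into actual characters still happens in your own code. A minimal sketch of that step in Python, following the rules quoted in the question, might look like this. The function name decode_literal_string is hypothetical, it works on the token text as a str with one character per byte, and accepting one to three octal digits is my reading of the PDF specification rather than something stated in the quoted text.

```python
SIMPLE_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
                  "(": "(", ")": ")", "\\": "\\"}


def decode_literal_string(token_text):
    """Decode the text of a matched literal string token, outer parentheses included."""
    body = token_text[1:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            i += 1
            if i >= len(body):
                break                        # trailing backslash: nothing to escape
            nxt = body[i]
            if nxt in SIMPLE_ESCAPES:
                out.append(SIMPLE_ESCAPES[nxt])
                i += 1
            elif nxt in "01234567":
                j = i
                while j < len(body) and j < i + 3 and body[j] in "01234567":
                    j += 1
                # Assumption: overflow above one byte is dropped.
                out.append(chr(int(body[i:j], 8) & 0xFF))
                i = j
            elif nxt in "\r\n":
                # Line continuation: backslash and end-of-line both vanish.
                i += 2 if body.startswith("\r\n", i) else 1
            else:
                out.append(nxt)              # unknown escape: ignore the backslash
                i += 1
        elif ch == "\r":
            out.append("\n")                 # bare CR or CRLF becomes a single LF
            i += 2 if body.startswith("\r\n", i) else 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(decode_literal_string("(These \\\ntwo strings \\\nare the same.)"))
```

With this, the two strings in EXAMPLE 2 decode to the same text, and both strings in EXAMPLE 3 end in a line feed.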