6
votes

I am trying to write an ANTLR grammar for the PHP serialize() format, and everything seems to work fine, except for strings. The problem is that the format of serialized strings is :

s:6:"length";

In terms of regexes, a rule like s:(\d+):".{\1}"; would describe this format if only backreferences were allowed in the "number of matches" count (but they are not).

But I cannot find a way to express this for either a lexer or parser grammar: the whole idea is to make the number of characters read depend on a backreference describing the number of characters to read, as in Fortran Hollerith constants (i.e. 6HLength), not on a string delimiter.

This example from the ANTLR grammar for Fortran seems to point the way, but I don't see how. Note that my target language is Python, while most of the doc and examples are for Java:

// numeral literal
ICON {int counter=0;} :
    /* other alternatives */
    // hollerith
    'h' ({counter>0}? NOTNL {counter--;})* {counter==0}?
      {
      $setType(HOLLERITH);
      String str = $getText;
      str = str.replaceFirst("([0-9])+h", "");
      $setText(str);
      }
    /* more alternatives */
    ;
1

1 Answers

5
votes

Since input like s:3:"a"b"; is valid, you can't define a String token in your lexer, unless the first and last double quote are always the start and end of your string. But I guess this is not the case.

So, you'll need a lexer rule like this:

SString
  :  's:' Int ':"' ( . )* '";'
  ;

In other words: match a s:, then an integer value followed by :" then one or more characters that can be anything, ending with ";. But you need to tell the lexer to stop consuming when the value Int is not reached. You can do that by mixing some plain code in your grammar to do so. You can embed plain code by wrapping it inside { and }. So first convert the value the token Int holds into an integer variable called chars:

SString
  :  's:' Int {chars = int($Int.text)} ':"' ( . )* '";'
  ;

Now embed some code inside the ( . )* loop to stop it consuming as soon as chars is counted down to zero:

SString
  :  's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
  ;

and that's it.

A little demo grammar:

grammar Test;

options {
  language=Python;
}

parse
  :  (SString {print 'parsed: [\%s]' \% $SString.text})+ EOF
  ;

SString
  :  's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
  ;

Int
  :  '0'..'9'+
  ;

(note that you need to escape the % inside your grammar!)

And a test script:

import antlr3
from TestLexer import TestLexer
from TestParser import TestParser

input = 's:6:"length";s:1:""";s:0:"";s:3:"end";'
char_stream = antlr3.ANTLRStringStream(input)
lexer = TestLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = TestParser(tokens)
parser.parse()

which produces the following output:

parsed: [s:6:"length";]
parsed: [s:1:""";]
parsed: [s:0:"";]
parsed: [s:3:"end";]