I am writing a reader for the PDF file format using the Boost.Spirit lexer and grammar.
The problem is that this grammar is somewhat context-sensitive. There is an object called a Stream dictionary, which has the following structure:
<< - begin
*( - zero or more times
/NAME - being the key of dictionary
VALUE - being the value of the key in dictionary
)
>> - end
stream - keyword
DATA
endstream - keyword
The data inside the stream has exactly the size defined in the dictionary. For example:
<</LENGTH 4>>
stream
aaaaendstream
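To make the dependency concrete: the number of data bytes is only known once the dictionary has been read. Below is a minimal, library-free sketch of that step. The helper `find_length` is hypothetical (it is not part of Boost.Spirit) and assumes the value is a plain integer written directly after the key, as in the example above.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper: returns the integer that follows "/LENGTH" in the
// dictionary text, or std::nullopt if the key or a digit sequence is missing.
// Assumption: the value is a direct integer; the key name is matched as a
// plain substring, so a longer key starting with "/LENGTH" is not rejected.
std::optional<std::size_t> find_length(std::string_view dict) {
    constexpr std::string_view key = "/LENGTH";
    std::size_t pos = dict.find(key);
    if (pos == std::string_view::npos) return std::nullopt;
    pos += key.size();

    // Skip whitespace between the key and its value.
    while (pos < dict.size() &&
           std::isspace(static_cast<unsigned char>(dict[pos]))) {
        ++pos;
    }

    // Accumulate decimal digits.
    std::size_t value = 0;
    bool any_digit = false;
    while (pos < dict.size() &&
           std::isdigit(static_cast<unsigned char>(dict[pos]))) {
        value = value * 10 + static_cast<std::size_t>(dict[pos] - '0');
        ++pos;
        any_digit = true;
    }
    if (!any_digit) return std::nullopt;
    return value;
}
```

The returned value is what the stream rule would need as its repeat count, which is why the count cannot be hard-coded in the grammar.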
Now the problem: I need to tell the grammar to skip n characters. Here is the rule for parsing such an object:
stream_object %=
dictionary_object
>> whitespaces
>> lexer.stream_begin
> eol
> qi::repeat(159)["a-z"]
> lexer.stream_end;
As far as I have seen, every operation in this grammar uses the defined input lexer, and the line 'qi::repeat(159)["a-z"]', which expects a character, simply fails because the lexer does not know such a sequence.
I had multiple ideas, all of them equally wrong.
LEXER STATE
For example, change the lexer state after encountering the "stream" token.
this->self += stream_begin[lex::_state = "STREAM_BEGIN"];
this->self("STREAM_BEGIN") = stream_end[lex::_state = initial_state()] | character;
This somewhat works, but it breaks when there is an "endstream" token inside the data. It also tries to match endstream against every character sequence inside the data, which slows the parsing down horribly.
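The failure mode of a terminator-based scan can be shown without Spirit at all. The sketch below is only an illustration with made-up function names: it compares "stop at the first endstream" with "take exactly the declared number of bytes" on a payload that happens to contain the keyword.

```cpp
#include <cstddef>
#include <string_view>

// Terminator-based scan: everything before the first "endstream".
// This mirrors what the STREAM_BEGIN lexer state effectively does.
// If the keyword is absent, find() returns npos and the whole input is kept.
std::string_view body_by_terminator(std::string_view data) {
    return data.substr(0, data.find("endstream"));
}

// Length-based read: exactly `length` bytes, contents never inspected.
// substr() clamps the count if `length` exceeds the available data.
std::string_view body_by_length(std::string_view data, std::size_t length) {
    return data.substr(0, length);
}
```

For the input `xxendstreamyyendstream` with a declared length of 13, the terminator scan yields `xx`, while the length-based read yields the full body `xxendstreamyy`. The length-based read also touches each byte at most once, which is the cost I would like to get from the grammar.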
ABANDON LEXER
The next approach was to abandon the lexer completely.
stream_object %=
dictionary_object
>> whitespaces
>> lit("stream")
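> -lit('\r') > lit('\n') // lit() matches literal text, not a regex
> qi::repeat(159)[qi::char_] // any 159 characters; 159 stands in for the length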
> lit("endstream");
This would work, but I don't like the idea of abandoning the lexer just for a single stupid rule. I have also read about performance degradation when using lit instead of a lexer, because of token backtracking.
ANOTHER PARSER
The last approach was to ignore such objects and parse only the dictionary part. After a successful parse of the dictionary object, check the following tokens and read the rest of the data without a tokenizer, as shown in the second approach.
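A rough sketch of the second phase of this idea, written without Spirit so it stays self-contained. It assumes the length has already been obtained from the parsed dictionary and that the input starts at the "stream" keyword. `RawStream` and `read_stream_body` are hypothetical names, not library API.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

struct RawStream {
    std::string_view body;  // the `length` data bytes
    std::size_t consumed;   // bytes consumed, up to and including "endstream"
};

// Consumes: "stream" EOL <length bytes> "endstream" from the front of `in`.
// Returns std::nullopt if the input does not have that layout.
std::optional<RawStream> read_stream_body(std::string_view in,
                                          std::size_t length) {
    constexpr std::string_view begin_kw = "stream";
    constexpr std::string_view end_kw = "endstream";
    const std::size_t total = in.size();

    if (in.substr(0, begin_kw.size()) != begin_kw) return std::nullopt;
    in.remove_prefix(begin_kw.size());

    // Accept "\r\n" or "\n" after the keyword.
    if (!in.empty() && in.front() == '\r') in.remove_prefix(1);
    if (in.empty() || in.front() != '\n') return std::nullopt;
    in.remove_prefix(1);

    // Take exactly `length` bytes without inspecting them.
    if (in.size() < length) return std::nullopt;
    const std::string_view body = in.substr(0, length);
    in.remove_prefix(length);

    // The closing keyword must follow immediately.
    if (in.substr(0, end_kw.size()) != end_kw) return std::nullopt;
    in.remove_prefix(end_kw.size());

    return RawStream{body, total - in.size()};
}
```

The `consumed` field is the offset at which tokenizing could resume, which is the part that forces the parsing process to be split across two places.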
QUESTION
I would really love to see some kind of "seek forward" or "temporarily ignore lexer" directive that would let me skip a portion of the input without splitting the parsing process across multiple places or introducing excessive overhead. Is there such a thing? Thoughts on each of the approaches are appreciated.
Lexer Code
typedef boost::spirit::istream_iterator base_iterator_type;
typedef boost::spirit::classic::position_iterator2<base_iterator_type> pos_iterator_type;
typedef boost::spirit::lex::lexertl::token<pos_iterator_type> token_type;
typedef boost::spirit::lex::lexertl::actor_lexer<token_type> lexer_type;
class SpiritLexer : public boost::spirit::lex::lexer<lexer_type> {...}
Grammar Code
struct SpiritGrammar : qi::grammar<pos_iterator_type> {...}
Usage
SpiritLexer lexer;
SpiritGrammar grammar(lexer);
auto result = lex::tokenize_and_parse(input_begin_pos, input_end_pos, lexer, grammar, obj);