Spirit qi seek n characters with lexer

Question

I am writing a reader for PDF file format using boost spirit lexer and grammar.

The problem is, that this grammar is kind of context sensitive one. Actually, there is an object called Stream dictionary, which has the following structure:

<< - begin
*( - zero or more times
/NAME - being the key of dictionary
VALUE - being the value of the key in dictionary
)
>> - end
stream - keyword
DATA
endstream - keyword

So, the data inside the stream has exactly the size defined in the dictionary, given example:

<</LENGTH 4>>
stream
aaaaendstream

Now the problem. I need to tell the grammar, to skip n characters. Here is the rule for parsing such object.

stream_object %=
    dictionary_object
    >> whitespaces
    >> lexer.stream_begin
    > eol
    > qi::repeat(159)["a-z"]
    > lexer.stream_end;

As far as I have saw, every operation in this grammar uses the defined input lexer, and the line 'qi::repeat(159)["a-z"]' expecting a character simply fails, because the lexer does not know such sequence.

I had multiple ideas, all of them equally wrong.

Lexer state

For example, change the lexer state, after encountering "stream" token.

this->self += stream_begin[lex::_state = "STREAM_BEGIN"];
this->self("STREAM_BEGIN") = stream_end[lex::_state = initial_state()] | character;

This somehow works, unless there is a "endstream" token inside the data AND it also tries to match endstream in EVERY character sequence inside the data, which slows the parsing horribly.

ABANDON LEXER

Next approach was to abandon lexer completely.

stream_object %=
    dictionary_object
    >> whitespaces
    >> lit("stream")
    > lit("\r?\n")
    > qi::repeat(159)["a-z"]
    > lit("endstream");

This would work, but I don't like the idea of abandoning lexer only for single stupid rule. Also I've read about performance degradation when using lit instead of lexer, because of the token backtracking.

ANOTHER PARSER

The last approach was to ignore such object, and parse only the dictionary part. After successful parse of dictionary object check for following tokens, and read the rest of the data without a tokenizer, as shown in the second approach.

QUESTION

I would really love to see some kind of "seek forward" or "temporarily ignore lexer" directive, to be able to skip portion of the input, without dividing the parsing process into multiple places, or introducing obsessive overhead. Is there such thing? Thought about each of the approaches are appreciated.

Lexer Code

typedef boost::spirit::istream_iterator base_iterator_type;
typedef boost::spirit::classic::position_iterator2<base_iterator_type> pos_iterator_type;
typedef boost::spirit::lex::lexertl::token<pos_iterator_type> token_type;
typedef boost::spirit::lex::lexertl::actor_lexer<token_type> lexer_type;

class SpiritLexer : public boost::spirit::lex::lexer<lexer_type> {...}

Grammar Code

struct SpiritGrammar : qi::grammar<pos_iterator_type> {...}

Usage

SpiritLexer lexer;
SpiritGrammar grammar(lexer);
auto result = lex::tokenize_and_parse(input_begin_pos, input_end_pos, lexer, grammar, obj);

"I've read about performance degradation" I wonder where you read that. I've written about that, conjecturingly, too, but I've not seen any field evidence related to it. My gut tells me the expectation points, lookahead assertions and semantic actions can easily achieve the same token scanning efficacy as with a "dumb" lexer. The thing is, Lexers are simple beasts and as such rule out a number of noisy choices, but their simplicity is no match for Qi. IMHO. If lexing, I'd handroll the parser and be done with it. — sehe
Interestingly, straightforward and well-intended use of a Lexer has a tendency to surprise people with significant performance degradation (see eg. How to benchmark Boost Spirit Parser? or here). — sehe

sehe sehe · Accepted Answer · 2014-12-13T19:40:15

Can you consider making it much simpler by moving the stream seeking outside parsing/lexing?

And while you're at it, reconsider the need for the lexer? (Ah, I see, you consider this already. I'd say, BIG yes in this department, unless you know you need it for something).

Temporarily disabling the lexer is (obviously) impossible. You'd simple have to restart lexing from a different point/state. I've even come to the conclusion that some rudimentary directives that are present (undocumented, AFAIR) to switch Lexer state from withing Parser semantic actions cannot be made to work usefully, if at all
Avoiding the lexer opens up the possibility of using the directives from spirit::repository::qi which seems to be close what you are looking for:
- The Spirit.Qi advance is a primitive parser component allowing the parser to skip (advance) through a specified number of iterations without performing unnecessary work:
- The seek[] parser-directive skips all input until the subject parser matches.

Conceptually it seems pretty clear, to me, that you're mixing parsing and direction of the parsing. This indicates that they should be at different levels of abstraction.