2
votes

I am writing a reader for PDF file format using boost spirit lexer and grammar.

The problem is, that this grammar is kind of context sensitive one. Actually, there is an object called Stream dictionary, which has the following structure:

<< - begin
*( - zero or more times
/NAME - being the key of dictionary
VALUE - being the value of the key in dictionary
)
>> - end
stream - keyword
DATA
endstream - keyword

So, the data inside the stream has exactly the size defined in the dictionary, given example:

<</LENGTH 4>>
stream
aaaaendstream

Now the problem. I need to tell the grammar, to skip n characters. Here is the rule for parsing such object.

stream_object %=
    dictionary_object
    >> whitespaces
    >> lexer.stream_begin
    > eol
    > qi::repeat(159)["a-z"]
    > lexer.stream_end;

As far as I have saw, every operation in this grammar uses the defined input lexer, and the line 'qi::repeat(159)["a-z"]' expecting a character simply fails, because the lexer does not know such sequence.

I had multiple ideas, all of them equally wrong.

Lexer state

For example, change the lexer state, after encountering "stream" token.

this->self += stream_begin[lex::_state = "STREAM_BEGIN"];
this->self("STREAM_BEGIN") = stream_end[lex::_state = initial_state()] | character;

This somehow works, unless there is a "endstream" token inside the data AND it also tries to match endstream in EVERY character sequence inside the data, which slows the parsing horribly.

ABANDON LEXER

Next approach was to abandon lexer completely.

stream_object %=
    dictionary_object
    >> whitespaces
    >> lit("stream")
    > lit("\r?\n")
    > qi::repeat(159)["a-z"]
    > lit("endstream");

This would work, but I don't like the idea of abandoning lexer only for single stupid rule. Also I've read about performance degradation when using lit instead of lexer, because of the token backtracking.

ANOTHER PARSER

The last approach was to ignore such object, and parse only the dictionary part. After successful parse of dictionary object check for following tokens, and read the rest of the data without a tokenizer, as shown in the second approach.

QUESTION

I would really love to see some kind of "seek forward" or "temporarily ignore lexer" directive, to be able to skip portion of the input, without dividing the parsing process into multiple places, or introducing obsessive overhead. Is there such thing? Thought about each of the approaches are appreciated.

Lexer Code

typedef boost::spirit::istream_iterator base_iterator_type;
typedef boost::spirit::classic::position_iterator2<base_iterator_type> pos_iterator_type;
typedef boost::spirit::lex::lexertl::token<pos_iterator_type> token_type;
typedef boost::spirit::lex::lexertl::actor_lexer<token_type> lexer_type;

class SpiritLexer : public boost::spirit::lex::lexer<lexer_type> {...}

Grammar Code

struct SpiritGrammar : qi::grammar<pos_iterator_type> {...}

Usage

SpiritLexer lexer;
SpiritGrammar grammar(lexer);
auto result = lex::tokenize_and_parse(input_begin_pos, input_end_pos, lexer, grammar, obj);
2
"I've read about performance degradation" I wonder where you read that. I've written about that, conjecturingly, too, but I've not seen any field evidence related to it. My gut tells me the expectation points, lookahead assertions and semantic actions can easily achieve the same token scanning efficacy as with a "dumb" lexer. The thing is, Lexers are simple beasts and as such rule out a number of noisy choices, but their simplicity is no match for Qi. IMHO. If lexing, I'd handroll the parser and be done with it.sehe
Interestingly, straightforward and well-intended use of a Lexer has a tendency to surprise people with significant performance degradation (see eg. How to benchmark Boost Spirit Parser? or here).sehe

2 Answers

1
votes

Can you consider making it much simpler by moving the stream seeking outside parsing/lexing?

And while you're at it, reconsider the need for the lexer? (Ah, I see, you consider this already. I'd say, BIG yes in this department, unless you know you need it for something).

  • Temporarily disabling the lexer is (obviously) impossible. You'd simple have to restart lexing from a different point/state. I've even come to the conclusion that some rudimentary directives that are present (undocumented, AFAIR) to switch Lexer state from withing Parser semantic actions cannot be made to work usefully, if at all

  • Avoiding the lexer opens up the possibility of using the directives from spirit::repository::qi which seems to be close what you are looking for:

Conceptually it seems pretty clear, to me, that you're mixing parsing and direction of the parsing. This indicates that they should be at different levels of abstraction.

1
votes

This is more a comment on your task

writing a reader for PDF file format using boost spirit lexer and grammar.

than an answer to your concrete questions. It's too big for a comment, though.

If your task indeed is to parse generic PDF files (and not merely some generated by a controlled set of PDF generators, your approach likely is doomed for many reasons, especially that the canonical start positions of indirect objects are listed in cross reference tables or streams. The main cross reference table/stream is referenced at the end of the document. Furthermore cross reference streams can be compressed which requires selective decompression before complete parsing.

If you ignore these tables and try to recognize indirect objects by parsing front to back (which is your approach as I understand it), you might be in for a surprise. Especially if there are multiple objects with the same ID you cannot be sure which is the actual one and which is ignored.

Furthermore the possibility to reference indirect objects will get you into trouble.

E.g. your example:

<</LENGTH 4>>
stream
aaaaendstream

might also look like this:

<</LENGTH 5 0 R>>
stream
aaaaendstream
...
5 0 obj
    4
endobj

Thus, you have to look ahead to know how long your stream is.