3
votes

I've written very low performance descent recursion parser for general language (open source, for EBNF grammars). And I want to fix its performance by rewriting parser.

I read about lexical analysis, LL, LR, LALR parsers and modifications like LL(*), I read first 3 chapters of Dragon Book (about lexers and parsers), I explored open source projects like ANTLR and others.

And I want to know why this algorithm is not described. Maybe it is wrong way, I don't know. Or maybe I reinvented the wheel.

Assume we have grammar (e: end of file):

A: B? B 1? e
B: 0 | 1

Grammar after transform:

A: B B 1 e | B B e | B e
B: 0 | 1

Possible scenarios:

[01] [01] [1] [e]
[01] [01] [e]
[01] [e]

We can build something like FSM:

Symbol #0:

[01]: continue

Symbol #1:

[01]: continue
[e]: parse as "B e"

Symbol #2:

[1]: parse as "B B 1 e"
[e]: parse as "B B e"

It will parse token stream at O(N). For real grammar it can be modified to be more than simple FSM, but still O(N).

So I have these questions:

  1. Can this approach give positive results?

  2. Has it some relations with LL, LR and other parsers? At this moment I don't have enough understanding of those algorithms, I didn't try any of them.

  3. Which parsing algorithm is more fast for correct input string? I'm interested only in parsing correct input strings, because I'm making code generation tool for using with IDE, which can report errors itself. So I need most fast algorithm for this very specific task.

Thanks.

UPD:

I ended up with ANTLRv4, I found the target and runtime for my language (Swift) and I'm more than satisfied.

1
You forgot B 1 e in your transformed grammar. But it's not clear what you are asking. It's a well-known fact that the set of context-free languages is a superset of the set of regular languages. Are you asking how to recognize if a given grammar does, in fact, recognize a regular language? stackoverflow.com/questions/559763/…chepner
If you want somebody to tell you if you are re-inventing an algorithm, you should try asking on the Computer Science branch of Stack Exchange. FWIW, LR parsing using state machines generalized to push-down automata.Ira Baxter

1 Answers

2
votes

LALR(k) is O(N) and can be lightning fast if you reduce it to machine code for "branch on token in state to next state, stack the token value". (See this paper: https://web.archive.org/web/20170809025652id_/http://www.genesishistory.org/content/ProfPapers/VF-LRParsing.pdf)

It is unclear what you'd gain by trying to develop your idea out; how would be it be faster than that?

[What tends to matter the the most isn't parsing; it is usually the rate at which you can build lexemes, esp. eliminating white space].

If you are serious about building a tool, you should work on the tool and take the best technologies you can get so you don't have to invent them.

If you insist on inventing new technologies, then you'll end up having the patch/enhance/tune them, and never get around to building a tool. Maybe you have a great idea. You'll have to invest more energy to find out.

Just be sure you know which goal you are trying to achieve.