I have a lex/yacc project which does one thing perfectly fine. I have another lex/yacc project which does another thing perfectly fine. The main part of each yacc program has an input and an output file specified, and in the callback I put results into the output file. How can I feed the output of the first lex/yacc pair into the input of another lex/yacc pair? I do not want to generate intermediate files and push the output file as input to the second program; that is the part I want to change. The problem is that the number of lex/yacc pairs which each do one thing perfectly fine is huge and increasing (currently 100), so I am looking for a generic strategy that I can reuse, where all lex/yacc pairs form part of a single project that I can work with. Any suggestions?
1 Answer
If I understand the question correctly, it has little to do with parsing tools (but see the note below), since it is an instance of a more general problem, chaining many-to-many transformers, which is itself an instance of the Producer-Consumer problem.
In short, the problem is that the transformers being chained are not one-to-one, so you cannot simply provide the first one with some input, collect its corresponding output, and then feed that into the second one. Instead, you have to provide some form of control-flow inversion, in which both consumer and producer can suspend their execution (retaining their own call stacks) while the other is executing. This model can easily be implemented with coroutines, but C++ does not (yet) have such things. In other languages, like Go, this problem is solved with channels, but again these are not (yet) native to C++.
This is precisely the problem solved by the standard shell "pipe" operator (|), which is implemented under the hood using the pipe system call and multiple processes. (There is an example in the linked manpage.)
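For concreteness, here is a minimal sketch of that approach in C: the two existing parser executables are run as separate processes with a pipe between them, so the first one's standard output becomes the second one's standard input with no intermediate file. The program names "./parser1" and "./parser2" are placeholders for your own tools, each assumed to read stdin and write stdout.

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int fd[2];
        if (pipe(fd) == -1) { perror("pipe"); return 1; }

        if (fork() == 0) {                  /* first stage: writes into the pipe */
            dup2(fd[1], STDOUT_FILENO);
            close(fd[0]); close(fd[1]);
            execl("./parser1", "parser1", (char *)NULL);
            perror("execl parser1"); _exit(127);
        }
        if (fork() == 0) {                  /* second stage: reads from the pipe */
            dup2(fd[0], STDIN_FILENO);
            close(fd[0]); close(fd[1]);
            execl("./parser2", "parser2", (char *)NULL);
            perror("execl parser2"); _exit(127);
        }
        close(fd[0]); close(fd[1]);         /* parent keeps neither end open */
        while (wait(NULL) > 0) ;            /* reap both children */
        return 0;
    }

The same idea generalizes to a chain of N stages by creating N-1 pipes in a loop, which is essentially what the shell does for a | b | c.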
I hope that between all those links you can find some approach which meets your needs.
Upon reflection, and to make this answer possibly more useful for posterity, it's worth noting that the problem does actually have to do with a limitation of the standard parser generator tools.
In essence, yacc/bison and (f)lex work together to produce a tool which reads input from some source, chunk by chunk, and invokes user-defined actions (a combination of tokenizer actions and reduction actions) as appropriate. So you can view the control flow as:
          --> tool -->
               /\
              /  \
             /    \
            /      \
     /-->read      perform<-\
     |   chunk     action   |
     |     |         |      |
     |_____/         \______|
The control flow inside the tool is hermetic, so we cannot simply suspend execution in order to wait for input, or to provide partial "output" (i.e. post-action processing).
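To make "hermetic" concrete, here is a minimal sketch of the usual driver for such a tool: once yyparse() is called it owns the control flow, pulling tokens (and therefore input) through yylex() until end of input or a syntax error, with no way to suspend it from outside.

    #include <stdio.h>

    extern FILE *yyin;       /* flex's input stream                    */
    int yyparse(void);       /* generated by yacc/bison; calls yylex() */

    int main(int argc, char **argv) {
        yyin = (argc > 1) ? fopen(argv[1], "r") : stdin;
        if (!yyin) { perror(argv[1]); return 1; }
        return yyparse();    /* returns only at EOF or on error */
    }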
In order to couple two or more of these tools together, though, we need to be able to do just that. The input of the second process needs to come from the output of the first process, but the second process needs to call (something) to get its input, while the first process is determined to call (something) to handle its output.
So the standard solutions are as described in the first part of the answer: decouple the two tools from their control flow by running them in separate processes/threads/fibres/coroutines/etc. which communicate by means of pipes/queues/channels/etc. The intermediate object, the pipe/queue/channel, has the interesting feature that both ends can be "call"ed, thereby allowing us to invert the control flow.
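As one illustration of that decoupling, here is a sketch of a same-process variant: the two parsers run in separate threads with a pipe acting as the channel between them. first_parse, second_parse, first_out and second_in are hypothetical wrappers around, and hooks into, your existing parsers (which, if linked into one process, would also need distinct symbol prefixes, e.g. flex's %option prefix and bison's %define api.prefix, to avoid clashes).

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    FILE *first_out;                  /* the first parser's actions fprintf here   */
    FILE *second_in;                  /* wired to the second parser's yyin         */
    int first_parse(void);            /* hypothetical wrappers around each yyparse */
    int second_parse(void);

    static void *run_first(void *arg)  { (void)arg; first_parse(); fclose(first_out); return NULL; }
    static void *run_second(void *arg) { (void)arg; second_parse(); return NULL; }

    int main(void) {
        int fd[2];
        pthread_t t1, t2;
        if (pipe(fd) == -1) { perror("pipe"); return 1; }
        first_out = fdopen(fd[1], "w");   /* producer end of the channel */
        second_in = fdopen(fd[0], "r");   /* consumer end of the channel */
        pthread_create(&t1, NULL, run_first, NULL);
        pthread_create(&t2, NULL, run_second, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

A bounded queue protected by a mutex and condition variable would serve equally well as the channel; the pipe simply reuses the kernel's buffering and blocking behaviour.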
But the whole problem would be avoided if the tools were not call+call, but rather call+return (calls something to get input, but returns each individual piece of output) or called+call (is called with each chunk of input, and calls something to handle each chunk of output). In either of those cases, we could just chain two or more tools together. For example, in the second case (called+call) we would have:
 /-->input---->tool1
 |____/          |
             /-->|
            /    |---->tool2
            |___/        |
                     /-->|
                    /    |---->output
                    |___/
This is often called the "push" model, because we "push" successive chunks of input into the system, which eventually provides (through a callback) successive output actions. There is also the "pull" model, described above as call+return, in which we call the system every time we need some output, and it calls an input provider whenever it needs more input to satisfy the request:
 /-->call<----tool2
 |____/         |
            /-->|
           /    |<----tool1
           |___/        |
                    /-->|
                   /    |<----input
                   |___/
(Note that in the push model, we call tool1, pushing the next input, while in the pull model we call tool2 to get the next result.)
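To make those diagrams concrete, here is a purely hypothetical sketch of what a called+call (push) interface for one of these tools could look like; none of these names exist in flex or bison.

    #include <stddef.h>

    /* Downstream sink: called once per chunk of output the tool produces. */
    typedef void (*emit_fn)(const char *chunk, size_t len, void *user);

    typedef struct tool tool;                     /* opaque per-tool state        */
    tool *tool_new(emit_fn emit, void *user);     /* register the downstream sink */
    void  tool_push(tool *t, const char *chunk, size_t len);  /* may emit 0..n chunks */
    void  tool_end(tool *t);                      /* end of input: flush output   */

Chaining would then just be a matter of installing tool2's tool_push as tool1's emit callback, and so on down the line.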
It is trivial to find parsers which implement the push model. The Lemon parser has always worked that way, and bison has had a push API for quite some time now. But that's not good enough, unless we have some mechanism other than a (f)lex-built scanner to provide the token stream, because what we want to push is a "chunk of input", not a series of tokens.
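For reference, here is a minimal sketch of driving a bison push parser, assuming a grammar built with %define api.pure full and %define api.push-pull push (the header name below is a placeholder). It also illustrates exactly the limitation just described: the driver still pulls tokens from yylex() rather than pushing raw chunks of input.

    #include "parser.tab.h"          /* placeholder for the bison-generated header */

    int yylex(YYSTYPE *lvalp);       /* pure scanner interface */

    void drive(void) {
        yypstate *ps = yypstate_new();     /* per-parse state object   */
        int status;
        do {
            YYSTYPE lval;
            int tok = yylex(&lval);        /* 0 means end of input     */
            status = yypush_parse(ps, tok, &lval);
        } while (status == YYPUSH_MORE);   /* parser wants more tokens */
        yypstate_delete(ps);
    }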
Moreover, it is trivial to find scanners which implement the pull model. (F)lex has always done that; it returns every time it has a new token.
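That pull loop looks like this, where handle_token is a hypothetical consumer of the token stream:

    extern char *yytext;                          /* text of the current token */
    int yylex(void);                              /* generated by (f)lex       */
    void handle_token(int tok, const char *text); /* hypothetical consumer     */

    void drive(void) {
        int tok;
        while ((tok = yylex()) != 0)              /* 0 means end of input */
            handle_token(tok, yytext);
    }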
But to glue the two together into a push or pull composite, we need them to both push or both pull.
This shouldn't be difficult. Neither (f)lex nor yacc/bison produces recursive functions, so saving state between successive calls should be straightforward. (F)lex scanners already do that, with one caveat (mentioned below), which allows them to act in pull mode. Bison's push API does that in order to act in push mode.
But...
Bison claims to have a "pull" API, to go along with the "push" API. But according to the model presented above, it's not really a "pull" API, since it doesn't return from each action (as (f)lex scanners usually do). Generating code which allows a return from an action means that the state would need to be saved before the action is invoked, since C doesn't allow the interception of a return statement, and (as far as I know) bison doesn't do that. (But it might not be difficult to add.)
On the other hand, (f)lex scanners have almost all of the necessary state infrastructure but with one small problem: the most important part of the state, which is the actual state of the state machine being emulated, is not saved. The consequence is that you can only run the state machine from the beginning of a token. Since input chunks do not fall on token boundaries, in order to handle a new chunk of input, the scanner would have to start by rescanning the current partially scanned token. This would get really inefficient if the input chunks were small. If a push interface were provided, the expectation would be that it could be called with successive characters (in order to emulate interactive mode), and that would automatically lead to scanning time quadratic in the length of the token.
In fact, this actually happens (internally) every time that flex scanners need to refill their buffer, so small buffer sizes (or large token sizes) can create very bad performance. Fortunately, in typical language processors, neither of these things is true: tokens are small, buffers are big, and rescans are infrequent.
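For reference, the closest thing flex offers today is the pull-style YY_INPUT hook: you control where the characters come from, but the scanner still decides when to ask for them, and it still rescans across refills as just described. A minimal sketch for the definitions section of the .l file, where my_source_read is a hypothetical function that returns 0 at end of input:

    #define YY_INPUT(buf, result, max_size)              \
        do {                                             \
            int n = my_source_read((buf), (max_size));   \
            (result) = (n == 0) ? YY_NULL : n;           \
        } while (0)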
It would not actually be difficult to save the state machine's state. Not doing so was a deliberate choice, made to save one register in the inner scanning loop, since the original hardware target for flex was a register-starved machine. On many modern architectures this would not be necessary, and it is also not difficult to eliminate the rescan on buffer refill. Consequently, it should not be hard to produce a version of flex with a push interface, but as far as I know, that has not been done.