I am trying to work with a language (mostly, to write an interpreter for it) for which I only have an API, exposed through a DLL, that lets me compile files or strings, check for syntax errors, etc. What I would really like is to parse the syntax myself. I only have a user-manual-level specification of the language (i.e., no real grammar), and I have already written an incomplete parser using lex/yacc and a context-free grammar of my own. But it still chokes on some input, and I'm adding so many weird regexes and exceptions to the rules that I don't think I'll ever get there that way.

I looked into the DLL (with PE Explorer) and found export entries matching an ANTLR3-generated lexer/parser/recognizer (well, multiple recognizers). I set out to build an interface to the functions in the DLL (using ctypes in Python). I started with a dummy empty grammar, generated the headers, then "compiled" the antlr3*.h, LangLexer, and LangParser headers to Python with ctypesgen, and then rebuilt an example found on Stack Overflow. I'm making progress, but I'm not sure how I would go about building a syntax tree without knowing the full grammar (though I do know the names of the tokens). Any clues?
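To make the ctypes part concrete, here is a minimal sketch of binding an exported function by name and declaring its C signature. The helper `bind`, the opaque `OpaqueParser` type, and the export name `LangParserNew` mentioned in the comment are assumptions for illustration only. Since the real DLL is not available here, the demo binds a function from the running interpreter's own C API (`ctypes.pythonapi`) as a stand-in.

```python
import ctypes


def bind(lib, name, restype, argtypes):
    """Look up an export by name and declare its C signature."""
    try:
        func = getattr(lib, name)
    except AttributeError:
        raise LookupError(f"export {name!r} not found in library") from None
    func.restype = restype
    func.argtypes = argtypes
    return func


class OpaqueParser(ctypes.Structure):
    """Opaque handle: the struct layout is unknown, so no fields are declared."""


ParserPtr = ctypes.POINTER(OpaqueParser)

# With the real DLL one might write something like (hypothetical names):
#   lang = ctypes.CDLL("lang.dll")
#   new_parser = bind(lang, "LangParserNew", ParserPtr, [ctypes.c_void_p])

# Stand-in so the sketch runs anywhere: bind a function from CPython itself.
get_version = bind(ctypes.pythonapi, "Py_GetVersion", ctypes.c_char_p, [])
print(get_version().decode())
```

Declaring `restype` and `argtypes` explicitly matters for pointer-returning functions: without it, ctypes assumes a C `int` return value, which can truncate pointers on 64-bit systems.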

And why can you not build a grammar from the language reference documentation? (Apparently the DLL authors did.) I doubt you'll be able to extract a grammar easily from the DLL binary, which is what it sounds like you are starting to attempt. You certainly cannot make a syntax tree without a grammar. ... Are you trying to build a grammar for Matlab? - Ira Baxter
I made a grammar from the docs (in PLY rather than ANTLR), but it doesn't handle the quirks of the language well (things whose handling, AFAIK, isn't really even explained in the docs). Lots of backtracking, multiple layers of parsing. Apparently even the devs themselves had issues writing the DLL (I can see them asking for help on some internet forums). This is a language from a big company for their proprietary products; I'm interested in it mostly out of curiosity and am not a professional programmer. - rienafairefr
I build parsers for a living, including for proprietary languages. They (esp. the proprietary ones) always have quirks like that. The only good cure is to become an expert in the language; you can then hazard a pretty good guess about such stuff. In the end, you only get to guess at the grammar and run lots of code, if you can get it, through the parser. I doubt you are going to extract a grammar from the code in the DLL without superhuman effort in understanding how ANTLR generates code, and how that code is compiled. - Ira Baxter
Having a token list as a starting place is actually a big help, if you believe it to be correct. Do you know what the lexical definitions behind each token are, and whether any tokens are used under mutually exclusive circumstances? (E.g., does the lexer for this language have modes? Many screwball legacy languages do.) - Ira Baxter
Thanks for your help! - rienafairefr
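
To illustrate the lexer-modes question raised in the comments, here is a small sketch of a moded lexer in plain Python. The token names, patterns, and mode names are hypothetical and are not taken from the language under discussion. The point is only that the same characters can yield different tokens depending on the active mode.

```python
import re

# Hypothetical rules: (token name, pattern, mode to switch to or None).
MODES = {
    "code": [
        ("STRING_START", r'"', "string"),
        ("IDENT", r"[A-Za-z_]\w*", None),
        ("NUMBER", r"\d+", None),
        ("OP", r"[=+\-*/();]", None),
        ("WS", r"\s+", None),
    ],
    "string": [
        ("STRING_END", r'"', "code"),
        ("STRING_TEXT", r'[^"]+', None),
    ],
}


def tokenize(text, mode="code"):
    """Return (name, lexeme) pairs, trying only the rules of the active mode."""
    pos = 0
    tokens = []
    while pos < len(text):
        for name, pattern, next_mode in MODES[mode]:
            match = re.compile(pattern).match(text, pos)
            if match:
                if name != "WS":  # whitespace is skipped in code mode
                    tokens.append((name, match.group()))
                pos = match.end()
                if next_mode:
                    mode = next_mode
                break
        else:
            raise SyntaxError(
                f"unexpected character {text[pos]!r} at {pos} in mode {mode!r}"
            )
    return tokens


print(tokenize('x = "a 1"'))
```

In this sketch the `1` inside the quotes comes out as part of `STRING_TEXT` rather than as a `NUMBER`, because the `NUMBER` rule is only active in `code` mode.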

1 Answer

Well, let that be a lesson: I found and contacted the developer who wrote the parser code, and he sent me the grammar. Going to the source is a better solution than deconstructing a binary.