I wrote SLIC (System of Languages for Implementing Compilers) in itself. Then hand compiled it into assembly. There is a lot to SLIC as it was a single compiler of five sub-languages:
- SYNTAX Parser Programming Language PPL
- GENERATOR LISP 2 based tree-crawling PSEUDO code generation language
- ISO In Sequence, PSEUDO code, Optimization language
- PSEUDO Macro like Assembly code producing language.
- MACHOP Assembly-machine instruction defining language.
SLIC was inspired by CWIC (Compiler for Writing and Implementing Compilers). Unlike most compiler development packages SLIC and CWIC addressed code generation with specialize, domain specific, languages. SLIC extends CWICs code generation adding the ISO, PSEUDO and MACHOP sub-languages separating target machine specifics out of the tree-crawling generator language.
LISP 2 trees and lists
The LISP 2 based generator language's dynamic memory management system is a key component. Lists are expressed in the language inclosed in square brackets, its components separated by commas i.e. a three element [a,b,c] list.
Trees:
ADD
/ \
MPY 3
/ \
5 x
are represented by lists whose first entry is a node object:
[ADD,[MPY,5,x],3]
Trees are commonly displayed with the node separate preceding the branches:
ADD[MPY[5,x],3]
Unparsing with LISP 2 based generator functions
A generator function is a named set of (unparse)=>action> pairs ...
<NAME>(<unparse>)=><action>;
(<unparse>)=><action>;
...
(<unparse>)=><action>;
Unparse expressions are tests that match tree patterns and/or object types breaking them apart and assigning those parts to local variable to be processed by its procedural action. Kind of like an overloaded function taking different argument types. Except the ()=> ... tests are attempted in the order coded. The first successful unparse executing its corresponding action. The unparse expressions are disassembling tests. ADD[x,y] matches a two branch ADD tree assigning its branches to local variables x and y. The action may be a simple expression or a .BEGIN ... .END bounded code block. I would use c style { ... } blocks today. Tree matching, [], unparse rules may call generators passing the returned result(s) to the action:
expr_gen(ADD[expr_gen(x),expr_gen(y)])=> x+y;
Specifically the above expr_gen unparse matches a two branch ADD tree. Within the test pattern a single argument generator placed in a tree branch will be called with that branch. Its argument list though are local variables assigned returned objects. Above the unparse specifies a two branch is ADD tree disassembly, recursive pressing each branch to expr_gen. The left branch return placed into into local variables x. Likewise the right branch passed to expr_gen with y the return object. The above could be part of a numeric expression evaluator. There were shortcut features called vectors were in the above instead of the node string a vector of nodes could be used with a vector of corresponding actions:
expr_gen(#node[expr_gen(x),expr_gen(y)])=> #action;
node: ADD, SUB, MPY, DIV;
action: x+y, x-y, x*y, x/y;
(NUMBER(x))=> x;
(SYMBOL(x))=> val:(x);
The above more complete expression evaluator assigning the return from expr_gen left branch to x and the right branch to y. The corresponding action vector performed on x and y returned. The last unparse=>action pairs match numeric and symbol objects.
Symbol and symbol attributes
Symbols may have named attributes. val:(x) access the val attribute of the symbol object contained in x. A generalized symbol table stack is part of SLIC. The SYMBOL table may be pushed and popped providing local symbols for functions. Newly created symbol are cataloged in the top symbol table. Symbol lookup searches the symbol table stack from the top table first backward down the stack.
Generating machine independent code
SLIC's generator language produces PSEUDO instruction objects, appending them to a sections code list. A .FLUSH causes its PSEUDO code list to be run removing each PSEUDO instruction from the list and calling it. After execution a PSEUDO objects memory is released. The procedural bodies of PSEUDOs and GENERATOR actions are basically the same language except for their output. PSEUDO are meant to act as assembly macros providing machine independent code sequentialization. They provide a separation of the specific target machine out of the tree crawling generator language. PSEUDOs call MACHOP functions to output machine code. MACHOPs are used to define assembly pseudo ops (like dc, define constant etc) and machine instruction or a family of like formated instructions using vectored entry. They simply transform their parameters into a sequence of bit fields making up the instruction. MACHOP calls are meant to look like assembly and provide print formatting of the fields for when assembly is shown in the compile listing. In the example code I am using c style commenting that could be easily added but was not in the original languages. MACHOPs are producing code into a bit addressable memory. The SLIC linker handles output of the compiler. A MACHOP for the DEC-10 user mode instructions using vectored entry:
.MACHOP #opnm register,@indirect offset (index): // Instruction's parameters.
.MORG 36, O(18): $/36; // Align to 36 bit boundary print format: 18 bit octal $/36
O(9): #opcd; // Op code 9 bit octal print out
(4): register; // 4 bit register field appended print
(1): indirect; // 1 bit appended print
(4): index; // 4 bit index register appended print
O(18): if (#opcd&&3==1) offset // immediate mode use value else
else offset/36; // memory address divide by 36
// to get word address.
// Vectored entry opcode table:
#opnm := MOVE, MOVEI, MOVEM, MOVES, MOVS, MOVSI, MOVSM, MOVSS,
MOVN, MOVNI, MOVNM, MOVNS, MOVM, MOVMI, MOVMM, MOVMS,
IMUL, IMULI, IMULM, IMULB, MUL, MULI, MULM, MULB,
...
TDO, TSO, TDOE, TSOE, TDOA, TSOA, TDON, TSON;
// corresponding opcode value:
#opcd := 0O200, 0O201, 0O202, 0O203, 0O204, 0O205, 0O206, 0O207,
0O210, 0O211, 0O212, 0O213, 0O214, 0O215, 0O216, 0O217,
0O220, 0O221, 0O222, 0O223, 0O224, 0O225, 0O226, 0O227,
...
0O670, 0O671, 0O672, 0O673, 0O674, 0O675, 0O676, 0O677;
The .MORG 36, O(18): $/36; aligns the location to a 36 bit boundary printing the location $/36 word address of 18 bits in octal. The 9 bit opcd,4 bit register, indirect bit and 4 bit index register are combined and printed as if a single 18 bit field. The 18 bit address/36 or immediate value is output and printed in octal. A MOVEI example print out with r1 = 1 and r2=2:
400020 201082 000005 MOVEI r1,5(r2)
With the compiler assembly option you get the generated assembly code in the compile listing.
Link it together
The SLIC linker is supplied as a library that handles the linking and symbol resolutions. Target specific output load file formatting though must be written for target machines and linked with the linker library library.
The generator language is capable of writing trees to a file and reading them allowing a multipass compiler to be implemented.
Short summery of code generation and origins
I have went over the code generation first to insure it is understood that SLIC was a true compiler compiler. SLIC was inspired by CWIC (Compiler for Writing and Implementing Compilers) developed at Systems Development Corporation in the late 1960s. CWIC only had SYNTAX and GENERATOR languages producing numeric byte code out of the GENERATOR language. Byte code was placed or planted (the term used in CWICs documentation) into memory buffers associated with named sections and written out by a .FLUSH statement. An ACM paper on CWIC is available from the ACM archives.
Successfully implementing a major programming language
In the late 1970s SLIC was used to write a COBOL cross compiler. Completed in about 3 months mostly by a single programmer. I worked a bit with the programmer as needed. Another programer wrote the runtime library and MACHOPs for the target TI-990 mini-COMPUTER. That COBOL compiler compiled substantially more lines per second then the DEC-10 native COBOL compiler written in assembly.
More to a compiler then usually talked about
A big part of writing a compiler from scratch is the run time library. You need a symbol table. You need input and output. Dynamic memory management etc. It easily can be more work writing the runtime library for a compiler then writing the compiler. But with SLIC that runtime library is common to all compilers developen in SLIC. Note there are two runtime libraries. One for the language's (COBOL for example) target machine. The other is the compiler compilers runtime library.
I think I have established that these were not parser generators. So now with a little understanding of the back end I can explain the parser programming language.
Parser programming language
The parser is written using formula written in the form of simple equations.
<name> <formula type operator> <expression> ;
The language element at the lowest level is the character. Tokens are formed from a subsets of the characters of the language. Character classes are used to name and define those character subsets. The character class defining operator is the colon (:) character. Characters that are members of the class are coded on the right side of the definition. Printable characters are enclosed in primes single ' strings. Nonprinting and special characters may be represented by their numeric ordinal. Class members are separated by an alternative | operator. A class formula ends with a semicolon. Character classes may include previously defined classes:
/* Character Class Formula class_mask */
bin: '0'|'1'; // 0b00000010
oct: bin|'2'|'3'|'4'|'5'|'6'|'7'; // 0b00000110
dgt: oct|'8'|'9'; // 0b00001110
hex: dgt|'A'|'B'|'C'|'D'|'E'|'F'|'a'|'b'|'c'|'d'|'e'|'f'; // 0b00011110
upr: 'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|
'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z'; // 0b00100000
lwr: 'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|
'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z'; // 0b01000000
alpha: upr|lwr; // 0b01100000
alphanum: alpha|dgt; // 0b01101110
The skip_class 0b00000001 is predefined but may be overroad be defining a skip_class.
In summary: A character class is a list of alternative that can only be a character constant, a character's ordinal, or a previously defined character class. As I implemented character classes: The class formula is assigned a class bit mask. (Shown in comments above) Any class formula having any character literal or ordinal causes a class bit to be allocated. A mask is made by oring the included class(es)'s class mask(s) together with the allocated bit (if any). A class table is created from the character classes. An entry indexed by a character's ordinal contains bits indicating the character's class memberships. Class testing is done inline. An IA-86 code example with the character's ordinal in eax illustrates class testing:
test byte ptr [eax+_classmap],dgt
Followed by a:
jne <success>
or
je <failure>
IA-86 instruction code examples are used because I think IA-86 instructions are more widely known today. The class name evaluating to its class mask is non-destructively ANDed with the class-table indexed by the characters ordinal(in eax). A non-zero result indicates class membership. (EAX is zeroed except for al(the low 8 bits of EAX) that contains the character).
Tokens were a bit different in these old compilers. Key words were not explained as tokens. They simply were matched by quoted string constants in the parser language. Quoted strings are not normally kept. Modifiers may be used. A + keeps the string matched. (i.e. +'-' matches a - character keeping the character when successful) The , operation (i.e. ,'E') inserts the string into the token. White space is handled by the token formula skipping leading SKIP_CLASS characters until a first match is made. Note that an explicit skip_class character match will stop the skipping allowing a token to start with a skip_class character. The string token formula skips leading skip_class characters matching a single quote quitedd character or a double quoted string. Of interest is the matching a " character within a " quoted string:
string .. (''' .ANY ''' | '"' $(-"""" .ANY | """""","""") '"') MAKSTR[];
The first alternative matches any single quote quoted character. The right alternative matches a double quote quoted string that may include double quote characters using two " character together to represent a single " character. This formula defines the strings used in its own definition. The inner right alternative '"' $(-"""" .ANY | """""","""") '"' matches a double quote quoted string. We can use a single ' quoted character to match a double quote " character. However within the double " quoted string if we wish to use a " character we must use two " characters to get one. For example in the inner left alternative matching any character except a quote:
-"""" .ANY
a negative peek ahead -"""" is used that when successful (not matching a " character) then matches .ANY character (which can not be a " character because -"""" eliminated that possibility). The right alternative is taking on -"""" matching a " character and failing were the right alternative:
"""""",""""
tries to match two " characters replacing them with a single double " using ,"""" to inserting thw single " character. Both inner alternatives failing the closing string quote character is matched and MAKSTR[] called to create a string object. The $ sequence, loop while successful, operator is used in matching a sequence. Token formula skip leading skip class characters(whit space). Once a first match is made skip_class skipping is disabled. We can call functions programed in other languages using []. MAKSTR[], MAKBIN[], MAKOCT[], MAKHEX[], MAKFLOAT[], and MAKINT[] are supplied library function that convert a matched token string to a typed object. The number formula below illustrates a fairly complex token recognition:
number .. "0B" bin $bin MAKBIN[] // binary integer
|"0O" oct $oct MAKOCT[] // octal integer
|("0H"|"0X") hex $hex MAKHEX[] // hexadecimal integer
// look for decimal number determining if integer or floating point.
| ('+'|+'-'|--) // only - matters
dgt $dgt // integer part
( +'.' $dgt // fractional part?
((+'E'|'e','E') // exponent part
('+'|+'-'|--) // Only negative matters
dgt(dgt(dgt|--)|--)|--) // 1 2 or 3 digit exponent
MAKFLOAT[] ) // floating point
MAKINT[]; // decimal integer
The above number token formula recognizes integer and floating point numbers. The -- alternatives are always successful. Numeric objects may be used in calculations. The token objects are pushed onto the parse stack on success of the formula. The exponent lead in (+'E'|'e','E') is interesting. We wish to always have an uppercase E for MAKEFLOAT[]. But we allow a lower case 'e' replacing it using ,'E'.
You may have noticed consistencies of character class and token formula. The parsing formula continue that adding backtracking alternatives and tree construction operators. Backtracking and non-backtracking alternative operators may not be mixed within an expression level. You may not have (a | b \ c) mixing non-backtracking | withe \ backtracking alternative. (a\b\c), (a|b|c) and ((a|b)\c) are valid. A \ backtracking alternative saves the parse state before attempting its left alternative and on failure restores the parse state before attempting the right alternative. In a sequence of alternatives the first successful alternative satisfies the group. Further alternatives are not attempted. Factoring and grouping provides for a continuous advancing parse. The backtrack alternative creates a saved state of the parse before it attempts its left alternative. Backtracking is required when the parse may make a partial match and then fail:
(a b | c d)\ e
In the above if a returns failure the alternative c d is attempted. If then c returns failure the backtrack alternative will be attempted. If a succeeds and b fails the parse wile be backtracked and e attempted. Likewise a failing c successful and b fails the parse is backtracked and the alternative e taken. Backtracking is not limited to within a formula. If any parsing formula makes a partial match at any time and then fails the parse is reset to the top backtrack and its alternative taken. A compile failure can occur if code has been output sense the backtrack was created. A backtrack is set before starting the compile. Returning failure or backtracking to it is a compiler failure. Backtracks are stacked. We may use negative - and positive ? peek/look ahead operators to test without advancing the parse. being string test is a peek ahead only needing the input state saved and reset. A look ahead would be a parsing expression that makes a partial match before failing. A look ahead is implemented using backtracking.
The parser language is neither an LL or LR parser. But a programming language for writing a recursive decent parser in which you program tree construction:
:<node name> creates a node object and pushes it onto the node stack.
.. Token formula create token objects and push them onto
the parse stack.
!<number> pops the top node object and top <number> of parstack
entries into a list representation of the tree. The
tree then pushed onto the parse stack.
+[ ... ]+ creates a list of the parse stack entries created
between them:
'(' +[argument $(',' argument]+ ')'
could parse an argument list. into a list.
A commonly used parsing example is an arithmetic expression:
Exp = Term $(('+':ADD|'-':SUB) Term!2);
Term = Factor $(('*':MPY|'/':DIV) Factor!2);
Factor = ( number
| id ( '(' +[Exp $(',' Exp)]+ ')' :FUN!2
| --)
| '(' Exp ')" )
(^' Factor:XPO!2 |--);
Exp and Term using a loop creates a left handed tree. Factor using right recursion creates a right handed tree:
d^(x+5)^3-a+b*c => ADD[SUB[EXP[EXP[d,ADD[x,5]],3],a],MPY[b,c]]
ADD
/ \
SUB MPY
/ \ / \
EXP a b c
/ \
d EXP
/ \
ADD 3
/ \
x 5
Here is a bit of the cc compiler, an updated version of SLIC with c style comments. Function types (grammar, token, character class, generator, PSEUDO, or MACHOP are determined by their initial syntax following their id.
With these top-down parsers you start with a program defining formula:
program = $((declaration // A program is a sequence of
// declarations terminated by
|.EOF .STOP) // End Of File finish & stop compile
\ // Backtrack: .EOF failed or
// declaration long-failed.
(ERRORX["?Error?"] // report unknown error
// flagging furthest parse point.
$(-';' (.ANY // find a ';'. skiping .ANY
| .STOP)) // character: .ANY fails on end of file
// so .STOP ends the compile.
// (-';') failing breaks loop.
';')); // Match ';' and continue
declaration = "#" directive // Compiler directive.
| comment // skips comment text
| global DECLAR[*1] // Global linkage
|(id // functions starting with an id:
( formula PARSER[*1] // Parsing formula
| sequencer GENERATOR[*1] // Code generator
| optimizer ISO[*1] // Optimizer
| pseudo_op PRODUCTION[*1] // Pseudo instruction
| emitor_op MACHOP[*1] // Machine instruction
) // All the above start with an identifier
\ (ERRORX["Syntax error."]
garbol); // skip over error.
// Note how id is factored off and later combined when creating the tree.
formula = ("==" syntax :BCKTRAK // backtrack grammar formula
|'=' syntax :SYNTAX // grammar formula.
|':' chclass :CLASS // character class define
|".." token :TOKEN // token formula
)';' !2 // Combine node name with id
// parsed in calling declaration
// formula and tree produced
// by the called syntax, token
// or character class formula.
$(-(.NL |"/*") (.ANY|.STOP)); Comment ; to line separator?
chclass = +[ letter $('|' letter) ]+;// a simple list of character codes
// except
letter = char | number | id; // when including another class
syntax = seq ('|' alt1|'\' alt2 |--);
alt1 = seq:ALT!2 ('|' alt1|--); Non-backtrack alternative sequence.
alt2 = seq:BKTK!2 ('\' alt2|--); backtrack alternative sequence
seq = +[oper $oper]+;
oper = test | action | '(' syntax ')' | comment;
test = string | id ('[' (arg_list| ,NILL) ']':GENCALL!2|.EMPTY);
action = ':' id:NODE!1
| '!' number:MAKTREE!1
| "+[" seq "]+" :MAKLST!1;
// C style comments
comment = "//" $(-.NL .ANY)
| "/*" $(-"*/" .ANY) "*/";
Of note is how the parser language handles commenting and error recovery.
I think I have answered the question. Having written a big part of SLICs successor, the cc language in itself here. There is no compiler for it as yet. But I can hand compile it into assembly code, naked asm c or c++ functions.
Foo
compiler inFoo
itself. Your source code would be aFoo
program withFoo
instructions for how to generate machine code (or, in more modern terms, some other backend code) given aFoo
source code input. Now, you would need something or someone that understandsFoo
's specification well enough to trace out the correct output of that program by hand, run on itself. As far as I know, however, precisely what I'm describing has never actually been done with any language, for obvious reasons. – Marcel Besixdouze