Writing a compiler in its own language

219

votes

Intuitively, it would seems that a compiler for language Foo cannot itself be written in Foo. More specifically, the first compiler for language Foo cannot be written in Foo, but any subsequent compiler could be written for Foo.

But is this actually true? I have some very vague recollection of reading about a language whose first compiler was written in "itself". Is this possible, and if so how?

compiler-constructionbootstrapping

Possible duplicate of Bootstrapping still requires outside support - nbro

This is a very old question, but say I wrote an interpreter for language Foo in Java. Then with the language foo, I wrote it's own interpreter. Foo would still require the JRE right? - George Xavier

You could write the first Foo compiler in Foo itself. Your source code would be a Foo program with Foo instructions for how to generate machine code (or, in more modern terms, some other backend code) given a Foo source code input. Now, you would need something or someone that understands Foo's specification well enough to trace out the correct output of that program by hand, run on itself. As far as I know, however, precisely what I'm describing has never actually been done with any language, for obvious reasons. - Marcel Besixdouze

14 Answers

251

votes

This is called "bootstrapping". You must first build a compiler (or interpreter) for your language in some other language (usually Java or C). Once that is done, you can write a new version of the compiler in language Foo. You use the first bootstrap compiler to compile the compiler, and then use this compiled compiler to compile everything else (including future versions of itself).

Most languages are indeed created in this fashion, partially because language designers like to use the language they are creating, and also because a non-trivial compiler often serves as a useful benchmark for how "complete" the language may be.

An example of this would be Scala. Its first compiler was created in Pizza, an experimental language by Martin Odersky. As of version 2.0, the compiler was completely re-written in Scala. From that point on, the old Pizza compiler could be completely discarded, due to the fact that the new Scala compiler could be used to compile itself for future iterations.

votes

I recall listening to a Software Engineering Radio podcast wherein Dick Gabriel spoke about bootstrapping the original LISP interpreter by writing a bare-bones version in LISP on paper and hand assembling it into machine code. From then on, the rest of the LISP features were both written in and interpreted with LISP.

votes

When you write your first compiler for C, you write it in some other language. Now, you have a compiler for C in, say, assembler. Eventually, you will come to the place where you have to parse strings, specifically escape sequences. You will write code to convert \n to the character with the decimal code 10 (and \r to 13, etc).

After that compiler is ready, you will start to reimplement it in C. This process is called "bootstrapping".

The string parsing code will become:

...
if (c == 92) { // backslash
    c = getc();
    if (c == 110) { // n
        return 10;
    } else if (c == 92) { // another backslash
        return 92;
    } else {
        ...
    }
}
...

When this compiles, you have a binary which understands '\n'. This means you can change the source code:

...
if (c == '\\') {
    c = getc();
    if (c == 'n') {
        return '\n';
    } else if (c == '\\') {
        return '\\';
    } else {
        ...
    }
}
...

So where is the information that '\n' is the code for 13? It's in the binary! It's like DNA: Compiling C source code with this binary will inherit this information. If the compiler compiles itself, it will pass this knowledge on to its offspring. From this point on, there is no way to see from the source alone what the compiler will do.

If you want to hide a virus in the source of some program, you can do it like this: Get the source of a compiler, find the function which compiles functions and replace it with this one:

void compileFunction(char * name, char * filename, char * code) {
    if (strcmp("compileFunction", name) == 0 && strcmp("compile.c", filename) == 0) {
        code = A;
    } else if (strcmp("xxx", name) == 0 && strcmp("yyy.c", filename) == 0) {
        code = B;
    }

    ... code to compile the function body from the string in "code" ...
}

The interesting parts are A and B. A is the source code for compileFunction including the virus, probably encrypted in some way so it's not obvious from searching the resulting binary. This makes sure that compiling to compiler with itself will preserve the virus injection code.

B is the same for the function we want to replace with our virus. For example, it could be the function "login" in the source file "login.c" which is probably from the Linux kernel. We could replace it with a version that will accept the password "joshua" for the root account in addition to the normal password.

If you compile that and spread it as a binary, there will be no way to find the virus by looking at the source.

The original source of the idea: https://web.archive.org/web/20070714062657/http://www.acm.org/classics/sep95/

votes

Adding a curiosity to the previous answers.

Here's a quote from the Linux From Scratch manual, at the step where one starts building the GCC compiler from its source. (Linux From Scratch is a way to install Linux that is radically different from installing a distribution, in that you have to compile really every single binary of the target system.)

make bootstrap
The 'bootstrap' target does not just compile GCC, but compiles it several times. It uses the programs compiled in a first round to compile itself a second time, and then again a third time. It then compares these second and third compiles to make sure it can reproduce itself flawlessly. This also implies that it was compiled correctly.

That use of the 'bootstrap' target is motivated by the fact that the compiler one uses to build the target system's toolchain may not have the very same version of the target compiler. Proceeding in that way one is sure to obtain, in the target system, a compiler that can compile itself.

votes

You cannot write a compiler in itself because you have nothing to compile your starting source code with. There are two approaches to solving this.

The least favored is the following. You write a minimal compiler in assembler (yuck) for a minimal set of the language and then use that compiler to implement extra features of the language. Building your way up until you have a compiler with all the language features for itself. A painful process that is usually only done when you have no other choice.

The preferred approach is to use a cross compiler. You change the back end of an existing compiler on a different machine to create output that runs on the target machine. Then you have a nice full compiler up and working on the target machine. Most popular for this is the C language, as there are plenty of existing compilers that have pluggable back ends that can be swapped out.

A little known fact is that the GNU C++ compiler has an implementation that uses only the C subset. The reason being it is usually easy to find a C compiler for a new target machine that allows you to then build the full GNU C++ compiler from it. You have now boot strapped yourself to having a C++ compiler on the target machine.

votes

Generally, you need to have a working (if primative) cut of the compiler working first - then you can start thinking about making it self-hosting. This is actually considered an important milestone in some langauges.

From what I remember from "mono", it is likely they will need to add a few things to reflection to get it working: the mono team keep pointing out that some things simply aren't possible with Reflection.Emit; of course, the MS team might prove them wrong.

This has a few real advantages: it is a fairly good unit test, for starters! And you only have one language to worry about (i.e. it is possible a C# expert might not know much C++; but now thy can fix the C# compiler). But I wonder if there isn't an amount of professional pride at work here: they simply want it to be self-hosting.

Not quite a compiler, but I've recently been working on a system that is self hosting; the code generator is used to generate the code generator... so if the schema changes I simply run it on itself : new version. If there is a bug, I just go back to an earlier version and try again. Very convenient, and very easy to maintain.

Update 1

I've just watched this video of Anders at PDC, and (about an hour in) he does give some much more valid reasons - all about the compiler as a service. Just for the record.

votes

Here's a dump (difficult topic to search on, actually):

Smalltalk
C

This is also the idea of PyPy and Rubinius:

(I think this might also apply to Forth, but I don't know anything about Forth.)

votes

I wrote SLIC (System of Languages for Implementing Compilers) in itself. Then hand compiled it into assembly. There is a lot to SLIC as it was a single compiler of five sub-languages:

SYNTAX Parser Programming Language PPL
GENERATOR LISP 2 based tree-crawling PSEUDO code generation language
ISO In Sequence, PSEUDO code, Optimization language
PSEUDO Macro like Assembly code producing language.
MACHOP Assembly-machine instruction defining language.

SLIC was inspired by CWIC (Compiler for Writing and Implementing Compilers). Unlike most compiler development packages SLIC and CWIC addressed code generation with specialize, domain specific, languages. SLIC extends CWICs code generation adding the ISO, PSEUDO and MACHOP sub-languages separating target machine specifics out of the tree-crawling generator language.

LISP 2 trees and lists

The LISP 2 based generator language's dynamic memory management system is a key component. Lists are expressed in the language inclosed in square brackets, its components separated by commas i.e. a three element [a,b,c] list.

Trees:

     ADD
    /   \
  MPY     3
 /   \
5     x

are represented by lists whose first entry is a node object:

[ADD,[MPY,5,x],3]

Trees are commonly displayed with the node separate preceding the branches:

ADD[MPY[5,x],3]

Unparsing with LISP 2 based generator functions

A generator function is a named set of (unparse)=>action> pairs ...

<NAME>(<unparse>)=><action>;
      (<unparse>)=><action>;
            ...
      (<unparse>)=><action>;

Unparse expressions are tests that match tree patterns and/or object types breaking them apart and assigning those parts to local variable to be processed by its procedural action. Kind of like an overloaded function taking different argument types. Except the ()=> ... tests are attempted in the order coded. The first successful unparse executing its corresponding action. The unparse expressions are disassembling tests. ADD[x,y] matches a two branch ADD tree assigning its branches to local variables x and y. The action may be a simple expression or a .BEGIN ... .END bounded code block. I would use c style { ... } blocks today. Tree matching, [], unparse rules may call generators passing the returned result(s) to the action:

expr_gen(ADD[expr_gen(x),expr_gen(y)])=> x+y;

Specifically the above expr_gen unparse matches a two branch ADD tree. Within the test pattern a single argument generator placed in a tree branch will be called with that branch. Its argument list though are local variables assigned returned objects. Above the unparse specifies a two branch is ADD tree disassembly, recursive pressing each branch to expr_gen. The left branch return placed into into local variables x. Likewise the right branch passed to expr_gen with y the return object. The above could be part of a numeric expression evaluator. There were shortcut features called vectors were in the above instead of the node string a vector of nodes could be used with a vector of corresponding actions:

expr_gen(#node[expr_gen(x),expr_gen(y)])=> #action;

  node:   ADD, SUB, MPY, DIV;
  action: x+y, x-y, x*y, x/y;

        (NUMBER(x))=> x;
        (SYMBOL(x))=> val:(x);

The above more complete expression evaluator assigning the return from expr_gen left branch to x and the right branch to y. The corresponding action vector performed on x and y returned. The last unparse=>action pairs match numeric and symbol objects.

Symbol and symbol attributes

Symbols may have named attributes. val:(x) access the val attribute of the symbol object contained in x. A generalized symbol table stack is part of SLIC. The SYMBOL table may be pushed and popped providing local symbols for functions. Newly created symbol are cataloged in the top symbol table. Symbol lookup searches the symbol table stack from the top table first backward down the stack.

Generating machine independent code

SLIC's generator language produces PSEUDO instruction objects, appending them to a sections code list. A .FLUSH causes its PSEUDO code list to be run removing each PSEUDO instruction from the list and calling it. After execution a PSEUDO objects memory is released. The procedural bodies of PSEUDOs and GENERATOR actions are basically the same language except for their output. PSEUDO are meant to act as assembly macros providing machine independent code sequentialization. They provide a separation of the specific target machine out of the tree crawling generator language. PSEUDOs call MACHOP functions to output machine code. MACHOPs are used to define assembly pseudo ops (like dc, define constant etc) and machine instruction or a family of like formated instructions using vectored entry. They simply transform their parameters into a sequence of bit fields making up the instruction. MACHOP calls are meant to look like assembly and provide print formatting of the fields for when assembly is shown in the compile listing. In the example code I am using c style commenting that could be easily added but was not in the original languages. MACHOPs are producing code into a bit addressable memory. The SLIC linker handles output of the compiler. A MACHOP for the DEC-10 user mode instructions using vectored entry:

.MACHOP #opnm register,@indirect offset (index): // Instruction's parameters.
.MORG 36, O(18): $/36; // Align to 36 bit boundary print format: 18 bit octal $/36
O(9):  #opcd;          // Op code 9 bit octal print out
 (4):  register;       // 4 bit register field appended print
 (1):  indirect;       // 1 bit appended print
 (4):  index;          // 4 bit index register appended print
O(18): if (#opcd&&3==1) offset // immediate mode use value else
       else offset/36;         // memory address divide by 36
                               // to get word address.
// Vectored entry opcode table:
#opnm := MOVE, MOVEI, MOVEM, MOVES, MOVS, MOVSI, MOVSM, MOVSS,
         MOVN, MOVNI, MOVNM, MOVNS, MOVM, MOVMI, MOVMM, MOVMS,
         IMUL, IMULI, IMULM, IMULB, MUL,  MULI,  MULM,  MULB,
                           ...
         TDO,  TSO,   TDOE,  TSOE,  TDOA, TSOA,  TDON,  TSON;
// corresponding opcode value:
#opcd := 0O200, 0O201, 0O202, 0O203, 0O204, 0O205, 0O206, 0O207,
         0O210, 0O211, 0O212, 0O213, 0O214, 0O215, 0O216, 0O217,
         0O220, 0O221, 0O222, 0O223, 0O224, 0O225, 0O226, 0O227,
                           ...
         0O670, 0O671, 0O672, 0O673, 0O674, 0O675, 0O676, 0O677;

The .MORG 36, O(18): $/36; aligns the location to a 36 bit boundary printing the location $/36 word address of 18 bits in octal. The 9 bit opcd,4 bit register, indirect bit and 4 bit index register are combined and printed as if a single 18 bit field. The 18 bit address/36 or immediate value is output and printed in octal. A MOVEI example print out with r1 = 1 and r2=2:

400020 201082 000005            MOVEI r1,5(r2)

With the compiler assembly option you get the generated assembly code in the compile listing.

Link it together

The SLIC linker is supplied as a library that handles the linking and symbol resolutions. Target specific output load file formatting though must be written for target machines and linked with the linker library library.

The generator language is capable of writing trees to a file and reading them allowing a multipass compiler to be implemented.

Short summery of code generation and origins

I have went over the code generation first to insure it is understood that SLIC was a true compiler compiler. SLIC was inspired by CWIC (Compiler for Writing and Implementing Compilers) developed at Systems Development Corporation in the late 1960s. CWIC only had SYNTAX and GENERATOR languages producing numeric byte code out of the GENERATOR language. Byte code was placed or planted (the term used in CWICs documentation) into memory buffers associated with named sections and written out by a .FLUSH statement. An ACM paper on CWIC is available from the ACM archives.

Successfully implementing a major programming language

In the late 1970s SLIC was used to write a COBOL cross compiler. Completed in about 3 months mostly by a single programmer. I worked a bit with the programmer as needed. Another programer wrote the runtime library and MACHOPs for the target TI-990 mini-COMPUTER. That COBOL compiler compiled substantially more lines per second then the DEC-10 native COBOL compiler written in assembly.

More to a compiler then usually talked about

A big part of writing a compiler from scratch is the run time library. You need a symbol table. You need input and output. Dynamic memory management etc. It easily can be more work writing the runtime library for a compiler then writing the compiler. But with SLIC that runtime library is common to all compilers developen in SLIC. Note there are two runtime libraries. One for the language's (COBOL for example) target machine. The other is the compiler compilers runtime library.

I think I have established that these were not parser generators. So now with a little understanding of the back end I can explain the parser programming language.

Parser programming language

The parser is written using formula written in the form of simple equations.

<name> <formula type operator> <expression> ;

The language element at the lowest level is the character. Tokens are formed from a subsets of the characters of the language. Character classes are used to name and define those character subsets. The character class defining operator is the colon (:) character. Characters that are members of the class are coded on the right side of the definition. Printable characters are enclosed in primes single ' strings. Nonprinting and special characters may be represented by their numeric ordinal. Class members are separated by an alternative | operator. A class formula ends with a semicolon. Character classes may include previously defined classes:

/*  Character Class Formula                                    class_mask */
bin: '0'|'1';                                                // 0b00000010
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';                            // 0b00000110
dgt: oct|'8'|'9';                                            // 0b00001110
hex: dgt|'A'|'B'|'C'|'D'|'E'|'F'|'a'|'b'|'c'|'d'|'e'|'f';    // 0b00011110
upr:  'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|
      'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z';   // 0b00100000
lwr:  'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|
      'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z';   // 0b01000000
alpha:  upr|lwr;                                             // 0b01100000
alphanum: alpha|dgt;                                         // 0b01101110

The skip_class 0b00000001 is predefined but may be overroad be defining a skip_class.

In summary: A character class is a list of alternative that can only be a character constant, a character's ordinal, or a previously defined character class. As I implemented character classes: The class formula is assigned a class bit mask. (Shown in comments above) Any class formula having any character literal or ordinal causes a class bit to be allocated. A mask is made by oring the included class(es)'s class mask(s) together with the allocated bit (if any). A class table is created from the character classes. An entry indexed by a character's ordinal contains bits indicating the character's class memberships. Class testing is done inline. An IA-86 code example with the character's ordinal in eax illustrates class testing:

test    byte ptr [eax+_classmap],dgt

Followed by a:

jne      <success>

je       <failure>

IA-86 instruction code examples are used because I think IA-86 instructions are more widely known today. The class name evaluating to its class mask is non-destructively ANDed with the class-table indexed by the characters ordinal(in eax). A non-zero result indicates class membership. (EAX is zeroed except for al(the low 8 bits of EAX) that contains the character).

Tokens were a bit different in these old compilers. Key words were not explained as tokens. They simply were matched by quoted string constants in the parser language. Quoted strings are not normally kept. Modifiers may be used. A + keeps the string matched. (i.e. +'-' matches a - character keeping the character when successful) The , operation (i.e. ,'E') inserts the string into the token. White space is handled by the token formula skipping leading SKIP_CLASS characters until a first match is made. Note that an explicit skip_class character match will stop the skipping allowing a token to start with a skip_class character. The string token formula skips leading skip_class characters matching a single quote quitedd character or a double quoted string. Of interest is the matching a " character within a " quoted string:

string .. (''' .ANY ''' | '"' $(-"""" .ANY | """""","""") '"') MAKSTR[];

The first alternative matches any single quote quoted character. The right alternative matches a double quote quoted string that may include double quote characters using two " character together to represent a single " character. This formula defines the strings used in its own definition. The inner right alternative '"' $(-"""" .ANY | """""","""") '"' matches a double quote quoted string. We can use a single ' quoted character to match a double quote " character. However within the double " quoted string if we wish to use a " character we must use two " characters to get one. For example in the inner left alternative matching any character except a quote:

-"""" .ANY

a negative peek ahead -"""" is used that when successful (not matching a " character) then matches .ANY character (which can not be a " character because -"""" eliminated that possibility). The right alternative is taking on -"""" matching a " character and failing were the right alternative:

"""""",""""

tries to match two " characters replacing them with a single double " using ,"""" to inserting thw single " character. Both inner alternatives failing the closing string quote character is matched and MAKSTR[] called to create a string object. The $ sequence, loop while successful, operator is used in matching a sequence. Token formula skip leading skip class characters(whit space). Once a first match is made skip_class skipping is disabled. We can call functions programed in other languages using []. MAKSTR[], MAKBIN[], MAKOCT[], MAKHEX[], MAKFLOAT[], and MAKINT[] are supplied library function that convert a matched token string to a typed object. The number formula below illustrates a fairly complex token recognition:

number .. "0B" bin $bin MAKBIN[]        // binary integer
         |"0O" oct $oct MAKOCT[]        // octal integer
         |("0H"|"0X") hex $hex MAKHEX[] // hexadecimal integer
// look for decimal number determining if integer or floating point.
         | ('+'|+'-'|--)                // only - matters
           dgt $dgt                     // integer part
           ( +'.' $dgt                  // fractional part?
              ((+'E'|'e','E')           // exponent  part
               ('+'|+'-'|--)            // Only negative matters
               dgt(dgt(dgt|--)|--)|--)  // 1 2 or 3 digit exponent
             MAKFLOAT[] )               // floating point
           MAKINT[];                    // decimal integer

The above number token formula recognizes integer and floating point numbers. The -- alternatives are always successful. Numeric objects may be used in calculations. The token objects are pushed onto the parse stack on success of the formula. The exponent lead in (+'E'|'e','E') is interesting. We wish to always have an uppercase E for MAKEFLOAT[]. But we allow a lower case 'e' replacing it using ,'E'.

You may have noticed consistencies of character class and token formula. The parsing formula continue that adding backtracking alternatives and tree construction operators. Backtracking and non-backtracking alternative operators may not be mixed within an expression level. You may not have (a | b \ c) mixing non-backtracking | withe \ backtracking alternative. (a\b\c), (a|b|c) and ((a|b)\c) are valid. A \ backtracking alternative saves the parse state before attempting its left alternative and on failure restores the parse state before attempting the right alternative. In a sequence of alternatives the first successful alternative satisfies the group. Further alternatives are not attempted. Factoring and grouping provides for a continuous advancing parse. The backtrack alternative creates a saved state of the parse before it attempts its left alternative. Backtracking is required when the parse may make a partial match and then fail:

(a b | c d)\ e

In the above if a returns failure the alternative c d is attempted. If then c returns failure the backtrack alternative will be attempted. If a succeeds and b fails the parse wile be backtracked and e attempted. Likewise a failing c successful and b fails the parse is backtracked and the alternative e taken. Backtracking is not limited to within a formula. If any parsing formula makes a partial match at any time and then fails the parse is reset to the top backtrack and its alternative taken. A compile failure can occur if code has been output sense the backtrack was created. A backtrack is set before starting the compile. Returning failure or backtracking to it is a compiler failure. Backtracks are stacked. We may use negative - and positive ? peek/look ahead operators to test without advancing the parse. being string test is a peek ahead only needing the input state saved and reset. A look ahead would be a parsing expression that makes a partial match before failing. A look ahead is implemented using backtracking.

The parser language is neither an LL or LR parser. But a programming language for writing a recursive decent parser in which you program tree construction:

:<node name> creates a node object and pushes it onto the node stack.
..           Token formula create token objects and push them onto 
             the parse stack.
!<number>    pops the top node object and top <number> of parstack 
             entries into a list representation of the tree. The 
             tree then pushed onto the parse stack.
+[ ... ]+    creates a list of the parse stack entries created 
             between them:
              '(' +[argument $(',' argument]+ ')'
             could parse an argument list. into a list.

A commonly used parsing example is an arithmetic expression:

Exp = Term $(('+':ADD|'-':SUB) Term!2); 
Term = Factor $(('*':MPY|'/':DIV) Factor!2);
Factor = ( number
         | id  ( '(' +[Exp $(',' Exp)]+ ')' :FUN!2
               | --)
         | '(' Exp ')" )
         (^' Factor:XPO!2 |--);

Exp and Term using a loop creates a left handed tree. Factor using right recursion creates a right handed tree:

d^(x+5)^3-a+b*c => ADD[SUB[EXP[EXP[d,ADD[x,5]],3],a],MPY[b,c]]

              ADD
             /   \
          SUB     MPY
         /   \   /   \
      EXP     a b     c
     /   \
    d     EXP     
         /   \
      ADD     3
     /   \
    x     5

Here is a bit of the cc compiler, an updated version of SLIC with c style comments. Function types (grammar, token, character class, generator, PSEUDO, or MACHOP are determined by their initial syntax following their id. With these top-down parsers you start with a program defining formula:

program = $((declaration            // A program is a sequence of
                                    // declarations terminated by
            |.EOF .STOP)            // End Of File finish & stop compile
           \                        // Backtrack: .EOF failed or
                                    // declaration long-failed.
             (ERRORX["?Error?"]     // report unknown error
                                    // flagging furthest parse point.
              $(-';' (.ANY          // find a ';'. skiping .ANY
                     | .STOP))      // character: .ANY fails on end of file
                                    // so .STOP ends the compile.
                                    // (-';') failing breaks loop.
              ';'));                // Match ';' and continue

declaration =  "#" directive                // Compiler directive.
             | comment                      // skips comment text
             | global        DECLAR[*1]     // Global linkage
             |(id                           // functions starting with an id:
                ( formula    PARSER[*1]     // Parsing formula
                | sequencer  GENERATOR[*1]  // Code generator
                | optimizer  ISO[*1]        // Optimizer
                | pseudo_op  PRODUCTION[*1] // Pseudo instruction
                | emitor_op  MACHOP[*1]     // Machine instruction
                )        // All the above start with an identifier
              \ (ERRORX["Syntax error."]
                 garbol);                    // skip over error.

// Note how id is factored off and later combined when creating the tree.

formula =   ("==" syntax  :BCKTRAK   // backtrack grammar formula
            |'='  syntax  :SYNTAX    // grammar formula.
            |':'  chclass :CLASS     // character class define
            |".." token   :TOKEN     // token formula
              )';' !2                // Combine node name with id 
                                     // parsed in calling declaration 
                                     // formula and tree produced
                                     // by the called syntax, token
                                     // or character class formula.
                $(-(.NL |"/*") (.ANY|.STOP)); Comment ; to line separator?

chclass = +[ letter $('|' letter) ]+;// a simple list of character codes
                                     // except 
letter  = char | number | id;        // when including another class

syntax  = seq ('|' alt1|'\' alt2 |--);

alt1    = seq:ALT!2 ('|' alt1|--);  Non-backtrack alternative sequence.

alt2    = seq:BKTK!2 ('\' alt2|--); backtrack alternative sequence

seq     = +[oper $oper]+;

oper    = test | action | '(' syntax ')' | comment; 

test    = string | id ('[' (arg_list| ,NILL) ']':GENCALL!2|.EMPTY);

action  = ':' id:NODE!1
        | '!' number:MAKTREE!1
        | "+["  seq "]+" :MAKLST!1;

//     C style comments
comment  = "//" $(-.NL .ANY)
         | "/*" $(-"*/" .ANY) "*/";

Of note is how the parser language handles commenting and error recovery.

I think I have answered the question. Having written a big part of SLICs successor, the cc language in itself here. There is no compiler for it as yet. But I can hand compile it into assembly code, naked asm c or c++ functions.

votes

GNAT, the GNU Ada compiler, requires an Ada compiler to be fully built. This can be a pain when porting it to a platform where there is no GNAT binary readily available.

votes

Actually, most compilers are written in the language they compile, for the reasons stated above.

The first bootstrap compiler is usually written in C, C++ or Assembly.

votes

The Mono project C# compiler has been "self-hosted" for a long time now, what it means is that it has been written in C# itself.

What I know is that the compiler was started as pure C code, but once the "basic" features of ECMA were implemented they started to rewrite the compiler in C#.

I'm not aware of the advantages of writing the compiler in the same language, but I'm sure it has to do at least with the features that the language itself can offer (C, for example, does not support object oriented programming).

You can find more information here.

votes

Yes, you can write a compiler for a language in that language. No, you don't need a first compiler for that language to bootstrap.

What you need to bootstrap is an implementation of the language. That can be either a compiler or an interpreter.

Historically, languages were usually thought of as either interpreted languages or compiled languages. Interpreters were only written for the former and compilers were only written for the latter. So usually if a compiler was going to be written for a language, the first compiler would be written in some other language to bootstrap it, then, optionally, the compiler would be re-written for the subject language. But writing an interpreter in another language instead is an option.

This isn't just theoretical. I happen to currently be doing this myself. I am working on a compiler for a language, Salmon, that I developed myself. I first created a Salmon compiler in C and now I'm writing the compiler in Salmon, so I can get the Salmon compiler working without ever having a compiler for Salmon written in any other language.

votes

Notice that technically you can write a compiler in a language that is still not there. In order to do this you create an interpreter, a subpar of original language, that is slow and useless in general as it interprets each statement of the language, before it executes anything.

It does look completely like the intended language, if you read it, but its execution goes over some process that is converting it into executable in more than one step.

This compiler is typically horribly slow, as it uses some generic mathematical procedure that is applicable to almost any existing language, but the advantage is that you do nothing next time around except use the produced compiler over the existing code.

This time of course without interpreting it.

-1

votes

Maybe you can write a BNF describing BNF.