You can take a look at an existing assembler, such as PASMO, an assembler for the Z80 CPU, and get ideas from it. Here it is:
http://pasmo.speccy.org/
I've written a couple of very simple assemblers, both of them using string manipulation with strtok() and the like. For a grammar as simple as assembly language, that's enough. The key pieces of my assemblers are:
A symbol table: just an array of structs, each holding the name of a symbol and its value.
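    typedef struct
    {
        char nombre[256];   /* symbol name (up to 255 characters plus the terminator) */
        u8 valor;           /* symbol value; u8 is presumably an unsigned 8-bit type defined elsewhere */
    } TSymbol;

    TSymbol tablasim[MAXTABLA];   /* the symbol table */
    int maxsim = 0;               /* number of symbols stored so far */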
A symbol is just a name with an associated value. This value can be the current position (the address where the next instruction will be assembled), or it can be an explicit value assigned by the EQU pseudoinstruction.
Symbol names in this implementation are limited to 255 characters each, and one source file is limited to MAXTABLA symbols.
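To make the symbol table concrete, here is a minimal sketch of how symbols could be added and looked up. The helper names busca_simbolo() and nuevo_simbolo(), the value of MAXTABLA, and the definition of u8 are assumptions for illustration and are not taken from my assemblers. Names are compared case-sensitively here, which is also just a choice made for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAXTABLA 256            /* assumed capacity; the real value is not shown above */

typedef uint8_t u8;             /* assumed definition of u8 */

typedef struct
{
    char nombre[256];           /* symbol name, NUL-terminated */
    u8 valor;                   /* symbol value */
} TSymbol;

TSymbol tablasim[MAXTABLA];
int maxsim = 0;

/* Hypothetical helper: returns the index of a symbol, or -1 if it is not in the table. */
int busca_simbolo(const char *nombre)
{
    assert(nombre != NULL);
    for (int i = 0; i < maxsim; i++)
        if (strcmp(tablasim[i].nombre, nombre) == 0)
            return i;
    return -1;
}

/* Hypothetical helper: adds a symbol and returns its index.
   Returns -1 if the table is full, the name is too long, or the name already exists. */
int nuevo_simbolo(const char *nombre, u8 valor)
{
    assert(nombre != NULL);
    if (maxsim >= MAXTABLA || strlen(nombre) >= sizeof tablasim[0].nombre)
        return -1;
    if (busca_simbolo(nombre) != -1)
        return -1;
    strcpy(tablasim[maxsim].nombre, nombre);   /* safe: length checked above */
    tablasim[maxsim].valor = valor;
    return maxsim++;
}
```

A linear search is enough for small source files; a hash table would only start to matter with thousands of symbols.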
I make two passes over the source code:
The first pass identifies symbols and stores them in the symbol table, checking whether each one is followed by an EQU instruction. If it is, the value after EQU is parsed and assigned to the symbol. Otherwise, the symbol gets the value of the current position. To keep the current position up to date, I have to detect whether there is a valid instruction (although I do not assemble it yet) and advance the position accordingly (this is easy for me because my CPU has a fixed instruction size).
Here is a sample of my code that updates the symbol table with either a value from EQU or the current position, and advances the current position if needed.
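    case 1:   /* a symbol name has just been read; token is what follows it */
        if (es_equ (token))
        {
            /* symbol EQU value: the rest of the line is the value */
            token = strtok (NULL, "\n");
            tablasim[maxsim].valor = parse_numero (token, &err);
            if (err)
            {
                if (err==1)
                    /* "Syntax error at line %d" */
                    fprintf (stderr, "Error de sintaxis en linea %d\n", nlinea);
                else if (err==2)
                    /* "Symbol [%s] not found at line %d" */
                    fprintf (stderr, "Simbolo [%s] no encontrado en linea %d\n", token, nlinea);
                estado = 2;   /* error state */
            }
            else
            {
                maxsim++;
                token = NULL;
                estado = 0;
            }
        }
        else
        {
            /* plain label: its value is the current position */
            tablasim[maxsim].valor = pcounter;
            maxsim++;
            if (es_instruccion (token))
                pcounter++;   /* fixed instruction size, so just advance by one */
            token = NULL;
            estado = 0;
        }
        break;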
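The snippet above calls parse_numero() without showing it. What follows is only a sketch of what such a function could look like, not the real one: it assumes operands are decimal numbers, hexadecimal numbers with a 0x prefix, or symbol names, and it reports an out-of-range number as a syntax error. The error codes (1 for a syntax error, 2 for an unknown symbol) are the ones the snippet checks for; MAXTABLA's value and the u8 definition are assumed.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAXTABLA 256                       /* assumed capacity */
typedef uint8_t u8;                        /* assumed definition of u8 */
typedef struct { char nombre[256]; u8 valor; } TSymbol;

TSymbol tablasim[MAXTABLA];
int maxsim = 0;

/* Sketch: parses a number or a symbol name.
   *err is 0 on success, 1 on a syntax error, 2 if the symbol is unknown. */
u8 parse_numero(const char *token, int *err)
{
    assert(err != NULL);
    *err = 1;                              /* assume a syntax error until proven otherwise */
    if (token == NULL)
        return 0;
    while (isspace((unsigned char)*token)) /* skip leading blanks */
        token++;
    if (*token == '\0')
        return 0;

    if (isdigit((unsigned char)*token))
    {
        /* assumed number syntax: decimal, or hexadecimal with a 0x prefix */
        int base = (token[0] == '0' && (token[1] == 'x' || token[1] == 'X')) ? 16 : 10;
        char *fin;
        unsigned long v = strtoul(token, &fin, base);
        while (isspace((unsigned char)*fin))
            fin++;
        if (*fin != '\0' || v > 0xFF)      /* trailing garbage, or does not fit in a u8 */
            return 0;
        *err = 0;
        return (u8)v;
    }

    /* otherwise the operand must be a single word naming a symbol */
    size_t n = strcspn(token, " \t\r\n");
    if (token[n + strspn(token + n, " \t\r\n")] != '\0')
        return 0;                          /* more than one word */
    for (int i = 0; i < maxsim; i++)
        if (strlen(tablasim[i].nombre) == n && strncmp(tablasim[i].nombre, token, n) == 0)
        {
            *err = 0;
            return tablasim[i].valor;
        }
    *err = 2;                              /* well-formed name, but not in the table */
    return 0;
}
```

Because symbols are resolved against whatever is already in the table, an EQU that refers to a symbol defined later in the file would fail in the first pass with this approach.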
The second pass is where I actually assemble instructions, replacing a symbol with its value when I find one. It's rather simple, using strtok() to split a line into its components, and using strncasecmp() to compare what I find with instruction mnemonics.
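As an illustration of that second pass, here is a rough sketch of splitting one line with strtok() and matching mnemonics with strncasecmp(). The instruction set, the opcodes, and the names TMnemonic, busca_mnemonico() and ensambla_linea() are all made up for the example; a real assembler would use its own CPU's table.

```c
#include <assert.h>
#include <string.h>
#include <strings.h>   /* strncasecmp() is POSIX, not ISO C */

typedef struct { const char *nombre; int opcode; } TMnemonic;

/* Made-up instruction set, for illustration only. */
static const TMnemonic mnemonicos[] = {
    { "NOP", 0x00 }, { "LDA", 0x01 }, { "ADD", 0x02 }, { "JMP", 0x03 },
};

/* Returns the opcode for a mnemonic (ignoring case), or -1 if there is no match. */
int busca_mnemonico(const char *token)
{
    size_t n = sizeof mnemonicos / sizeof mnemonicos[0];
    for (size_t i = 0; i < n; i++)
    {
        size_t len = strlen(mnemonicos[i].nombre);
        /* the length check keeps "JMPX" from matching "JMP" */
        if (strlen(token) == len && strncasecmp(token, mnemonicos[i].nombre, len) == 0)
            return mnemonicos[i].opcode;
    }
    return -1;
}

/* Splits one source line in place (strtok() modifies it).
   Returns the opcode of the first mnemonic found, or -1 if the line has none.
   *operando is set to the token after the mnemonic, or NULL if there is none. */
int ensambla_linea(char *linea, char **operando)
{
    static const char sep[] = " \t\r\n";
    assert(linea != NULL && operando != NULL);
    *operando = NULL;
    for (char *token = strtok(linea, sep); token != NULL; token = strtok(NULL, sep))
    {
        int opcode = busca_mnemonico(token);
        if (opcode != -1)
        {
            *operando = strtok(NULL, sep);
            return opcode;
        }
        /* not a mnemonic: a label or EQU line, already handled in the first pass */
    }
    return -1;
}
```

The operand string returned here is where symbol substitution would happen: it is either a literal number or a name to look up in the symbol table built during the first pass.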