Sounds like you mostly have it, but there may be a missing link you are overlooking.
First off, assembly language is defined by the assembler, the program, not the target. So there could potentially be as many different MIPS assembly languages as there are folks willing to write assemblers. There aren't, fortunately, but there are some variations. Most of the places they vary are not the mnemonics/instructions, and in the case of MIPS that includes the pseudo instruction la. But as shown in the comments, things like %hi and %lo and .asciiz are the kind of thing that don't necessarily span all assemblers for MIPS, nor do they need to so long as la is supported. The $a0, $v0 register names are not required either.
A pseudo instruction in this case means that the assembler replaces it with real instructions. The assembler's job is to make real instructions, machine instructions/code, or do the best it can. A toolchain will ideally include a compiler, assembler and linker: the C compiler turns the C code into assembly, the assembler turns that into an object, and the linker takes one or more objects and links them into a binary that has ideally resolved all externals (labels).
Different instruction sets have different features/rules. Some specifically talk about addressing modes, some do not. But addresses drive some percentage of the work. When you write C code, the names of functions and, initially, the names of variables become labels, and labels are addresses. Now an optimizer may remove the memory location for some of these things, and then the label and its address go away, but if it does not, they are an address. So when you have a call to a function, the address needs to be figured out by the toolchain at build time (there are relocation exceptions, but in those cases the toolchain still figured it out relative to a base address that the relocation code has to patch for the toolchain's output to work).
Sometimes the addresses are pc-relative. The program counter is an internal register (or these days a set of registers) that keeps track of where the program is. As a programmer reading some listing:
00000000 <.text>:
0: e3a01001 mov r1, #1
4: e3a02004 mov r2, #4
8: e0813002 add r3, r1, r2
(this is intentionally not MIPS)
So as programmers we think that at address 0 there is an instruction mov r1,#1 and we then think the program counter is related to that address 0. In some instruction sets the program counter is a register we can access directly as a named register, in others you cannot access it directly but perhaps indirectly with a special instruction, and in some you cannot get at it with an instruction at all, but you can still have pc-relative addressing in some form or fashion.
As you have seen with MIPS, it is not uncommon for there to be a limited number of bits available in particular instructions for immediates, constants within the instruction that provide a value to the instruction as a number. As above, the last so many bits of the first two instructions, 1 and 4, are related to the values in the mov. But with MIPS being a fixed length instruction set at 32 bits, you can't have a 32 bit constant and also have opcode bits. So you have to find some solution to deal with loading constants.
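To make the "last so many bits" remark concrete, here is a minimal Python sketch that pulls the constant back out of those ARM mov encodings. It assumes the classic 32-bit ARM data-processing immediate form, where the low 8 bits are a value and the next 4 bits are a rotate-right amount divided by two. The function name is made up for illustration.

```python
def decode_arm_mov_immediate(word):
    """Decode the constant in a 32-bit ARM data-processing immediate.

    Assumes the immediate form of the instruction (as in mov rX, #const):
    bits 7..0 hold an 8-bit value, bits 11..8 hold a rotate count that is
    doubled and applied as a rotate right within 32 bits.
    """
    imm8 = word & 0xFF
    rotate = ((word >> 8) & 0xF) * 2
    # Rotate right within a 32-bit word.
    return ((imm8 >> rotate) | (imm8 << (32 - rotate))) & 0xFFFFFFFF


for word in (0xE3A01001, 0xE3A02004, 0xE3A00281):
    print(hex(word), "->", hex(decode_arm_mov_immediate(word)))
```

The first two words are the mov r1,#1 and mov r2,#4 from the listing above. The third is the e3a00281 encoding that shows up in the compiled ARM output further down, and it decodes to 0x10000008 because 0x81 rotated right by 4 bits lands there.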
Some instruction sets are variable length, meaning they might have a one byte long instruction, think x86. Others are fixed length, think MIPS, ARM, risc-v, although all three of those have different sized instructions and different ways to use the different sized instructions, but their core instruction sets are/were fixed 32 bit instructions. What you would end up with in many of the variable length instruction sets is this: say the address was 0x12345678 as the toolchain, likely the linker at this point, figured out where things were being placed. Let's say GG and JJ are the opcode bytes for some instruction to load a constant into a specific register. At this point this is simply a constant, it is no longer an address, we just need those bits in the register:
0xGG 0xJJ 0x12 0x34 0x56 0x78
might be that instruction.
Other instruction sets will try to find what is sometimes called a pool and place the constants nearby. You will often see this with fixed length instruction sets, and depending on the instruction set you can sometimes code it yourself.
ldr r0,=labelname
nop
b somewhere
is technically an assembler (not target) specific pseudo instruction for a particular instruction set. The assembler sees that there is an unconditional branch, which means that unless the programmer is doing something hacky, you can't execute the byte(s) after that branch, so that is a safe place to put data. And let's state that this label labelname is external, it is not found in the code being assembled at this time into this object. So the toolchain is going to have to fill it in later; the assembler will take all of this information and at assemble time provide a place where the linker can fill in the address once known:
00000000 <.text>:
0: e59f0004 ldr r0, [pc, #4] ; c <.text+0xc>
4: e1a00000 nop ; (mov r0, r0)
8: eafffffe b 0 <somewhere>
c: 00000000 andeq r0, r0, r0
That is the disassembly of the OBJECT, which is not linked and, at least for disassembling purposes, uses a base address of zero; once linked this code would most likely not live at address zero. But at address/offset 0xC there are zeros that once linked will be filled in by an address, and a pc-relative addressing mode is used, which means at the time this instruction is executed math is done on the program counter to produce an address, that address is read, and the contents of that address are used, in this case to be put in the general purpose register r0. (Most instruction sets don't have an always zero register like MIPS and risc-v, which was heavily influenced by MIPS, so r0 here is a general purpose register, not the always zero register.) How that math works for this instruction set such that 4 is the right value is a longer discussion.
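The short version of that math, as a hedged Python sketch: in classic 32-bit ARM instructions the program counter reads as the address of the current instruction plus 8, so an ldr at address 0 with an offset of 4 lands on 0xC. The function name is made up for illustration, and the sketch only handles the plain immediate-offset form of ldr shown in the listing.

```python
def arm_ldr_literal_address(instr_addr, word):
    """Compute the address read by a 32-bit ARM 'ldr rX, [pc, #imm]'.

    Assumes the classic ARM (A32) encoding: bits 19..16 are the base
    register (15 means pc), bit 23 says add or subtract, bits 11..0 are
    the offset, and pc reads as the instruction address plus 8.
    """
    base_reg = (word >> 16) & 0xF
    if base_reg != 15:
        raise ValueError("not a pc-relative load")
    offset = word & 0xFFF
    add = (word >> 23) & 1
    pc_value = instr_addr + 8
    return pc_value + offset if add else pc_value - offset


# The ldr at address 0 in the listing above is encoded as e59f0004.
print(hex(arm_ldr_literal_address(0x0, 0xE59F0004)))  # prints 0xc
```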
It is not the simulator that turns la into one or more instructions, it is the assembler. The simulator you are using first has to assemble the code into machine code, then it can simulate those instructions. Be it a simulator or a real processor (okay, sure, someone could create one that doesn't make machine code out of it but just parses and simulates from the assembly language, fine, but in general) this is the case.
As you have figured out, the MIPS solution for general constants is that there is an instruction that can load the upper half of the register and make the lower half zeros, then you can use ori or add to change the lower half of the register, as a pair of instructions.
la $2,0x12345678
la $2,0x12340000
la $2,0x00005678
la $2,0x10000008
If I use a/the gnu cross assembler (part of binutils)(relatively easy to come by for the major operating systems):
mips-elf-as so.s -o so.o
mips-elf-objdump -D so.o
gives
Disassembly of section .text:
00000000 <.text>:
0: 3c021234 lui $2,0x1234
4: 34425678 ori $2,$2,0x5678
8: 3c021234 lui $2,0x1234
c: 24025678 li $2,22136
10: 3c021000 lui $2,0x1000
14: 34420008 ori $2,$2,0x8
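The 0x12345678 case, where every nibble is non-zero, took two instructions as expected. The 0x12340000 took one. The 0x00005678 (22136, why do disassemblers do this? who knows) is one instruction; note it is neither lui nor ori. The disassembler shows it as li, which is itself a pseudo instruction; the encoding 0x24025678 is addiu $2,$0,0x5678, an add against the always zero register. And the 0x10000008 took two, also as expected.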
Also note this assembler did not use the scratch register. Also note that this assembler optimized those pseudo instructions into a mixture of solutions. It did try to use one instruction where possible, but it didn't have to; there isn't a rule. The assembler could have always encoded an lui followed by an ori or add, and it could have used a second scratch register or not. Your research found the use of another register as a solution.
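To show the kind of decision the assembler is making, here is a rough Python sketch that reproduces the encodings in the listing above. It is only modeled on what this one assembler emitted for these four constants, it is not how binutils is actually written, and the branch for a lower half of 0x8000 or more with a zero upper half is my own assumption (ori against $0) since that case was not tested above. All function names are made up.

```python
def lui(rt, imm):
    """Encode lui rt, imm (opcode 0x0F)."""
    return 0x3C000000 | (rt << 16) | (imm & 0xFFFF)


def ori(rt, rs, imm):
    """Encode ori rt, rs, imm (opcode 0x0D)."""
    return 0x34000000 | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def addiu(rt, rs, imm):
    """Encode addiu rt, rs, imm (opcode 0x09)."""
    return 0x24000000 | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def expand_la(rt, value):
    """Return the machine words for 'la rt, value' with a known constant."""
    value &= 0xFFFFFFFF
    upper, lower = value >> 16, value & 0xFFFF
    if upper == 0 and lower < 0x8000:
        # Fits a sign-extended 16-bit immediate: one addiu against $0.
        return [addiu(rt, 0, lower)]
    if upper == 0:
        # Assumption: ori zero-extends, so it covers 0x8000..0xFFFF.
        return [ori(rt, 0, lower)]
    if lower == 0:
        # lui already zeros the lower half.
        return [lui(rt, upper)]
    return [lui(rt, upper), ori(rt, rt, lower)]


for constant in (0x12345678, 0x12340000, 0x00005678, 0x10000008):
    print(hex(constant), [hex(w) for w in expand_la(2, constant)])
```

The point is only that the choice between one and two instructions depends on knowing the value at assembly time.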
Hopefully your brain is putting some of these things together. Okay, so if the address is external and not known until link time, then is it possible to optimize? And even worse, if it is possible to optimize, doesn't that change the number of instructions and thus the size of the object and thus the size of the program, making all the addresses that follow this instruction possibly 4 smaller, which every so often will take an address that got lucky, 0x12340000, and turn it into 0x1233FFFC so that it now takes two instructions instead of one? Yes, all of that can happen, but toolchains deal with it. Let's try. I feel it is very good to just know what you are looking at, and without having to run any code you can learn a bunch about the toolchain and the instruction set:
la $2,some_ext_label
Disassembly of section .text:
00000000 <.text>:
0: 3c020000 lui $2,0x0
4: 24420000 addiu $2,$2,0
At the object level the assembler sees this as an external label and cannot determine whether there is an optimization, so it pretty much needs to encode the basic two instructions. Note that the actual values are left as zeros; to complete the task it needs to put something there, so in this case it just puts zeros.
Now to link this I need an actual label, so:
.globl some_ext_label
add $3,$4,$5
some_ext_label:
add $3,$4,$5
add $4,$5,$6
build it, ignore the linker warning about _start:
mips-elf-as so.s -o so.o
mips-elf-as ex.s -o ex.o
mips-elf-ld -Ttext=0x1000 so.o ex.o -o so.elf
mips-elf-objdump -D so.elf
gives:
Disassembly of section .text:
00001000 <_ftext>:
1000: 3c020000 lui $2,0x0
1004: 2442100c addiu $2,$2,4108
1008: 00851820 add $3,$4,$5
0000100c <some_ext_label>:
100c: 00851820 add $3,$4,$5
1010: 00a62020 add $4,$5,$6
As the linker put the objects together starting at the specified address, the label some_ext_label landed at address 0x0000100C. Then the linker went back and, through object file information (communication between the tools), patched up the instructions that needed their external address resolved. And note that if we had used la with a constant 0x0000100C we know this assembler would have optimized it, but since the constant was not known until link time, after the assembler had finished and made an object, it would have been difficult to optimize that instruction out because of the effect that would have on all the other offsets and addresses across the binary.
It needed to be able to deal with full 32 bit values:
mips-elf-as so.s -o so.o
mips-elf-as ex.s -o ex.o
mips-elf-ld -Ttext=0x87654444 so.o ex.o -o so.elf
mips-elf-objdump -D so.elf
87654444 <_ftext>:
87654444: 3c028765 lui $2,0x8765
87654448: 24424450 addiu $2,$2,17488
8765444c: 00851820 add $3,$4,$5
87654450 <some_ext_label>:
87654450: 00851820 add $3,$4,$5
87654454: 00a62020 add $4,$5,$6
See how easy it is to examine this stuff without actually having to run code.
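What the linker did to the lui/addiu placeholder in the two linked listings above can be sketched in a few lines of Python. This is a simplified model, not the linker's code, and the function names are made up. One detail that is an addition on my part: addiu sign-extends its 16-bit immediate, so when the lower half is 0x8000 or more the upper half has to be bumped by one to compensate, which is what the conventional %hi/%lo split does.

```python
def split_hi_lo(address):
    """Split a 32-bit address into the halves for an lui + addiu pair.

    addiu sign-extends its immediate, so the upper half is rounded up
    whenever the lower half has bit 15 set.
    """
    address &= 0xFFFFFFFF
    lo = address & 0xFFFF
    hi = ((address + 0x8000) >> 16) & 0xFFFF
    return hi, lo


def patch_lui_addiu(lui_word, addiu_word, address):
    """Fill the 16-bit immediate fields of a zeroed lui/addiu placeholder."""
    hi, lo = split_hi_lo(address)
    return (lui_word & 0xFFFF0000) | hi, (addiu_word & 0xFFFF0000) | lo


# The placeholder from the object file, patched for both link addresses.
for target in (0x0000100C, 0x87654450):
    print(hex(target), [hex(w) for w in patch_lui_addiu(0x3C020000, 0x24420000, target)])
```

Note that the instruction count never changes here, only the immediate bits, which is why nothing else in the binary has to move.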
Note that even a local label might not get optimized:
la $3,hello
add $5,$6,$7
add $5,$6,$7
add $5,$6,$7
hello:
add $5,$6,$7
add $5,$6,$7
add $5,$6,$7
00000000 <hello-0x14>:
0: 3c030000 lui $3,0x0
4: 24630014 addiu $3,$3,20
8: 00c72820 add $5,$6,$7
c: 00c72820 add $5,$6,$7
10: 00c72820 add $5,$6,$7
00000014 <hello>:
14: 00c72820 add $5,$6,$7
18: 00c72820 add $5,$6,$7
1c: 00c72820 add $5,$6,$7
That is at the object level, and the linker is going to replace those bits anyway, so for whatever reason the assembler has put bits in that make it more confusing for the first time viewer:
mips-elf-ld -Ttext=0x12345678 so.o -o so.elf
mips-elf-objdump -D so.elf
Disassembly of section .text:
12345678 <_ftext>:
12345678: 3c031234 lui $3,0x1234
1234567c: 2463568c addiu $3,$3,22156
12345680: 00c72820 add $5,$6,$7
12345684: 00c72820 add $5,$6,$7
12345688: 00c72820 add $5,$6,$7
1234568c <hello>:
1234568c: 00c72820 add $5,$6,$7
12345690: 00c72820 add $5,$6,$7
12345694: 00c72820 add $5,$6,$7
The linker changed the 0x00000014 into the actual value once determined.
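A plausible reading of those bits, sketched in Python: the 0x14 the assembler left in the instruction acts as an offset from the start of the section, and the linker adds the address where the section finally lands. This is my assumption about what the listing shows, not something verified against the linker source, and the function names are made up.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits as a signed number, as addiu does."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def relocate_pair(lui_word, addiu_word, section_base):
    """Add section_base to the offset already stored in an lui/addiu pair."""
    # Recover the offset the assembler left in the immediate fields.
    offset = ((lui_word & 0xFFFF) << 16) + sign_extend_16(addiu_word)
    target = (section_base + offset) & 0xFFFFFFFF
    # Split again, compensating for the sign extension of addiu.
    hi = ((target + 0x8000) >> 16) & 0xFFFF
    lo = target & 0xFFFF
    return (lui_word & 0xFFFF0000) | hi, (addiu_word & 0xFFFF0000) | lo


# Object-level words for 'la $3,hello', linked with -Ttext=0x12345678.
print([hex(w) for w in relocate_pair(0x3C030000, 0x24630014, 0x12345678)])
```

With the section at 0x12345678 this gives 0x3c031234 and 0x2463568c, matching the linked listing.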
Yes, I am in no way trying to make a usable program that won't crash, it is up to the programmer ultimately to make sane programs. The tools are simply doing what I told them to do and I told them to take short instruction sequences that don't make much sense and don't terminate cleanly, etc, and just put them together. Even the four la instructions above, if COMPILED in a high level language:
unsigned int fun ( void )
{
unsigned int a;
a = 0x12345678;
a = 0x12340000;
a = 0x00005678;
a = 0x10000008;
return(a);
}
(optimized of course) gives
Disassembly of section .text:
00000000 <fun>:
0: 3c021000 lui $2,0x1000
4: 03e00008 jr $31
8: 24420008 addiu $2,$2,8
easier to read with arm:
Disassembly of section .text:
00000000 <fun>:
0: e3a00281 mov r0, #268435464 ; 0x10000008
4: e12fff1e bx lr
The compiler optimized out the other three operations as dead code. But assemblers generally as a rule do exactly what you told them to do. In the case of pseudo instructions as you are asking about, it is up to the assembler authors to choose to optimize, and well there are some assembly languages that are more vague than others, less explicit, that allow the assembler more room to choose the instructions. As we saw above the assembler did not optimize out those four instructions even though as programmers we see that each instruction overwrites bits we had just put in that register and the end result is 0x10000008.
MIPS is pretty explicit, but even in assembly language:
lui $2,0x1000
addiu $2,$2,8
jr $31
I asked for that, and without any command line arguments I get this:
00000000 <.text>:
0: 3c021000 lui $2,0x1000
4: 03e00008 jr $31
8: 24420008 addiu $2,$2,8
If I don't have the processor set for a branch shadow (branch delay slot), then I need to tell the assembler not to do that, or write code such that the assembler doesn't screw me over.
Also note that in this case the assembler chose to use lui + ori, while the compiler chose to use lui + addiu. Or actually, let's test the assembler:
la $2,0x10000008
jr $31
00000000 <.text>:
0: 3c021000 lui $2,0x1000
4: 03e00008 jr $31
8: 34420008 ori $2,$2,0x8
It was likely that two different individuals or teams did the port to MIPS.
I was going to go and show other instruction sets and how they can be vague in not necessarily giving you complete control over the exact instructions chosen, but that is perhaps just more of a tangent.
Assembly language is defined by the assembler. In this case, if you are using SPIM, that is an assembler, let's say a linker, and an instruction set simulator in one.
The assembler being the program that reads the text and turns it into machine code.
Having that job the assembler turns real and pseudo instructions into machine code. So it is the assembler at assembly time that turns la into the instruction pair if needed or a single instruction if the assembler was programmed to look for an optimization and chose a single instruction that functionally works.
Labels are addresses. When a label is used with la it is an absolute value, not a pc-relative value, so depending on the tool the assembler may or may not be able to resolve the address for this label, and may have to (or want to) leave a two instruction placeholder for the linker to fill in once the address is known.
This is perhaps the missing link in your understanding. Correct me if I am wrong, I have no problem deleting this answer if it is off track. But a label is an address and an address is ultimately just bits, so at the end of the day the difference between:
la $5,0x12345678
and
la $5,some_label
is when the tools know what the bit pattern is, whether they can optimize it into one instruction, and when they place the bits into the machine code so that it is complete and can be executed.
Addresses, floating point numbers, signed integers, unsigned integers, pointers, ascii characters: these are all simply bit patterns to the processor. These terms mean something to the programmer, but not to the processor and not to the machine code.
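As a quick illustration of that point, here is a small Python sketch that takes one 32-bit pattern and reads it back several ways using the standard struct module. The function name is made up, and big-endian byte order is just an arbitrary choice for the example.

```python
import struct


def interpretations(bits):
    """Show one 32-bit pattern as several programmer-level 'types'."""
    raw = struct.pack(">I", bits & 0xFFFFFFFF)  # the same four bytes each time
    return {
        "unsigned": struct.unpack(">I", raw)[0],
        "signed": struct.unpack(">i", raw)[0],
        "float": struct.unpack(">f", raw)[0],
        "bytes": raw,
    }


for pattern in (0xFFFFFFFC, 0x42280000, 0x41424344):
    print(hex(pattern), interpretations(pattern))
```

The bits never change; only the format code used to read them does, which is the programmer's view and not the processor's.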
The label becomes a bit pattern the bit pattern is encoded in the instruction. If there is an opportunity to optimize and the tool has been programmed to do it, then it may. If not programmed to do it, or the opportunity is not there or requires a significant amount of work/risk then it might not.
la and using an assembler (and linker) is that you don't have to hand code addresses. If you insist then it's up to you to use the proper values. Instead of $1 you can use the destination register directly, as in your "elsewhere" sample. – Jester