4
votes

I am currently trying to implement a disassembler for the ARM cortex A9, which implement the ARMv7 instruction set.

For that I am using the manual "DDI0406C_b_arm_architecture_reference_manual.pdf" that can be download here (after having registered on arm website) :

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.set.architecture/index.html

In this manual, in the part A8.8 with the instructions details, I can't understand why there is several encoding for one instruction (like A1, A2, ...), that all seem to be implemented with ARMv7.

Also, as the ARM cortex A9 used thumb-2, does it also implement the A1/A2/... encodings, or only the T1/T2... ?

I really read all parts of this manual are related to encodings, but I still don't understand how we can know which encoding is used for a program.

3
Thank you very much everybody, all yours answers are really useful. I did not expected to have answers so fast.user2299676

3 Answers

3
votes

Different encoding of an instruction do functionally different things.

One example for usage of different encodings is A8.9.12 ADR

This instruction adds an immediate value to the PC value to form a PC-relative address, and writes the result to the destination register.

If instruction is encoded as A1 then offset must be interpreted as zero or positive, if it is encoded as A2 then offset is negative.

Another example is A8.8.132 POP

If the list contains more than one register, the instruction is assembled to encoding A1. If the list contains exactly one register, the instruction is assembled to encoding A2.

I can imagine different POP encodings are created probably to create different microcodes for performance reasons.

For the second part of your question, Cortex-A9 is an ARMv7-A architecture CPU and it supports all the instructions as specified in the manual you pointed. May be you should also read Cortex™-A9 Technical Reference Manual.

1
votes

There is no way to really distinguish between ARM and Thumb inside the instruction-stream. You can only decide based on the way a function gets called (if the lowest bit is set to 1 then it's thumb, otherwise arm).

The ARM-Encoding are quite "stable", you'll only find a few A1 encodings, BLX is an example where a A2 encoding is given, but this is mainly because the new ARM-ARM is structured differently from the older ones. BL and BLX were two different instructions, BLX was added in additional instruction space (the upper 4 bits which are normaly used for conditions are set to 1111, which in ARM prior to v5 meant "never execute".

For the Thumb-Encodings it's different, there are a lot of them, because they had to be placed in a more compressed instruction-space, page A6-220 has information about how to decide whatever an thumb instruction consists of two or just one halfword.

1
votes

The Ax encodings are arm when the processor is in arm mode it will decode bits it finds using those encodings. if there is more than one A1, A2, it should be obvious that there is a different feature or reason for that. those two instructions can be considered separate (look at the overuse of the mov instruction in x86 for example, it has many encodings). Treat each encoding as a separate "instruction".

Then there are the Tx variants, those are thumb and thumb2 extensions. The thumb are all 16 bit (the bl can be decoded as two separate 16 bit instructions) and the descriptions below them indicate "all thumb variants" or "armv4t to the present" or some such language. The thumb2 extensions are all 32 bit, the first 16 bits being an undefined instruction in the thumb world. These have more limitations on what architectures support them.

You are not going to be able to completely create a disassembler for one of these processors, for the same reason you cant do one for x86 or many other processors (all?). If you assume that all the instructions are one mode (arm or thumb or thumb+thumb2) but no mode mixing (arm+thumb) then you can because everything is fixed instruction length and you can simply disassemble everything data and code and you wont run into any problems. In order to disassemble mixed mode you have to basically emulate/execute the instructions and follow instruction flow (just like a variable word length instruction set disassembler) to try to find the transitions, problem here of course the transitions are multi instruction at a minimum load a register then bx that register, sometimes there is math involved on the instruction computation, and there is no guarantee that the address computation or load happens the instruction before the bx. So you could do some of that and get a long way through disassembling the program.

If thumb2 is supported/allowed on the processor you are using then you have the variable instruction length problem for times that you detect entry points to thumb code. And unless you already are doing this you have to follow execution of the code to determine where instructions start (elementary variable instruction length disassembly stuff).

The combination of technical reference manual and architectural reference manual will tell you if the architecture and implementation of that architecture (trm) allow arm and thumb modes. I would assume an A9 supports arm thumb and thumb2, all three.

The cortex-m family is the only one so far that is limited to not supporting arm, and their thumb2 varies widely as the cortex-m0 (and m1) are armv6m and the m3 and m4 are armv7m (A few dozen (armv6m) instructions to many dozen thumb2 extensions in the armv7m). There are separate architectural reference manuals specifically for the -m variants, armv7-m vs the armv7-ar manuals for example.