Thomas Denney

JIT Compilation on an ARM Cortex M0

For my bachelor's thesis last year I implemented a JIT compiler for the BBC micro:bit, a micro-controller that is being used around the world to teach programming in schools. My compiler took bytecode for a virtual machine designed by my supervisor, compiled it to ARM Thumb bytecode, and executed the generated bytecode. The main challenge here is getting the JIT compiler to store the original virtual machine bytecode, auxiliary data structures, compiled bytecode, and program data in 16 KB of RAM. Rather than replicate my thesis in full here I'm just going to run through the basic steps involved to JIT compile and execute functions on a Cortex M0; I assume familiarity with C++ and ARM Assembly and calling conventions.

At a minimum, a JIT code generators needs to perform three basic steps:

  1. Generate a sequence of instructions encoded in the processor's bytecode;
  2. Jump to that sequence of instructions, allowing the processor to execute them; and
  3. Return from the generated code to the environment.

The simplest way to get this working is to generate a sequence of instructions terminated by bx lr, which corresponds to a return in ARM assembly, and then to call that sequence of instructions as if it is a function:

typedef void (*f_void)(void);

static const uint16_t instrs[] = { 0x4770 };
asm("DSB");
asm("ISB");
f_void fp = (f_void)((unsigned)instrs | 0x1);
fp();

In this basic example our generated code is a compile time constant, but this example adapts to dynamically allocated arrays too. 0x4770 is the hexadecimal encoding of bx lr, the ARM instruction that corresponds to branching to the address in the link register, which will be the return address:

Note that the Cortex M0 executes Thumb bytecode, which is a compact version of the full ARM 32-bit instruction set; most instructions are represented in 2 bytes. The most significant 9 bits of this instruction encode that this is a branch and exchange instruction and the subsequent 4 bits encode the register that contains the address to jump to (the link register is register r14). The last three bits should be zero.

Next, two inline assembly instructions are issued. DSB is a data synchronisation barrier, and ensures that no instruction in program executes until after this instruction completes, i.e. the instruction ensures that all memory operations complete. Next, ISB is an instruction synchronisation barrier, and flushes the fetch pipeline of the processor. In this case, where our JIT'd code is actually in a static array, these instructions are not necessary. However, when generating instructions it is necessary to ensure that (a) our generated instructions are fully written to memory, and (b) the fetch pipeline doesn't attempt to fetch the old value. These instructions must be issued across all ARM architectures (not just the Cortex M0) when running JIT'd code.

Memory protection typically prevents memory being both writeable and executable, so in JITs on other OSes it would generally be necessary to prevent writing and only permit executing the code. No such protection exists on the Cortex M0, so only DSB and ISB must be executed.

Next, we cast the pointer to the instructions sequence to a pointer to a function pointer so we can call the code. Our array will be 2-byte aligned (which is required for Thumb instructions), but when branching the LSB of the address we jump to must be 1 to indicate that the instruction we are jumping to is encoded in Thumb bytecode rather than full 4-byte ARM bytecode.

Supposing we have access to a memory allocator, we can adapt the above to dynamically generate the same function at runtime:

uint16_t* instrs = (uint16_t*)calloc(1, sizeof(uint16_t));

instrs[0] = 0x4770;
asm("DSB");
asm("ISB");
f_void fp = (f_void)((unsigned)instrs | 0x1);
fp();

free(instrs);

This function isn't particularly interesting: it has no observable side-effects and doesn't return anything. As a toy example, the following function JIT-compiles a function that sums between 0 and 4 integer values. The first four integer arguments are always passed in registers under ARM calling convention, and the result of an integer function is returned in r0.

typedef int (*f_4_int_to_int)(int, int, int, int);

int jit_int_sum(unsigned int n)
{
    std::vector<uint16_t> instrs;
    if (n == 0) {
        // MOV   Rd  Value
        // 00100 000 00000000
        // 2   0     0   0
        instrs.push_back(0x2000);
    } else {
        for (unsigned k = 1; k != n; ++k) {
            // Rd := Rn + Rm
            // Add     Rm  Rn  Rd
            // 0001100 000 xxx 000
            // 1   8    0    0
            instrs.push_back(0x1800 | (k << 3));
        }
    }
    instrs.push_back(0x4770); // Return

    asm("DSB");
    asm("ISB");
    auto fp = (f_4_int_to_int)((unsigned)instrs.data() | 0x1);
    return fp(1, 2, 3, 4);
}

Note that this time I've used a C++ vector to simplify the memory management. In this example we either clear the return register or sum up to three remaining registers into the first register. For simplicity, the function pointer is always cast to the same type and called with the same number of arguments, but the return value of the function will differ depending on the value of n.

Encoding instructions manually in hexadecimal (or binary) isn't a particularly good use of your time. In my project I implemented a library for all Thumb instructions, and eventually I also implemented a decoder to help with debugging. Helpfully the decoder (printFunction) can take a pointer to an arbitrary ARM function, allowing runtime bytecode -> assembly conversion.

The next challenge is dealing with branches. In a traditional compiler this can be delegated to a linker, but in a JIT where all code bytecode must be generated it is necessary to deal with them yourself. The conditional branch is encoded as:

The condition is a 4-bit value and is used in conjunction with the condition flags of the CPU to determine if the branch is taken. I specify the values of these conditions in my instruction encoder here. The immediate value is a signed (two's complement) 8-bit integer. If the condition holds, the value is sign extended to a 32-bit value, doubled, and added to the current program counter, so that the branch supports offsets from -256 bytes to +254 bytes. However, it is important to note that the program counter, at the time the branch instruction begins, is the address of the branch + 4 bytes. Therefore if the conditional branch needs to skip forward by 2 instructions, then the offset should be 0. If the branch needs to skip back 10 instructions, then the offset should be -12. The same encoding scheme is necessary for unconditional branches and bl/blx instructions (used for functional calls), although these both admit 11-bit offsets.

The examples above are able to freely use registers because they conform to the basics of ARM calling convention; dealing with function calls and access to other registers only requires that you follow this convention in generated code.

The full code of microjit is on GitHub. I've also uploaded a barebones example showing the examples from this post executing.