Computed goto for efficient dispatch tables

Recently, while idly browsing through the source code of Python, I came upon an interesting comment in the bytecode VM implementation (Python/ceval.c) about using the computed gotos extension of GCC [1]. Driven by curiosity, I decided to code a simple example to evaluate the difference between using a computed goto and a traditional switch statement for a simple VM. This post is a summary of my findings.

Defining a simple bytecode VM
First let’s make clear what I mean by a “VM” in this context – a Bytecode Interpreter. Simply put, it’s a loop that chugs through a sequence of instructions, executing them one by one.

Using Python’s 2000-line strong (a bunch of supporting macros not included) PyEval_EvalFrameEx as an example wouldn’t be very educational. Therefore, I’ll define a tiny VM whose only state is an integer and has a few instructions for manipulating it. While simplistic, the general structure of this VM is very similar to real-world VMs. This VM is so basic that the best way to explain it is just to show its implementation:

#define OP_HALT     0x0
#define OP_INC      0x1
#define OP_DEC      0x2
#define OP_MUL2     0x3
#define OP_DIV2     0x4
#define OP_ADD7     0x5
#define OP_NEG      0x6

int interp_switch(unsigned char* code, int initval) {
    int pc = 0;
    int val = initval;

    while (1) {
        switch (code[pc++]) {
            case OP_HALT:
                return val;
            case OP_INC:
                val++;
                break;
            case OP_DEC:
                val--;
                break;
            case OP_MUL2:
                val *= 2;
                break;
            case OP_DIV2:
                val /= 2;
                break;
            case OP_ADD7:
                val += 7;
                break;
            case OP_NEG:
                val = -val;
                break;
            default:
                return val;
        }
    }
}

Note that this is perfectly “standard” C. An endless loop goes through the instruction stream and a switch statement chooses what to do based on the instruction opcode. In this example the control is always linear (pc only advances by 1 between instructions), but it would not be hard to extend this with flow-control instructions that modify pc in less trivial ways.

The switch statement should be implemented very efficiently by C compilers – the condition serves as an offset into a lookup table that says where to jump next. However, it turns out that there’s a popular GCC extension that allows the compiler to generate even faster code.

Computed gotos
I will cover the details of computed gotos very briefly. For more information, turn to the GCC docs or Google.

The computed goto is basically a combination of two new features for C. The first is taking the address of a label into a void*.

void* labeladdr = &&somelabel;
somelabel:
// code
The second is invoking goto on a variable expression instead of a compile-time-known label, i.e.:

void* table[]; // addresses
goto *table[pc];
As we’ll see shortly, these two features, when combined, can facilitate an interesting alternative implementation of the main VM loop.

To anyone with a bit of experience with assembly language programming, the computed goto immediately makes sense because it just exposes a common instruction that most modern CPU architectures have – jump through a register (aka. indirect jump).

The simple VM implemented with a computed goto
Here’s the same VM, this time implemented using a computed goto [2]:

int interp_cgoto(unsigned char* code, int initval) {
    /* The indices of labels in the dispatch_table are the relevant opcodes */
    static void* dispatch_table[] = {
        &&do_halt, &&do_inc, &&do_dec, &&do_mul2,
        &&do_div2, &&do_add7, &&do_neg};
    #define DISPATCH() goto *dispatch_table[code[pc++]]

    int pc = 0;
    int val = initval;

    DISPATCH();
    while (1) {
        do_halt:
            return val;
        do_inc:
            val++;
            DISPATCH();
        do_dec:
            val--;
            DISPATCH();
        do_mul2:
            val *= 2;
            DISPATCH();
        do_div2:
            val /= 2;
            DISPATCH();
        do_add7:
            val += 7;
            DISPATCH();
        do_neg:
            val = -val;
            DISPATCH();
    }
}

Benchmarking
I did some simple benchmarking with random opcodes and the goto version is 25% faster than the switch version. This, naturally, depends on the data and so the results can differ for real-world programs.

Comments inside the CPython implementation note that using computed goto made the Python VM 15-20% faster, which is also consistent with other numbers I’ve seen mentioned online.

Why is it faster?
Further down in the post you’ll find two “bonus” sections that contain annotated disassembly of the two functions shown above, compiled at the -O3 optimization level with GCC. It’s there for the real low-level buffs among my readers, and as a future reference for myself. Here I aim to explain why the computed goto code is faster at a bit of a higher level, so if you feel there are not enough details, go over the disassembly in the bonus sections.

The computed goto version is faster because of two reasons:

The switch does a bit more per iteration because of bounds checking.
The effects of hardware branch prediction.

Doing less per iteration
If you examine the disassembly of the switch version, you’ll see that it does the following per opcode:

Execute the operation itself (i.e. val *= 2 for OP_MUL2)
pc++
Check the contents of code[pc]. If within bounds (<= 6), proceed. Otherwise return from the function.
Jump through the jump table based on offset computed from code[pc].

On the other hand, the computed goto version does this:

Execute the operation itself
pc++
Jump through the jump table based on offset computed from code[pc].

The difference between the two is obviously the "bounds check" step of the switch. Why is it required? You may think that this is because of the default clause, but that isn't true. Even without the default clause, the compiler is forced to generate the bounds check for the switch statement to conform to the C standard. Quoting from C99:

If no converted case constant expression matches and there is no default label, no part of the switch body is executed.

Therefore, the standard forces the compiler to generate "safe" code for the switch. Safety, as usual, has cost, so the switch version ends up doing a bit more per loop iteration.

Branch prediction
Modern CPUs have deep instruction pipelines and go to great lengths ensuring that the pipelines stay as full as possible. One thing that can ruin a pipeline's day is a branch, which is why branch predictors exist. Put simply (read the linked Wikipedia article for more details), it's an algorithm used by the CPU to try to predict in advance whether a branch will be taken or not. Since a CPU can easily pre-fetch instructions from the branch's target, successful prediction can make the pre-fetched instructions valid and there is no need to fully flush the pipeline. This is where the two VM versions differ: in the switch version there is a single indirect jump through which every opcode is dispatched, so the predictor has only one branch (and one history) to work with for the whole instruction stream. In the computed goto version, each handler ends with its own indirect jump, giving the predictor a separate context per opcode, which lets it pick up correlations between successive instructions and predict the dispatch more accurately.
