15

A Virtual Machine

Magicians protect their secrets not because the secrets are large and important, but because they are so small and trivial. The wonderful effects created on stage are often the result of a secret so absurd that the magician would be embarrassed to admit that that was how it was done.

Christopher Priest

We’ve spent a lot of time talking about how to represent a program as a sequence of bytecode instructions, but it feels like learning biology using only dead stuffed animals. We know what instructions are in theory, but we’ve never seen them living and breathing, so it’s hard to really understand what they do. It would be hard to write a compiler that outputs bytecode when we don’t have a good understanding of how that bytecode behaves.

So, before we go and build the front end of our new interpreter, we will begin with the back end — the virtual machine that executes instructions. It breathes life into the bytecode. Watching the instructions prance around gives us a clearer picture of how a compiler might translate the user’s source code into a series of them.

15 . 1 An Instruction Execution Machine

The “virtual machine” is one part of our interpreter’s internal architecture. You hand it a chunk of code — literally a Chunk — and it runs it. The code and data structures for the VM reside in a new module:

vm.h
create new file
#ifndef clox_vm_h
#define clox_vm_h

#include "chunk.h"

typedef struct {
  Chunk* chunk;
} VM;

void initVM();
void freeVM();

#endif

As usual, we start simple. The VM will gradually acquire a whole pile of state it needs to keep track of, so we define a struct now to stuff that all in. Currently, all we store is the chunk that it executes.

Like we do with most of the data structures we create, we also define functions to create and tear down a VM. Here’s the implementation:

vm.c
create new file
#include "common.h"
#include "vm.h"

VM vm; 

void initVM() {
}

void freeVM() {
}

OK, calling those functions “implementations” is a stretch. We don’t have any interesting state to initialize or free yet, so the functions are empty. Trust me, we’ll get there.

The slightly more interesting line here is that declaration of vm. This module is eventually going to have a slew of functions and it would be a chore to pass around a pointer to the VM to all of them. Instead, we declare a single global VM object. We only need one anyway, and this keeps the code in the book a little lighter on the page.

Before we start pumping fun code into our VM, let’s go ahead and wire it up to the interpreter’s main entrypoint:

int main(int argc, const char* argv[]) {
main.c
in main()
  initVM();

  Chunk chunk;

We spin up the VM when the interpreter first starts. Then when we’re about to exit, we wind it down:

  disassembleChunk(&chunk, "test chunk");
main.c
in main()
  freeVM();
  freeChunk(&chunk);

One last ceremonial obligation:

#include "debug.h"
main.c
#include "vm.h"

int main(int argc, const char* argv[]) {

Now when you run clox, it starts up the VM before it creates that hand-authored chunk from the last chapter. The VM is ready and waiting, so let’s teach it to do something.

15 . 1 . 1 Executing instructions

The VM springs into action when we command it to interpret a chunk of bytecode:

  disassembleChunk(&chunk, "test chunk");
main.c
in main()
  interpret(&chunk);
  freeVM();

This function is the main entrypoint into the VM. It’s declared like so:

void freeVM();
vm.h
add after freeVM()
InterpretResult interpret(Chunk* chunk);

#endif

The VMs runs the chunk and then responds with a value from this enum:

vm.h
typedef enum {
  INTERPRET_OK,
  INTERPRET_COMPILE_ERROR,
  INTERPRET_RUNTIME_ERROR
} InterpretResult;

We aren’t using the result yet, but when we have a compiler that reports static errors and a VM that detects runtime errors, the interpreter will use this to know how to set the exit code of the process.

We’re tiptoeing towards some actual implementation:

vm.c
add after freeVM()
InterpretResult interpret(Chunk* chunk) {
  vm.chunk = chunk;
  vm.ip = vm.chunk->code;

  InterpretResult result = run();
  return result;
}

First, we store the chunk being executed in the VM. Then we call run(), an internal helper function that actually runs the bytecode instructions. Between those two parts is an intriguing line. What is this ip business?

As the VM works its way through the bytecode, it keeps track of where it is — the location of the instruction currently being executed. We don’t use a local variable inside run() for this because eventually other functions will need to access it. Instead, we store it as a field in VM:

typedef struct {
  Chunk* chunk;
vm.h
  uint8_t* ip;
} VM;

Its type is a byte pointer. We use an actual real C pointer right into the middle of the bytecode array instead of something like an integer index because it’s faster to dereference a pointer than look up an element in an array by index.

The name “IP” is traditional, and — unlike many traditional names in CS — actually makes sense: it’s an instruction pointer. Almost every instruction set in the world, real and virtual, has a register or variable like this. The other common name for it is “PC” for “program counter”.

We initialize ip by pointing it at the first byte of code in the chunk. We haven’t executed that instruction yet, so ip points to the instruction about to be executed. This will be true during the entire time the VM is running: the IP always points to the next instruction, not the one currently being handled.

The real fun happens in run():

vm.c
add after freeVM()
static InterpretResult run() {
#define READ_BYTE() (*vm.ip++)

  for (;;) {
    uint8_t instruction;
    switch (instruction = READ_BYTE()) {
      case OP_RETURN: {
        return INTERPRET_OK;
      }
    }
  }

#undef READ_BYTE
}

This is the single most important function in all of clox, by far. When the interpreter executes a user’s program, it will spend something like 90% of its time inside run(). It is the beating heart of the VM.

Despite that dramatic intro, it’s conceptually pretty simple. You have an outer loop that goes and goes. Each turn through that loop, we read and execute a single bytecode instruction.

To process an instruction, we first figure out what kind of instruction we’re dealing with. The READ_BYTE macro reads the byte currently pointed at by ip and then advances the instruction pointer. The first byte of any instruction is the opcode. Given a numeric opcode, we need to get to the right C code that implements that instruction’s semantics. This process is called “decoding” or “dispatching” the instruction.

We to do that process for every single instruction, every single time one is executed, so this is the most performance critical part of the entire virtual machine. Programming language lore is filled with clever techniques to do bytecode dispatch efficiently, going all the way back to the early days of computers.

Alas, the fastest solutions require either non-standard extensions to C, or hand-written assembly code. For clox, we’ll keep it simple. Just like our disassembler, we have a single giant switch statement with a case for each opcode. The body of each case implements that opcode’s behavior.

So far, we only handle a single instruction, OP_RETURN, and the only thing it does is exit the loop entirely. Eventually, that instruction will be used to return from the current Lox function, but we don’t have functions yet, so we’ll repurpose it temporarily to end the execution.

Let’s go ahead and support our one other instruction:

    switch (instruction = READ_BYTE()) {
vm.c
in run()
      case OP_CONSTANT: {
        Value constant = READ_CONSTANT();
        printValue(constant);
        printf("\n");
        break;
      }
      case OP_RETURN: {

We don’t have enough machinery in place yet to do anything useful with a constant. For now, we’ll just print it out so we interpreter hackers can see what’s going on inside our VM. That call to printf() necessitates an include:

vm.c
add to top of file
#include <stdio.h>

#include "common.h"

We also have a new macro to define:

#define READ_BYTE() (*vm.ip++)
vm.c
in run()
#define READ_CONSTANT() (vm.chunk->constants.values[READ_BYTE()])

for (;;) {

READ_CONSTANT() reads the next byte from the bytecode, treats the resulting number as an index, and looks up the corresponding location in the chunk’s constant table. In later chapters, we’ll add a few more instructions with operands that refer to constants, so we’re setting up this helper macro now.

Like the previous READ_BYTE macro, READ_CONSTANT is only used inside run(). To make that scoping more explicit, the macro definitions themselves are confined to that function. We define them at the beginning and — because we care — undefine them at the end:

#undef READ_BYTE
vm.c
in run()
#undef READ_CONSTANT
}

If you run clox now, it executes the chunk we hand-authored in the last chapter and spits out 1.2 to your terminal. We can see that it’s working, but that’s only because our implementation of OP_CONSTANT has temporary code to log the value. Once that instruction is doing what it’s supposed to do and plumbing that constant along to other operations that want to consume it, the VM will become a black box. That makes our lives as VM implementers harder.

To help ourselves out, now is a good time to add some diagnostic logging to the VM like we did with chunks themselves. In fact, we’ll even reuse the same code. We don’t want this logging enabled all the time — it’s just for we VM hackers, not Lox users — so first we create a flag to hide it behind:

#include <stdint.h>
common.h

#define DEBUG_TRACE_EXECUTION

#endif

When this flag is defined, the VM disassembles and prints each instruction right before executing it. Where our previous disassembler walked an entire chunk once, statically, this disassembles instructions dynamically, on the fly:

  for (;;) {
vm.c
in run()
#ifdef DEBUG_TRACE_EXECUTION
    disassembleInstruction(vm.chunk, (int)(vm.ip - vm.chunk->code));
#endif

    uint8_t instruction;

Since disassembleInstruction() takes an integer byte offset and we store the current instruction reference as a direct pointer, we first do a little pointer math to convert ip back to a relative offset from the beginning of the bytecode. Then we disassemble the instruction that begins at that byte.

As ever, we need to bring in the declaration of the function before we can call it:

#include "common.h"
vm.c
#include "debug.h"
#include "vm.h"

I know this code isn’t super impressive so far — it’s literally a switch statement wrapped in a for loop but, believe it or not, this is one of the two major components of our VM. With this, we can imperatively execute instructions. Its simplicity is a virtue — the less work it does, the faster it can do it. Contrast this with all of the complexity and overhead we had in jlox with the Visitor pattern for walking the AST.


15 . 2 A Value Stack Manipulator

In addition to imperative side effects, Lox has expressions that produce, modify, and consume values. Thus, our compiled bytecode needs a way to shuttle values around between the different instructions that need them. For example, in:

print 3 - 2;

We obviously need instructions for the constants 3 and 2, the print statement, and the subtraction. But how does the subtraction instruction know that 3 is the minuend and 2 is the subtrahend? How does the print instruction know to print the result of that?

To put a finer point on it, look at this thing right here:

fun echo(n) {
  print n;
  return n;
}

print echo(echo(1) + echo(2)) + echo(echo(4) + echo(5));

I wrapped each subexpression in a call to echo() that prints and returns its argument. That side effect means we can see the exact order of operations.

Don’t worry about the VM for a minute. Think about just the semantics of Lox itself. The operands to an arithmetic operator obviously need to be evaluated before we can perform the operation itself. (It’s pretty hard to add a + b if you don’t know what a and b are.) I haven’t specified this yet, but I’ll go ahead and declare that in Lox, the left-hand side of a binary operator is evaluated before the right.

Here is the syntax tree for the print statement:

The AST for the example
statement, with numbers marking the order that the nodes are evaluated.

Given left-to-right evaluation, and the way the expressions are nested, any correct Lox implementation must print these numbers in this order:

1  // from echo(1)
2  // from echo(2)
3  // from echo(1 + 2)
4  // from echo(4)
5  // from echo(5)
9  // from echo(4 + 5)
12 // from print 3 + 9

Our old jlox interpreter accomplishes this by recursively traversing the AST. It does a post-order traversal. First it recurses down the left operand branch, then the right operand, then finally it evaluates the node itself.

After evaluating the left operand, jlox needs to store that result somewhere temporarily while it’s busy traversing down through the right operand tree. We used a local variable in Java for that. Our recursive tree-walk interpreter creates a unique Java call frame for each node being evaluated, so we could have as many of these local variables as we needed.

In clox, our run() function is not recursive — the nested expression tree is flattened out into a linear series of instructions. We don’t have the luxury of using C local variables, so how and where should we store these temporary values? You can probably guess already, but I want to really drill into this because it’s an aspect of programming that we take for granted, but we rarely learn why computers are architected this way.

Let’s do a weird exercise. We’ll walk through the execution of the above program a step at a time:

The series of instructions with
bars showing which numbers need to be preserved across which instructions.

On the left are the steps of code. On the right are the values we’re tracking. Each bar represents a number. It starts when the value is first produced — either a constant or the result of an addition. The length of the bar tracks when a previously-produced value needs to be kept around, and it ends when that value finally gets consumed by an operation.

As you step through, you see values appear and then later get eaten. The longest-lived ones are the values produced from the left-hand side of an addition. Those stick around while we work through the right-hand operand expression.

In the above diagram, I gave each unique number its own visual column. Let’s be a little more parsimonious. Once a number is consumed, we allow its column to be reused for another later value. In other words, we take all of those gaps up there and fill them in with numbers from the right:

Like the previous
diagram, but with number bars pushed to the left, forming a stack.

There’s some interesting stuff going on here. When we shift everything over, each number still manages to stay in a single column for its entire life. Also, there are no gaps left. In other words, whenever a number appears earlier than another, then it will live at least as long as that second one. The first number to appear is the last to be consumed. Hmm… last-in, first-out… Why, that’s a stack!

In the second diagram, each time we introduce a number, we push it onto the stack from the right. When numbers are consumed, they are always popped off from rightmost to left.

Since the temporary values we need to track naturally have stack-like behavior, our VM will use a stack to manage them. When an instruction “produces” a value, it pushes it onto the stack. When it needs to consume one or more values, it gets them by popping them off the stack.

15 . 2 . 1 The VM’s Stack

Maybe this doesn’t seem like a revelation, but I love stack-based VMs. When you first see a magic trick, it feels like something actually magical. But then you learn how it works — usually some mechanical gimmick or misdirection — and the sense of wonder evaporates. There are a couple of ideas in computer science where even after I pulled them apart and learned all the ins and outs, some of the initial sparkle remained. Stack-based VMs are one of those.

As you’ll see in this chapter, executing instructions in a stack-based VM is dead simple. In later chapters, you’ll also discover that compiling a source language to a stack-based instruction set is a piece of cake. And, yet, this architecture is fast enough to be used by production language implementations. It almost feels like cheating at the programming language game.

Alrighty, it’s codin’ time! Here’s the stack:

typedef struct {
  Chunk* chunk;
  uint8_t* ip;
vm.h
  Value stack[STACK_MAX];
  Value* stackTop;
} VM;

We implement the stack semantics ourselves on top of a raw C array. The bottom of the stack — the first value pushed and the last to be popped — is at element zero in the array, and later pushed values follow it.

An array containing the
letters in 'crepe' in order starting at element 0.

Since the stack grows and shrinks as values are pushed and popped, we need to track where the top of the stack is in the array. As with ip, we use a direct pointer instead of an integer index since it’s faster to dereference the pointer than calculate the offset from the index each time we need it.

The pointer points at the array element just past the element containing the top value on the stack. That seems a little odd, but almost every implementation does this. It means we can indicate that the stack is empty by pointing at element zero in the array:

An empty array with
stackTop pointing at the first element.

If we pointed to the top element, then for an empty stack we’d need to point at element -1. That’s undefined in C. As we push values onto the stack…

An array with 'c' at element
zero.

stackTop always points just past the last item…

An array with 'c', 'r',
'e', 'p', and 'e' in the first five elements.

I remember it like this: stackTop points to where the next value to be pushed will go. The maximum number of values we can store on the stack (for now, at least) is:

#include "chunk.h"
vm.h

#define STACK_MAX 256

typedef struct {

Giving our VM a fixed stack size means it’s possible for some sequence of instructions to push too many values and run out of stack space — the classic “stack overflow”. We could grow the stack dynamically as needed, but for now we’ll keep it simple. Since VM uses Value, we need to include its declaration:

#include "chunk.h"
vm.h
#include "value.h"

#define STACK_MAX 256

Now that VM has some interesting state, we get to initialize it:

void initVM() {
vm.c
in initVM()
  resetStack();
}

That uses this helper function:

vm.c
static void resetStack() {
  vm.stackTop = vm.stack;
}

Since the stack array is declared directly inline in the VM struct, we don’t need to allocate it. We don’t even need to clear the unused cells in the array — we simply won’t access them until after values have been stored in them. The only initialization we need is to set stackTop to point to the beginning of the array to indicate that the stack is empty.

The stack protocol supports two operations:

InterpretResult interpret(Chunk* chunk);
vm.h
add after interpret()
void push(Value value);
Value pop();

#endif

You can push a new value onto the top of the stack, and you can pop the most recently pushed value back off. Here’s the first function:

vm.c
add after freeVM()
void push(Value value) {
  *vm.stackTop = value;
  vm.stackTop++;
}

If you’re rusty on your C pointer syntax and operations, this is a good warm-up. The first line stores value in the array element at the top of the stack. Remember, stackTop points just past the last used element, at the next available one. This stores the value in that slot.

Then we increment the pointer itself to point to the next unused slot in the array now that the previous slot is occupied. Popping is the mirror image:

vm.c
add after push()
Value pop() {
  vm.stackTop--;
  return *vm.stackTop;
}

First, we move the stack pointer back to get to the most recent used slot in the array. Then we look up the value at that index and return it. We don’t need to explicitly “remove” it from the array — moving stackTop down is enough to mark that slot as no longer in use.

We have a working stack, but it’s hard to see that it’s working. When we start implementing more complex instructions and compiling and running larger pieces of code, we’ll end up with a lot of values crammed into that array. It would make our lives as VM hackers easier if we had some visibility into the stack.

To that end, whenever we’re tracing execution, we’ll also show the current contents of the stack before we interpret each instruction:

#ifdef DEBUG_TRACE_EXECUTION
vm.c
in run()
    printf("          ");
    for (Value* slot = vm.stack; slot < vm.stackTop; slot++) {
      printf("[ ");
      printValue(*slot);
      printf(" ]");
    }
    printf("\n");
    disassembleInstruction(vm.chunk, (int)(vm.ip - vm.chunk->code));

We loop, printing each value in the array, starting at the first (bottom of the stack) and ending when we reach the top. This lets us observe the effect of each instruction on the stack. The output is pretty verbose, but it’s useful when we’re tracking down a bug deep in the sewers of the interpreter.

Stack in hand, let’s revisit our two instructions. First up:

      case OP_CONSTANT: {
        Value constant = READ_CONSTANT();
vm.c
in run()
replace 2 lines
        push(constant);
        break;

In the last chapter, I was hand-wavey about how the OP_CONSTANT instruction “loads” a constant. Now that we have a stack you know what it means to actually produce a value: it gets pushed onto the stack.

      case OP_RETURN: {
vm.c
in run()
        printValue(pop());
        printf("\n");
        return INTERPRET_OK;

Then we make OP_RETURN pop the stack and print the top value before exiting. When we add support for real functions to clox, we’ll change this code. But, for now, it gives us a way to get the VM executing simple instruction sequences and displaying the result.

15 . 3 An Arithmetic Calculator

The heart and soul of our VM are in place now. The bytecode loop dispatches and executes instructions. The stack grows and shrinks as values flow through it. The two halves work, but it’s hard to get a feel for how cleverly they interact with only the two rudimentary instructions we have so far. So let’s teach our interpreter to do arithmetic.

We’ll start with the simplest arithmetic operation, unary negation:

var a = 123;
print -a; // -123.

The prefix - operator takes one operand, the value to negate. It produces a single result. We aren’t fussing with a parser yet, but we can add the bytecode instruction that the above syntax will compile to:

  OP_CONSTANT,
chunk.h
  OP_NEGATE,
  OP_RETURN,

We execute it like so:

      }
vm.c
in run()
      case OP_NEGATE:   push(-pop()); break;
      case OP_RETURN: {

It needs a value to operate on, so it gets it by popping it from the stack. It negates that, then pushes the result back on for later instructions to use. Doesn’t get much easier than that. We can disassemble it too:

    case OP_CONSTANT:
      return constantInstruction("OP_CONSTANT", chunk, offset);
debug.c
in disassembleInstruction()
    case OP_NEGATE:
      return simpleInstruction("OP_NEGATE", offset);
    case OP_RETURN:

And we can try it out in our test chunk:

  writeChunk(&chunk, constant, 123);
main.c
in main()
  writeChunk(&chunk, OP_NEGATE, 123);

writeChunk(&chunk, OP_RETURN, 123);

After loading the constant, but before returning, we execute the negate instruction. That replaces the constant on the stack with its negation. Then the return instruction prints that out:

-1.2

Magical!

15 . 3 . 1 Binary operators

OK, unary operators aren’t that impressive. We still only ever have a single value on the stack. To really see some depth, we need binary operators. Lox has four binary arithmetic operators: addition, subtraction, multiplication, and division. We’ll go ahead and implement them all at the same time:

  OP_CONSTANT,
chunk.h
  OP_ADD,
  OP_SUBTRACT,
  OP_MULTIPLY,
  OP_DIVIDE,
  OP_NEGATE,

Back in the bytecode loop, they look like:

      }
vm.c
in run()
      case OP_ADD:      BINARY_OP(+); break;
      case OP_SUBTRACT: BINARY_OP(-); break;
      case OP_MULTIPLY: BINARY_OP(*); break;
      case OP_DIVIDE:   BINARY_OP(/); break;
      case OP_NEGATE:   push(-pop()); break;

The only difference between these four instructions is which underlying C operator they ultimately use to combine the two operands. Surrounding that core arithmetic expression is some boilerplate code to pull values off the stack and push the result. When we later add dynamic typing, that boilerplate will grow. To avoid repeating that code four times, I wrapped it up in a macro:

#define READ_CONSTANT() (vm.chunk->constants.values[READ_BYTE()])
vm.c
in run()

#define BINARY_OP(op) \ do { \ double b = pop(); \ double a = pop(); \ push(a op b); \ } while (false)

I admit this is a fairly adventurous use of the C preprocessor. I hesitated to do this, but you’ll be glad in later chapters when we need to add the type checking for each operand and stuff. It would be a chore to walk you through the same code four times.

If you aren’t familiar with the trick already, that outer do-while loop probably looks really weird. This macro needs to expand to a series of statements. To be careful macro authors, we want to ensure those statements all end up in the same scope when the macro is expanded. Imagine if you defined:

#define WAKE_UP() makeCoffee(); drinkCoffee();

And then used it like:

if (morning) WAKE_UP();

The intent is to execute both statements of the macro body only if morning is true. But it expands to:

if (morning) makeCoffee(); drinkCoffee();

Oops. The if only attaches to the first statement. You might think you could fix this using a block:

#define WAKE_UP() { makeCoffee(); drinkCoffee(); }

That’s better, but you still risk:

if (morning)
  WAKE_UP();
else
  sleepIn();

Now you get a compile error on the else because of that trailing ; after the macro’s block. Using a do-while loop in the macro looks funny, but it gives you a way to contain multiple statements inside a block that also permits a semicolon at the end.

Where were we? Right, so what the body of that macro does is straightforward. A binary operator takes two operands, so it pops twice. It performs the operation on those two values and then pushes the result.

Pay close attention to the order of the two pops. Note that we assign the first popped operand to b, not a. It looks backwards. When the operands themselves are calculated, the left is evaluated first, then the right. That means the left operand gets pushed before the right operand. So the right operand will be on top of the stack. Thus, the first value we pop is b.

For example, if we compile 3 - 1, the data flow between the instructions looks like so:

A sequence of instructions
with the stack for each showing how pushing and then popping values reverses
their order.

As we did with the other macros inside run(), we clean up after ourselves at the end of the function:

#undef READ_CONSTANT
vm.c
in run()
#undef BINARY_OP
}

Last is disassembler support:

    case OP_CONSTANT:
      return constantInstruction("OP_CONSTANT", chunk, offset);
debug.c
in disassembleInstruction()
    case OP_ADD:
      return simpleInstruction("OP_ADD", offset);
    case OP_SUBTRACT:
      return simpleInstruction("OP_SUBTRACT", offset);
    case OP_MULTIPLY:
      return simpleInstruction("OP_MULTIPLY", offset);
    case OP_DIVIDE:
      return simpleInstruction("OP_DIVIDE", offset);
    case OP_NEGATE:

The arithmetic instructions are simple, like OP_RETURN. Even though the arithmetic operators take operands — which are found on the stack — the arithmetic bytecode instructions do not.

Let’s put some of our new instructions through their paces by evaluating a larger expression:

The expression being
evaluated: -((1.2 + 3.4) / 5.6)

Building on our existing example chunk, here’s the additional instructions we need:

  int constant = addConstant(&chunk, 1.2);
  writeChunk(&chunk, OP_CONSTANT, 123);
  writeChunk(&chunk, constant, 123);
main.c
in main()

constant = addConstant(&chunk, 3.4); writeChunk(&chunk, OP_CONSTANT, 123); writeChunk(&chunk, constant, 123); writeChunk(&chunk, OP_ADD, 123); constant = addConstant(&chunk, 5.6); writeChunk(&chunk, OP_CONSTANT, 123); writeChunk(&chunk, constant, 123); writeChunk(&chunk, OP_DIVIDE, 123);
  writeChunk(&chunk, OP_NEGATE, 123);

  writeChunk(&chunk, OP_RETURN, 123);

The addition goes first. The instruction for the left constant, 1.2, is already there, so we add another for 3.4. Then we add those two using OP_ADD, leaving it on the stack. That covers the left side of the division. Next we push the 5.6, and divide it by the result of the addition. Finally, we negate the result of that.

Note how the output of the OP_ADD implicitly flows into being an operand of OP_DIVIDE without either instruction being directly coupled to each other. That’s the magic of the stack. It lets us freely compose instructions without them needing any complexity or awareness of the data flow. The stack acts like a shared workspace that they all read from and write to.

In this tiny example chunk, the stack still only gets two values tall, but when we start compiling Lox source to bytecode, we’ll have chunks that use much more of the stack. In the meantime, try playing around with this hand-authored chunk to calculate different nested arithmetic expressions and see how values flow through the instructions and stack.

You may as well get it out of your system now. This is the last chunk we’ll build by hand. When we next revisit bytecode, we will be writing a compiler to generate it for us.

Challenges

  1. What bytecode instruction sequences would you generate for the following expressions:

    1 * 2 + 3
    1 + 2 * 3
    3 - 2 - 1
    1 + 2 * 3 - 4 / -5
    

    (Remember that Lox does not have a syntax for negative number literals, so the -5 is negating the number 5.)

  2. If we really wanted a minimal instruction set, we could eliminate either OP_NEGATE or OP_SUBTRACT. Show the bytecode instruction sequence you would generate for:

    4 - 3 * -2
    

    First, without using OP_NEGATE. Then, without using OP_SUBTRACT.

    Given the above, do you think it makes sense to have both instructions? Why or why not? Are there any other redundant instructions you would consider including?

  3. Our VM’s stack has a fixed size, and we don’t check if pushing a value overflows it. This means the wrong series of instructions could cause our interpreter to crash or go into undefined behavior. Avoid that by dynamically growing the stack as needed.

    What are the costs and benefits of doing so?

Design Note: Register-Based Bytecode

For the remainder of this book, we’ll meticulously implement an interpreter around a stack-based bytecode instruction set. There’s another family of bytecode architectures out there — register-based. Despite the name, these bytecode instructions aren’t quite as difficult to work with as the registers in an actual chip like x64. With real hardware registers, you usually only have a handful for the entire program, so you spend a lot of effort trying to use them efficiently and shuttling stuff in and out of them.

In a register-based VM, you still have a stack. Temporary values still get pushed onto it and popped when no longer needed. The main difference is that instructions can read their inputs from anywhere in the stack and can store their outputs into specific stack slots.

Take this little Lox script:

var a = 1;
var b = 2;
var c = a + b;

In our stack-based VM, the last statement will get compiled to something like:

load <a>  // Read local variable a and push onto stack.
load <b>  // Read local variable b and push onto stack.
add       // Pop two values, add, push result.
store <c> // Pop value and store in local variable c.

(Don’t worry if you don’t fully understand the load and store instructions yet. We’ll go over them in much greater detail when we implement variables.) We have four separate instructions. That means four times through the bytecode interpret loop, four instructions to decode and dispatch. It’s at least seven bytes of code — four for the opcodes and another three for the operands identifying which locals to load and store. Three pushes and three pops. A lot of work!

In a register-based instruction set, instructions can read from and store directly into local variables. The bytecode for the last statement above looks like:

add <a> <b> <c> // Read values from a and b, add, store in c.

The add instruction is bigger — it has three instruction operands that define where in the stack it reads its inputs from and writes the result to. But since local variables live on the stack, it can read directly from a and b and then store the result right into c.

There’s only a single instruction to decode and dispatch, and the whole thing fits in four bytes. Decoding is more complex because of the additional operands, but it’s still a net win. There’s no pushing and popping or other stack manipulation.

The main implementation of Lua used to be stack-based. For Lua 5.0, the implementers switched to a register instruction set and noted a speed improvement. The amount of improvement, naturally, depends heavily on the details of the language semantics, specific instruction set, and compiler sophistication, but that should get your attention.

That raises the obvious question of why I’m going to spend the rest of the book doing a stack-based bytecode. Register VMs are neat, but they are quite a bit harder to write a compiler for. For what is likely to be your very first compiler, I wanted to stick with an instruction set that’s easy to generate and easy to execute. Stack-based bytecode is marvelously simple.

It’s also much better known in the literature and the community. Even though you may eventually move to something more advanced, it’s a good common ground to share with the rest of your language hacker peers.