30

Optimization

The evening’s the best part of the day. You’ve done your day’s work. Now you can put your feet up and enjoy it.

Kazuo Ishiguro, The Remains of the Day

If I still lived in New Orleans, I’d call this chapter a lagniappea little something extra given for free to a customer. You’ve got a whole book and a complete virtual machine already, but I want you to have some more fun hacking on cloxthis time purely for performance. We’ll apply two very different optimizations to our virtual machine. In the process, you’ll get a feel for measuring and improving the performance of a language implementation or any program, really.

30 . 1Measuring Performance

Optimization means taking a working application and improving its performance along any number of axes. In contrast with other programming, it doesn’t aim to change what the program does, instead, it reduces the number of resources it takes to do so. We usually assume optimization is for runtime speed. That tends to be the main metric we optimize, but it’s often important to reduce memory usage, startup time, persistent storage size, or network bandwidth. All physical resources have some costeven if the cost is mostly in wasted human timeso optimization work often pays off.

There was a time in the early days of computing that a skilled programmer could hold the entire hardware architecture and compiler pipeline in their head and understand a program’s performance just by thinking real hard. Those days are long gone, separated from the present by microcode, cache lines, branch prediction, deep compiler pipelines, and mammoth instruction sets. We like to pretend C is a “low level” language, but the stack of technology between printf("Hello, world!"); and a greeting appearing on screen is now miles tall.

Optimization today is an empirical science. Our program is a border collie sprinting through the hardware’s obstacle course. If we want her to reach the end faster, we can’t just sit and ruminate on canine physiology until enlightenment strikes. Instead, we need to observe her performance, see where she stumbles, and then find faster paths for her to take.

Much of that process is particular to one dog on one obstacle course. When we write optimizations for our virtual machine, we also can’t assume that we will make all Lox programs run faster on all hardware. Different Lox programs stress different areas of the VM, and different architectures have their own strengths and weaknesses.

30 . 1 . 1Benchmarks

When we add new functionality, we validate correctness by writing testsLox programs that use a feature and validate the VM’s behavior. Tests pin down semantics and ensure we don’t break existing features when we add new ones. We have similar needs when it comes to performance:

  1. How do we validate that an optimization does improve performance and by how much?

  2. How do we ensure that other unrelated changes don’t regress performance?

The Lox programs we write to accomplish those goals are benchmarks. These are carefully crafted programs that stress some part of the language implementation. They measure not what the program does, but how long it takes to do it.

By measuring the performance of a benchmark before and after a change, you can see what your change does. When you land an optimization, all of the tests should behave exactly the same as they did before, but hopefully the benchmarks run faster.

Once you have an entire suite of benchmarks, you can measure not just that an optimization changes performance, but on which kinds of code. Often you’ll find that some benchmarks get faster while others get slower. Then you have to make hard decisions about what kinds of code your language implementation optimizes for.

The suite of benchmarks you choose to write is a key part of that decision. In the same way that your tests encode your choices around what correct behavior looks like, your benchmarks are the embodiment of your priorities when it comes to performance. They will guide which optimizations you implement, so choose your benchmarks carefully and don’t forget to periodically reflect on whether they are helping you reach your larger goals.

Benchmarking is a subtle art. Like tests, you need to balance not overfitting to your implementation while ensuring that the benchmark does actually tickle the code paths that you care about. When you measure performance, you need to compensate for variance caused by CPU throttling, caching, and other weird hardware and operating system quirks. I won’t give you a whole sermon here, but treat benchmarking as its own skill that improves with practice.

30 . 1 . 2Profiling

OK, so you’ve got a few benchmarks now. You want to make them go faster. Now what? First of all, let’s assume you’ve done all the obvious easy work. You are using the right algorithms and data structuresor, at least, you aren’t using ones that are aggressively wrong. I don’t consider using a hash table instead of a linear search through a huge unsorted array “optimization” so much as “good software engineering”.

Since the hardware is too complex to reason about our program’s performance from first principles, we have to go out into the field. That means profiling. A profiler, if you’ve never used one, is a tool that runs your program and tracks hardware resource use as the code executes. Simple ones show you how much time was spent in each function in your program. Sophisticated ones log data cache misses, instruction cache misses, branch mispredictions, memory allocations, and all sorts of other metrics.

There are many profilers out there for various operating systems and languages. On whatever platform you program, it is worth getting familiar with a decent profiler. You don’t need to be a master. I have learned things within minutes of throwing a program at a profiler that would have taken me days to discover on my own through trial and error. Profilers are wonderful, magical tools.

30 . 2Faster Hash Table Probing

Enough pontificating, let’s get some performance charts going up and to the right. The first optimization we’ll do, it turns out, is about the tiniest possible change we could make to our VM. There’s a lot of scaffolding around the change we’ll have to deal with, but the main optimization itself is pint-sized.

When I first got the bytecode virtual machine that clox is descended from working, I did what any self-respecting VM hacker would do. I cobbled together a couple of benchmarks, fired up a profiler, and ran those scripts through my interpreter. In a dynamically-typed language like Lox, a large fraction of user code is field accesses and method calls, so one of my benchmarks looked something like this:

class Zoo {
  init() {
    this.aardvark = 1;
    this.baboon   = 1;
    this.cat      = 1;
    this.donkey   = 1;
    this.elephant = 1;
    this.fox      = 1;
  }
  ant()    { return this.aardvark; }
  banana() { return this.baboon; }
  tuna()   { return this.cat; }
  hay()    { return this.donkey; }
  grass()  { return this.elephant; }
  mouse()  { return this.fox; }
}

var zoo = Zoo();
var sum = 0;
var start = clock();
while (sum < 100000000) {
  sum = sum + zoo.ant()
            + zoo.banana()
            + zoo.tuna()
            + zoo.hay()
            + zoo.grass()
            + zoo.mouse();
}

print clock() - start;
print sum;

If you’ve never seen a benchmark before, this might seem ludicrous. What is going on here? The program itself doesn’t intend to do anything useful. What it does do is call a bunch of methods and access a bunch of fields since those are the parts of the language we’re interested in. Fields and methods live in hash tables, so it takes care to populate at least a few interesting keys in those tables. That is all wrapped in a big loop to ensure our profiler has enough execution time to dig in and see where the cycles are going.

Before I tell you what my profiler showed me, spend a minute taking a few guesses. Where in clox’s codebase do you think the VM spent most of its time? Is there any code we’ve written in previous chapters that you suspect is particularly slow?

Here’s what I found: Naturally, the function with the greatest inclusive time is run(). (Inclusive time means the total time spent in some function and all other functions it callsthe total time between when you enter the function and when it returns.) Since run() is the main bytecode execution loop, it drives everything.

Inside run(), there are small chunks of time sprinkled in various cases in the bytecode switch for common instructions like OP_POP, OP_RETURN, and OP_ADD. The big heavy instructions are OP_GET_GLOBAL with 17% of the execution time, OP_GET_PROPERTY at 12%, and OP_INVOKE which takes a whopping 42% of the total running time.

So we’ve got three hotspots to optimize? Actually, no. Because it turns out those three instructions spend almost all of their time inside calls to the same function: tableGet(). That function claims a whole 72% of the execution time (again, inclusive). Now, in a dynamically-typed language, we expect to spend a fair bit of time looking stuff up in hash tablesit’s sort of the price of dynamism. But, still, wow.

30 . 2 . 1Slow key wrapping

If you take a look at tableGet(), you’ll see it’s mostly a wrapper around a call to findEntry() where the actual hash table lookup happens. To refresh your memory, here it is in full:

static Entry* findEntry(Entry* entries, int capacity,
                        ObjString* key) {
  uint32_t index = key->hash % capacity;
  Entry* tombstone = NULL;

  for (;;) {
    Entry* entry = &entries[index];

    if (entry->key == NULL) {
      if (IS_NIL(entry->value)) {
        // Empty entry.
        return tombstone != NULL ? tombstone : entry;
      } else {
        // We found a tombstone.
        if (tombstone == NULL) tombstone = entry;
      }
    } else if (entry->key == key) {
      // We found the key.
      return entry;
    }

    index = (index + 1) % capacity;
  }
}

When running that previous benchmarkon my machine, at leastthe VM spends 70% of the total execution time on one line in this function. Any guesses as to which one? No? It’s this:

  uint32_t index = key->hash % capacity;

That pointer dereference isn’t the problem. It’s the little %. It turns out the modulo operator is really slow. Much slower than other arithmetic operators. Can we do something better?

In the general case, it’s really hard to re-implement a fundamental arithmetic operator in user code in a way that’s faster than what the CPU itself can do. After all, our C code ultimately compiles down to the CPU’s own arithmetic operations. If there were tricks we could use to go faster, the chip would already be using them.

However, we can take advantage of the fact that we know more about our problem than the CPU does. We use modulo here to take a key string’s hash code and wrap it to fit within the bounds of the table’s entry array. That array starts out at eight elements and grows by a factor of two each time. We knowand the CPU and C compiler do notthat our table’s size is always a power of two.

Because we’re clever bit twiddlers, we know a faster way to calculate the remainder of a number modulo a power of two: masking. Let’s say we want to calculate 229 modulo 64. The answer is 37, which is not particularly apparent in decimal, but is clearer when you view those numbers in binary:

The bit patterns resulting from 229 % 64 = 37 and 229 & 63 = 37.

On the left side of the illustration, notice how the result (37) is simply the dividend (229) with the highest two bits shaved off? Those two highest bits are the bits at or to the left of the divisor’s single 1 bit.

On the right side, we get the same result by taking 229 and bitwise and-ing it with 63, which is one less than our original power of two divisor. Subtracting one from a power of two gives you a series of 1 bits. That is exactly the mask we need in order to strip out those two leftmost bits.

In other words, you can calculate a number modulo any power of two by simply and-ing it with that power of two minus one. I’m not enough of a mathematician to prove to you that this works, but if you think it through, it should make sense. We can replace that slow modulo operator with a fast subtraction followed by a very fast bitwise and. We simply change the offending line of code to:

  uint32_t index = key->hash & (capacity - 1);

We can go farther. The result of (capacity - 1) changes infrequently, only when the table grows or shrinks. Instead of performing the subtraction on each hash table lookup, we can cache the result and reuse it. In fact, if we cache the mask used to wrap a hash key into the table size, we don’t even need to store the capacity since we can easily calculate that by incrementing the mask.

Instead of storing the entry array capacity in Table, we’ll directly store the bit mask. We’ll keep the same capacity field but its meaning and value are now different. It now stores the mask and its value is always one less than the size of the entry array.

With that interpretation, to wrap a key, we simply apply the mask:

                        ObjString* key) {
table.c
in findEntry()
replace 1 line
  uint32_t index = key->hash & capacity; 
  Entry* tombstone = NULL;               
table.c, in findEntry(), replace 1 line

CPUs love bitwise operators, so it’s hard to improve on that.

Our linear probing search may need to wrap around the end of the array, so there is another modulo in findEntry() to update:

      // We found the key.         
      return entry;                
    }                              
                                   
table.c
in findEntry()
replace 1 line
    index = (index + 1) & capacity;
  }                                
table.c, in findEntry(), replace 1 line

This line didn’t show up in the profile since most searches don’t wrap. Before we can try out these changes and see how they affect the performance, we have to fix every piece of code that uses the capacity field and update it to work with the new value it represents. This is going to be kind of a chore. Sorry.

30 . 2 . 2Hash table changes

There are a few changes to make in adjustCapacity(). First, when we allocate the new array, we calculate the size from the mask:

static void adjustCapacity(Table* table, int capacity) {
table.c
in adjustCapacity()
replace 1 line
  Entry* entries = ALLOCATE(Entry, capacity + 1);       
  for (int i = 0; i < capacity; i++) {                  
table.c, in adjustCapacity(), replace 1 line

Then we iterate over the array to initialize the empty entries:

  Entry* entries = ALLOCATE(Entry, capacity + 1);
table.c
in adjustCapacity()
replace 1 line
  for (int i = 0; i <= capacity; i++) {          
    entries[i].key = NULL;                       
table.c, in adjustCapacity(), replace 1 line

The difference here is that the < checking to exit the loop has become a <=.

  table->count = 0;                           
table.c
in adjustCapacity()
replace 1 line
  for (int i = 0; i <= table->capacity; i++) {
    Entry* entry = &table->entries[i];        
table.c, in adjustCapacity(), replace 1 line

When we free the old array, the garbage collector wants to track how many bytes are released:

    table->count++;                                      
  }                                                      
                                                         
table.c
in adjustCapacity()
replace 1 line
  FREE_ARRAY(Entry, table->entries, table->capacity + 1);
  table->entries = entries;                              
table.c, in adjustCapacity(), replace 1 line

There’s going to be a lot of + 1 and <= in this section. This is fertile ground for subtle off-by-one bugs, so we should tread carefully to ensure we don’t get stung.

Moving up the abstraction stack, we get to the main hash table operations that call adjustCapacity() and findEntry(). When we grow the array before inserting a new entry into it, we need to calculate the capacity:

bool tableSet(Table* table, ObjString* key, Value value) {        
table.c
in tableSet()
replace 2 lines
  if (table->count + 1 > (table->capacity + 1) * TABLE_MAX_LOAD) {
    int capacity = GROW_CAPACITY(table->capacity + 1) - 1;        
    adjustCapacity(table, capacity);                              
table.c, in tableSet(), replace 2 lines

All the way at the beginning, when we first initialize the field:

  table->count = 0;     
table.c
in initTable()
replace 1 line
  table->capacity = -1; 
  table->entries = NULL;
table.c, in initTable(), replace 1 line

As you’ve seen, when we need to calculate the actual array size from the capacity field we get that by adding one. When the hash table is empty, the size is zero, so initializing capacity to -1 correctly yields zero when we increment it.

When copying all of the entries from one hash table to another, we iterate over the source table’s array:

void tableAddAll(Table* from, Table* to) {   
table.c
in tableAddAll()
replace 1 line
  for (int i = 0; i <= from->capacity; i++) {
    Entry* entry = &from->entries[i];        
table.c, in tableAddAll(), replace 1 line

That’s another < to <= change.

The findEntry() function has a sister function, tableFindStrings() that does a hash table lookup for interning strings. We need to make the same changes there that we made in findEntry(). We use capacity as a mask when wrapping the string’s hash key:

ObjString* tableFindString(Table* table, const char* chars, int length,
                           uint32_t hash) {                            
  if (table->count == 0) return NULL;                                  
                                                                       
table.c
in tableFindString()
replace 1 line
  uint32_t index = hash & table->capacity;                             
                                                                       
  for (;;) {                                                           
table.c, in tableFindString(), replace 1 line

And also when the linear probing wraps around:

      return entry->key;                  
    }                                     
                                          
table.c
in tableFindString()
replace 1 line
    index = (index + 1) & table->capacity;
  }                                       
table.c, in tableFindString(), replace 1 line

These are actual optimizations. This function is only called when interning strings, which wasn’t heavily stressed by our benchmark. But a Lox program that created lots of strings might noticeably benefit from this change.

30 . 2 . 3Memory manager changes

There are a few places in the memory management code where we access a hash table’s capacity. First, using a <= in the loop when we mark all of the entries in a table:

void markTable(Table* table) {                
table.c
in markTable()
replace 1 line
  for (int i = 0; i <= table->capacity; i++) {
    Entry* entry = &table->entries[i];        
table.c, in markTable(), replace 1 line

And another one in the loop to remove unmarked entries from the string table:

void tableRemoveWhite(Table* table) {         
table.c
in tableRemoveWhite()
replace 1 line
  for (int i = 0; i <= table->capacity; i++) {
    Entry* entry = &table->entries[i];        
table.c, in tableRemoveWhite(), replace 1 line

Finally, the garbage collector tracks the amount of memory freed, so it needs an accurate count of the entry array size:

void freeTable(Table* table) {                           
table.c
in freeTable()
replace 1 line
  FREE_ARRAY(Entry, table->entries, table->capacity + 1);
  initTable(table);                                      
table.c, in freeTable(), replace 1 line

That’s our last + 1! That was a lot of plumbing just to turn a couple of % operators into a &.

Let’s see if it was worth it. I tweaked that zoological benchmark to count how many batches of 10,000 calls it can run in ten seconds. More batches equals faster performance. On my machine using the unoptimized code, the benchmark gets through 3,192 batches. With this optimization, that jumps to 6,249.

Bar chart comparing the performance before and after the optimization.

That’s almost exactly twice as much work in the same amount of time. We made the VM twice as fast (usual caveat: on this benchmark). That is a massive win when it comes to optimization. Usually you feel good if you can claw a few percentage points here or there. Since methods, fields, and global variables are so prevalent in Lox programs, this tiny optimization improves performance across the board. Almost every Lox program benefits.

Now, the point of this section is not that the modulo operator is profoundly evil and you should stamp it out of every program you ever write. Nor is it that micro-optimization is a vital engineering skill. It’s rare that a performance problem has such a narrow, effective solution. We got lucky.

The point is that we didn’t know that the modulo operator was a performance drain until our profiler told us so. If we had wandered around our VM’s codebase blindly guessing at hotspots, we likely wouldn’t have noticed it. What I want you to take away from this is how important it is to have a profiler in your toolbox.

To reinforce that point, let’s go ahead and run the same benchmark in our now-optimized VM and see what the profiler shows us. On my machine, tableGet() is still a fairly large chunk of execution time. That’s to be expected for a dynamically-typed language. But it has dropped from 72% of the total execution time down to 35%. That’s much more in line with what we’d like to see and shows that our optimization didn’t just make the program faster, but made it faster in the way we expected. Profilers are useful for verifying solutions just as they are for discovering problems.

30 . 3NaN Boxing

This next optimization has a very different feel. Thankfully, despite the odd name, it does not involve punching your grandmother. It’s different, but not, like, that different. With our previous optimization, the profiler told us where the problem was and we merely had to use some ingenuity to come up with a solution.

This optimization is more subtle, and its performance effects more scattered across the virtual machine. The profiler won’t help us come up with this. Instead, it was invented by someone thinking deeply about the lowest levels of machine architecture.

Like the heading says, this optimization is called “NaN boxing” or sometimes “NaN tagging”. Personally I like the latter name because “boxing” tends to imply some kind of heap-allocated representation, but the former seems to be the more widely-used term. This technique changes how we represent values in the VM.

On a 64-bit machine, our Value type takes up 16 bytes. The struct has two fields, a type tag and a union for the payload. The largest fields in the union are an Obj pointer and a double, which are both 8 bytes. To keep the union field aligned to an 8-byte boundary, the compiler adds padding after the tag too:

Byte layout of the 16-byte tagged union Value.

That’s pretty big. If we could cut that down, then the VM could pack more values into the same amount of memory. Most computers have plenty of RAM these days, so the direct memory savings aren’t a huge deal. But a smaller representation means more Values fit in a cache line. That means fewer cache misses which affects speed.

If Values need to be aligned to their largest payload size and a Lox number or Obj pointer needs a full 8 bytes, how can we get any smaller? In a dynamically typed language like Lox, each value needs to carry not just its payload, but enough additional information to determine the value’s type at runtime. If a Lox number is already using the full 8 bytes, where could we squirrel away a couple of extra bits to tell the runtime “this is a number”?

This is one of the perennial problems for dynamic language hackers. It particularly bugs them because statically-typed languages don’t generally have this problem. The type of each value is known at compile time, so no extra memory is needed at runtime to track it. When your C compiler compiles a 32-bit int, the resulting variable gets exactly 32 bits of storage.

Dynamic language folks hate losing ground to the static camp, so they’ve come up with a number of very clever ways to pack type information and a payload into a small number of bits. NaN boxing is one of those. It’s a particularly good fit for languages like JavaScript and Lua where all numbers are double-precision floating point. Lox is in that same boat.

30 . 3 . 1What is (and is not) a number?

Before we start optimizing, we need to really understand how our friend the CPU represents floating-point numbers. Almost all machines today use the same scheme, encoded in the venerable scroll IEEE 754, known to mortals as the “IEEE Standard for Floating-Point Arithmetic”.

In the eyes of your computer, a 64-bit double-precision IEEE floating-point number looks like this:

Bit representation of an IEEE 754 double.

I know that’s a little vague, but this chapter isn’t a deep dive on floating-point representation. If you want to know how the exponent and mantissa play together, there are already better explanations out there than I could write.

The important part for our purposes is that the spec carves out a special case exponent. When all of the exponent bits are set, then instead of just representing a really big number, the value has a different meaning. These values are “not a number” (hence, “NaN”) values. They represent concepts like infinity or the result of division by zero.

Any double whose exponent bits are all set is a NaN, regardless of the mantissa bits. That means there’s lots and lots of different NaN bit patterns. IEEE 754 divides those into two categories. Values where the highest mantissa bit is 0 are called signalling NaNs and the others are quiet NaNs. Signalling NaNs are intended to be the result of erroneous computations, like division by zero. A chip may detect when one of these values is produced and abort a program completely. They may self-destruct if you try to read one.

Quiet NaNs are supposed to be safer to use. They don’t represent useful numeric values, but they should at least not set your hand on fire if you touch them.

Every double with all of its exponent bits set and its highest mantissa bit set is a quiet NaN. That leaves 52 bits unaccounted for. We’ll avoid one of those so that we don’t step on Intel’s “QNaN Floating-Point Indefinite” value, leaving us 51 bits. Those remaining bits can be anything. We’re talking 2,251,799,813,685,248 unique quiet NaN bit patterns.

The bits in a double that make it a quiet NaN.

This means a 64-bit double has enough room to store all of the various different numeric floating-point values and also has room for another 51 bits of data that we can use however we want. That’s plenty of room to set aside a couple of bit patterns to represent Lox’s nil, true, and false values. But what about Obj pointers? Don’t pointers need a full 64 bits too?

Fortunately, we have another trick up our other sleeve. Yes, technically pointers on a 64-bit architecture are 64 bits. But, no architecture I know of actually uses that entire address space. Instead, most widely-used chips today only ever use the low 48 bits. The remaining 16 bits are either unspecified or always zero.

If we’ve got 51 bits, we can stuff a 48-bit pointer in there with three bits to spare. Those three bits are just enough to store tiny type tags to distinguish between nil, Booleans, and Obj pointers.

That’s NaN boxing. Within a single 64-bit double, you can store all of the different floating-point numeric values, a pointer, or any of a couple of other special sentinel values. Half the memory usage of our current Value struct, while retaining all of the fidelity.

What’s particularly nice about this representation is that there is no need to convert a numeric double value into a “boxed” form. Lox numbers are just normal 64-bit doubles. We still need to check their type before we use them, since Lox is dynamically typed, but we don’t need to do any bit shifting or pointer indirection to go from “value” to “number”.

For the other value types, there is a conversion step, of course. But, fortunately, our VM hides all of the mechanism to go from values to raw types behind a handful of macros. Rewrite those to implement NaN boxing and the rest of the VM should just work.

30 . 3 . 2Conditional support

I know the details of this new representation aren’t clear in your head yet. Don’t worry, they will crystallize as we work through the implementation. Before we get to that, we’re going to put some compile-time scaffolding in place.

For our previous optimization, we rewrote the previous slow code and called it done. This one is a little different. NaN boxing relies on some very low-level details of how a chip represents floating-point numbers and pointers. It probably works on most CPUs you’re likely to encounter, but you can never be totally sure.

It would suck if our VM completely lost support for an architecture just because of its value representation. To avoid that, we’ll maintain support for both the old tagged union implementation of Value and the new NaN-boxed form. We select which representation we want at compile time using this flag:

#include <stdint.h>     
                        
common.h
#define NAN_BOXING      
#define DEBUG_PRINT_CODE
common.h

If that’s defined, the VM uses the new form. Otherwise, it reverts to the old style. The few pieces of code that care about the details of the value representationmainly the handful of macros for wrapping and unwrapping Valuesvary based on whether this flag is set. The rest of the VM can continue along its merry way.

Most of the work happens in the “value” module where we add a section for the new type:

typedef struct sObjString ObjString;
                                    
value.h
#ifdef NAN_BOXING                   

typedef uint64_t Value;             

#else                               

typedef enum {                      
value.h

When NaN boxing is enabled, the actual type of a Value is a flat unsigned 64-bit integer. We could use double instead which would make the macros for dealing with Lox numbers a little simpler. But all of the other macros need to do bitwise operations and uint64_t is a much friendlier type for that. Outside of this module, the rest of the VM doesn’t really care one way or the other.

Before we start re-implementing those macros, we close the #else branch of the #ifdef at the end of the definitions for the old representation:

#define OBJ_VAL(object)   ((Value){ VAL_OBJ, { .obj = (Obj*)object } })
value.h

#endif                                                                 
                                                                       
typedef struct {                                                       
value.h

Our remaining task is simply to fill in that first #ifdef section with new implementations of all the stuff already in the #else side. We’ll work through it one value type at a time, from easiest to hardest.

30 . 3 . 3Numbers

We’ll start with numbers since they have the most direct representation under NaN boxing. To “convert” a C double to a NaN-boxed clox Value, we don’t need to touch a single bitthe representation is exactly the same. But we do need to convince our C compiler of that fact, which we made harder by defining Value to be a uint64_t.

We need to get the compiler to take a set of bits that it thinks are a double and use those same bits as a uint64_t, or vice versacalled type punning. The closest thing to an officially supported way of doing that in C is through a union:

typedef uint64_t Value;
value.h

typedef union {        
  uint64_t bits;       
  double num;          
} DoubleUnion;         
                       
#else                  
value.h

The idea is that you store a value into one of the union’s fields and then read out the other field. Voilà, the bits have been transmuted from the first field’s type to the second’s. Alas, I don’t know an elegant way to pack those operations into a single expression, so the conversion macros will call out to helper functions. Here’s the first macro:

typedef uint64_t Value;                
value.h

#define NUMBER_VAL(num) numToValue(num)
                                       
typedef union {                        
value.h

It hands off the double to:

} DoubleUnion;                              
value.h
add after union DoubleUnion

static inline Value numToValue(double num) {
  DoubleUnion data;                         
  data.num = num;                           
  return data.bits;                         
}                                           
                                            
#else                                       
value.h, add after union DoubleUnion

This creates a temporary union, stores the double in it, and then yanks out the uint64_t whose bits overlap that in memory. This code appears slow: a function call, a local variable, an assignment, and a read. Fortunately, compilers are smart enough to optimize it all away.

“Unwrapping” a Lox number is the mirror image:

typedef uint64_t Value;                
value.h

#define AS_NUMBER(v)    valueToNum(v)  
                                       
#define NUMBER_VAL(num) numToValue(num)
value.h

That macro calls this function:

} DoubleUnion;                                
value.h
add after union DoubleUnion

static inline double valueToNum(Value value) {
  DoubleUnion data;                           
  data.bits = value;                          
  return data.num;                            
}                                             
                                              
static inline Value numToValue(double num) {  
value.h, add after union DoubleUnion

It works exactly the same except we flip which field we write and which we read. Again, the compiler will eliminate all of it.

That was a lot of code to ultimately do nothing but silence the C type checker. Doing a runtime type test on a Lox number is a little more interesting. If all we have are exactly the bits for a double, how do we tell that it is a double?

It’s time to get bit twiddling:

typedef uint64_t Value;                       
value.h

#define IS_NUMBER(v)    (((v) & QNAN) != QNAN)
                                              
#define AS_NUMBER(v)    valueToNum(v)         
value.h

We know that every Value that is not a number will use a special quiet NaN representation. And we presume we have correctly avoided any of the meaningful NaN representations that may actually be produced by doing arithmetic on numbers.

If the double has all of its NaN bits set, and the quiet NaN bit set, and one more for good measure, we can be pretty certain it is one of the bit patterns we ourselves have set aside for other types. To check that, we mask out all of the bits except for our set of quiet NaN bits. If all of those bits are set, it must be a NaN-boxed value of some other Lox type.

The set of quiet NaN bits are declared like this:

#ifdef NAN_BOXING                              
value.h

#define QNAN     ((uint64_t)0x7ffc000000000000)
                                               
typedef uint64_t Value;                        
value.h

It would be nice if C supported binary literals. But if you do the conversion, you’ll see that value is the same as:

The quiet NaN bits.

This is exactly all of the exponent bits, plus the quiet NaN bit, plus one extra to dodge that Intel value.

30 . 3 . 4Nil, true, and false

The next type to handle is nil. That’s pretty simple since there’s only one nil value and thus we only need a single bit pattern to represent it. There are two other singleton values, the two Booleans true and false. This calls for three unique bit patterns.

Two bits gives us four different combinations, which is plenty. We claim the two lowest bits of our unused mantissa space as a “type tag” to determine which of these three singleton values we’re looking at. The three type tags are defined like so:

#define QNAN     ((uint64_t)0x7ffc000000000000)
value.h

#define TAG_NIL   1 // 01.                     
#define TAG_FALSE 2 // 10.                     
#define TAG_TRUE  3 // 11.                     
                                               
typedef uint64_t Value;                        
value.h

Our representation of nil is thus all of the bits required to define our quiet NaN representation along with the nil type tag bits:

The bit representation of the nil value.

In code, that’s:

#define AS_NUMBER(v)    valueToNum(v)                      
                                                           
value.h
#define NIL_VAL         ((Value)(uint64_t)(QNAN | TAG_NIL))
#define NUMBER_VAL(num) numToValue(num)                    
value.h

We simply bitwise or the quiet NaN bits and the type tag, and then do a little cast dance to teach the C compiler what we want those bits to mean.

Since nil has only a single bit representation, we can use equality on uint64_t to see if a Value is nil:

typedef uint64_t Value;                       
                                              
value.h
#define IS_NIL(v)       ((v) == NIL_VAL)      
#define IS_NUMBER(v)    (((v) & QNAN) != QNAN)
value.h

You can guess how we define the true and false values:

#define AS_NUMBER(v)    valueToNum(v)                        
                                                             
value.h
#define FALSE_VAL       ((Value)(uint64_t)(QNAN | TAG_FALSE))
#define TRUE_VAL        ((Value)(uint64_t)(QNAN | TAG_TRUE)) 
#define NIL_VAL         ((Value)(uint64_t)(QNAN | TAG_NIL))  
value.h

The bits look like this:

The bit representation of the true and false values.

To convert a C bool into a Lox Boolean, we rely on these two singleton values and the good old conditional operator:

#define AS_NUMBER(v)    valueToNum(v)                        
                                                             
value.h
#define BOOL_VAL(b)     ((b) ? TRUE_VAL : FALSE_VAL)         
#define FALSE_VAL       ((Value)(uint64_t)(QNAN | TAG_FALSE))
value.h

There’s probably a more clever bitwise way to do this, but my hunch is that the compiler can figure one out faster than I can. Going the other direction is simpler:

#define IS_NUMBER(v)    (((v) & QNAN) != QNAN)
                                              
value.h
#define AS_BOOL(v)      ((v) == TRUE_VAL)     
#define AS_NUMBER(v)    valueToNum(v)         
value.h

Since we know there are exactly two Boolean bit representations in Loxunlike in C where any non-zero value can be considered “true”if it ain’t true, it must be false. This macro does assume you only call it on a Value that you know is a Lox Boolean. To check that, there’s one more macro:

typedef uint64_t Value;                                 
                                                        
value.h
#define IS_BOOL(v)      (((v) & FALSE_VAL) == FALSE_VAL)
#define IS_NIL(v)       ((v) == NIL_VAL)                
value.h

That looks a little weird. It kind of looks like it only checks for false but what about true? A more obvious macro would look like:

#define IS_BOOL(v) ((v) == TRUE_VAL || (v) == FALSE_VAL)

Unfortunately, that’s not safe. The expansion mentions v twice which means if that expression has any side effects, they will be executed twice. We could have the macro call out to a separate function, but, ugh, what a chore.

Instead, we can take advantage of the bits we chose for the type tags. Note that both true and false have the second bit set and nil does not. If our Value has all of its quiet NaN bits set, we know it’s not a number. And if the second bit is set, we know it’s not nil. It must be true or false. Conveniently, FALSE_VAL is just the right bit pattern to check both the quiet NaN bits and that second tag bit, so we just test against that.

TRUE_VAL & FALSE_VAL is the same as FALSE_VAL & FALSE_VAL.

This is more subtle than I like to be with my bit hackery, but here we are.

30 . 3 . 5Objects

The last value type is the hardest. Unlike the singleton values, there are billions of different pointer values we need to box inside a NaN. This means we need both some kind of tag to indicate that these particular NaNs are Obj pointers and we need room for the addresses themselves.

The tag bits we used for the singleton values are in the region where I decided to store the pointer itself, so we can’t easily use a different bit there to indicate that the value is an object reference. However, there is another bit we aren’t using. Since all our NaN values are not numbersit’s right there in the namethe sign bit isn’t used for anything. We’ll go ahead and use that as the type tag for objects. If one of our quiet NaNs has its sign bit set, then it’s an Obj pointer. Otherwise, it must be one of the previous singleton values.

If the sign bit is set, then the remaining low bits store the pointer to the Obj:

Bit representation of an Obj* stored in a Value.

To convert a raw Obj pointer to a Value, we simply take the pointer and set all of the quiet NaN bits and the sign bit:

#define NUMBER_VAL(num) numToValue(num)                  
value.h
#define OBJ_VAL(obj) \                                   
    (Value)(SIGN_BIT | QNAN | (uint64_t)(uintptr_t)(obj))
                                                         
typedef union {                                          
value.h

The pointer itself is a full 64 bits and, in principle, it could thus overlap with some of those quiet NaN and sign bits. But in practice, at least on the architectures I’ve tested, everything above the 48-th bit in a pointer is always zero. There’s a lot of casting going on here, which I’ve found is necessary to satisfy some of the pickiest C compilers, but the end result is just munging some bits together.

We define the sign bit like so:

#ifdef NAN_BOXING                              
                                               
value.h
#define SIGN_BIT ((uint64_t)0x8000000000000000)
#define QNAN     ((uint64_t)0x7ffc000000000000)
                                               
value.h

To get the Obj pointer back out, we simply mask off all of those extra bits:

#define AS_NUMBER(v)    valueToNum(v)                                
value.h
#define AS_OBJ(v)       ((Obj*)(uintptr_t)((v) & ~(SIGN_BIT | QNAN)))
                                                                     
#define BOOL_VAL(b)     ((b) ? TRUE_VAL : FALSE_VAL)                 
value.h

The tilde (~), if you haven’t done enough bit munging to encounter it before, is bitwise not. It toggles all ones and zeroes in its operand. By masking the value with the bitwise negation of the quiet NaN and sign bits, we clear those bits and let the pointer bits remain.

One last macro:

#define IS_NUMBER(v)    (((v) & QNAN) != QNAN)                          
value.h
#define IS_OBJ(v)       (((v) & (QNAN | SIGN_BIT)) == (QNAN | SIGN_BIT))
                                                                        
#define AS_BOOL(v)      ((v) == TRUE_VAL)                               
value.h

A Value storing an Obj pointer has its sign bit set, but so does any negative number. To tell if a Value is an Obj pointer, we need to check that both the sign bit and all of the quiet NaN bits are set. This is similar to how we detect the type of the singleton values except this time we use the sign bit as the tag.

30 . 3 . 6Value functions

The rest of the VM usually goes through the macros when working with Values, so we are almost done. However, there are a couple of functions in the “value” module that peek inside the otherwise black box of Value and work with its encoding directly. We need to fix those too.

The first is printValue(). It has separate code for each value type. We no longer have an explicit type enum we can switch on so instead we use a series of type tests to handle each kind of value:

void printValue(Value value) {                
value.c
in printValue()
#ifdef NAN_BOXING                             
  if (IS_BOOL(value)) {                       
    printf(AS_BOOL(value) ? "true" : "false");
  } else if (IS_NIL(value)) {                 
    printf("nil");                            
  } else if (IS_NUMBER(value)) {              
    printf("%g", AS_NUMBER(value));           
  } else if (IS_OBJ(value)) {                 
    printObject(value);                       
  }                                           
#else                                         
  switch (value.type) {                       
value.c, in printValue()

This is technically a tiny bit slower than a switch, but compared to the overhead of actually writing to a stream, it’s negligible.

We still support the original tagged union representation, so we keep the old code and enclose it in the #else conditional section:

  }   
value.c
in printValue()
#endif
}     
value.c, in printValue()

The other operation is testing two values for equality:

bool valuesEqual(Value a, Value b) { 
value.c
in valuesEqual()
#ifdef NAN_BOXING                    
  return a == b;                     
#else                                
  if (a.type != b.type) return false;
value.c, in valuesEqual()

It doesn’t get much simpler than that! If the two bit representations are identical, the values are equal. That does the right thing for the singleton values since each has a unique bit representation and they are only equal to themselves. It also does the right thing for Obj pointers, since objects use identity for equalitytwo Obj references are only equal if they point to the exact same object.

It’s mostly correct for numbers too. Most floating-point numbers with different bit representations are distinct numeric values. Alas, IEEE 754 contains a pothole to trip us up. For reasons that aren’t entirely clear to me, the spec mandates that NaN values are not equal to themselves. This isn’t a problem for the special quiet NaNs that we are using for our own purposes. But it’s possible to produce a “real” arithmetic NaN in Lox and if we want to correctly implement IEEE 754 numbers then the resulting value is not supposed to be equal to itself. More concretely:

var nan = 0/0;
print nan == nan;

IEEE 754 says this program is supposed to print “false”. It does the right thing with our old tagged union representation because the VAL_NUMBER case applies == to two values that the C compiler knows are doubles. Thus the compiler generates the right CPU instruction to perform an IEEE floating-point equality.

Our new representation breaks that by defining Value to be a uint64_t. If we want to be fully compliant with IEEE 754, we need to handle this case:

#ifdef NAN_BOXING                                                       
value.c
in valuesEqual()
  if (IS_NUMBER(a) && IS_NUMBER(b)) return AS_NUMBER(a) == AS_NUMBER(b);
  return a == b;                                                        
value.c, in valuesEqual()

I know, it’s weird. And there is a performance cost to doing this type test every time we check two Lox values for equality. If we are willing to sacrifice a little compatibilitywho really cares if NaN is not equal to itself?we could leave this off. I’ll leave it up to you to decide how pedantic you want to be.

Finally, we close the conditional compilation section around the old implementation:

  }   
value.c
in valuesEqual()
#endif
}     
value.c, in valuesEqual()

And that’s it. This optimization is complete, as is our clox virtual machine. That was the last line of new code in the book.

30 . 3 . 7Evaluating performance

The code is done, but we still need to figure out if we actually made anything better with these changes. Evaluating an optimization like this is very different from the previous one. There, we had a clear hotspot visible in the profiler. We fixed that part of the code and could instantly see the hotspot get faster.

The effects of changing the value representation are more diffused. The macros are expanded in place wherever they are used, so the performance changes are spread across the codebase in a way that’s hard for many profilers to track well, especially in an optimized build.

We also can’t easily reason about the effects of our change. We’ve made values smaller, which reduces cache misses all across the VM. But the actual real-world performance effect of that change is highly dependent on the memory use of the Lox program being run. A tiny Lox microbenchmark may not have enough values scattered around in memory for the effect to be noticeable, and even things like the addresses handed out to us by the C memory allocator can impact the results.

If we did our job right, basically everything gets a little faster, especially on larger, more complex Lox programs. But it is possible that the extra bitwise operations we do when NaN-boxing values nullifies the gains from the better memory use. Doing performance work like this is unnerving because you can’t easily prove that you’ve made the VM better. You can’t point to a single surgically targeted microbenchmark and say, “There, see?”

Instead, what we really need is a suite of larger benchmarks. Ideally, they would be distilled from real-world applicationsnot that such a thing exists for a toy language like Lox. Then we can measure the aggregate performance changes across all of those. I did my best to cobble together a handful of larger Lox programs. On my machine, the new value representation seems to make everything roughly 10% faster across the board.

That’s not a huge improvement, especially compared to the profound effect of making hash table lookups faster. I added this optimization in large part because it’s a good example of a certain kind of performance work you may experience, and, honestly, because I think it’s technically really cool. It might not be the first thing I reached for if I were seriously trying to make clox faster. There is probably other lower-hanging fruit.

But, if you find yourself working on a program where all of the easy wins have been taken, then at some point you may want to think about tuning your value representation. This chapter has hopefully shined a light on some of the options you have in that area.

30 . 4Where to Next

We’ll stop here with the Lox language and our two interpreters. We could tinker on it forever, adding new language features and clever speed improvements. But, for this book, I think we’ve reached a natural place to call our work complete. I won’t rehash everything we’ve learned in the past many pages. You were there with me and you remember. Instead, I’d like to take a minute to talk about where you might go from here. What is the next step in your programming language journey?

Most of you probably won’t spend a significant part of your career working in compilers or interpreters. It’s a pretty small slice of the computer science pie, and an even smaller segment of software engineering in industry. That’s OK. Even if you never work on a compiler again in your life, you will certainly use one, and I hope this book has equipped you with a better understanding of how the programming languages you use are designed and implemented.

You have also learned a handful of important, fundamental data structures, and gotten some practice doing low-level profiling and optimization work. That kind of expertise is helpful no matter what domain you program in.

I also hope I gave you a new way of looking at and solving problems. Even if you never work on a language again, you may be surprised to discover how many programming problems can be seen as language-like. Maybe that report generator you need to write can be modeled as a series of stack-based “instructions” that the generator “executes”. That user interface you need to render looks an awful lot like traversing an AST.

If you do want to go further down the programming language rabbit hole, here are some suggestions for which branches in the tunnel to explore:

Or maybe this book has satisfied your craving and you’ll stop here. Whichever way you go, or don’t go, there is one lesson I hope to lodge in your heart. Like I was, you may have initially been intimidated by programming languages. But in these chapters, you’ve seen that even really challenging material can be tackled by us mortals if we get our hands dirty and take it a step at a time. If you can handle compilers and interpreters, you can do that with anything you put your mind to.

Challenges

Assigning homework on the last day of school seems cruel but if you really want something to do during your summer vacation:

  1. Fire up your profiler, run a couple of benchmarks, and look for other hotspots in the VM. Do you see anything in the runtime that you can improve?

  2. Many strings in real-world user programs are small, often only a character or two. This is less of a concern in clox because we intern strings, but most VMs don’t. For those that don’t, heap-allocating a tiny character array for each of those little strings and then representing the value as a pointer to that array is wasteful. Often the pointer is larger than the string’s characters. A classic trick is to have a separate value representation for small strings that stores the characters inline in the value.

    Starting from clox’s original tagged union representation, implement that optimization. Write a couple of relevant benchmarks and see if it helps.

  3. Reflect back on your experience with this book. What parts of it worked well for you? What didn’t? Was it easier for you to learn bottom-up or top-down? Did the illustrations help or distract? Did the analogies clarify or confuse?

    The more you understand your personal learning style, the more effectively you can upload knowledge into your head. You can specifically target material that teaches the way you learn best.