
One of CHICKEN's coolest features has to be its unique approach to garbage collection. When someone asked about implementation details (hi, Arthur!), I knew this would make for an interesting blog post. This post is going to be long and technical, so hold on to your hats! Don't worry if you don't get through this in one sitting.

Prerequisites

There's a whole lot of stuff that we'll need to explain before we get to the actual garbage collector. CHICKEN's garbage collection (GC) strategy is deeply intertwined with its compilation strategy, so I'll start by explaining the basics of that, before we can continue (pun intended) with the actual GC stuff.

A short introduction to continuation-passing style

The essence of CHICKEN's design is a simple yet brilliant idea by Henry Baker, described in his paper CONS Should Not CONS Its Arguments, Part II: Cheney on the M.T.A.. The paper is pretty terse, but it's well-written, so I recommend you check it out before reading on. If you grok everything in it, you probably won't get much out of my blog post and you can stop reading now. If you don't grok it, it's probably a good idea to re-read it later.

Baker's approach assumes a Scheme to C compiler which uses continuation-passing style (CPS) as an internal representation. This is the quintessential internal representation of Scheme programs, going all the way back to the first proper Scheme compiler, RABBIT.

Guy L. Steele (RABBIT's author) did not use CPS to make garbage collection easier. In fact, RABBIT had no GC of its own, as it relied on MacLISP as a target language (which compiled to PDP-10 machine code and had its own garbage collector). Instead, continuations allowed for efficient implementation of nested procedure calls. They eliminated the need for a stack to keep track of this nesting by simply returning the "next thing to do" to a driver loop which took care of invoking it. This made it possible to write down iterative algorithms as recursive functions without causing a stack overflow.

Let's consider a silly program which sums up all the numbers in a list, and shows the result multiplied by two:

(define (calculate-sum lst result)
  (if (null? lst)
      result
      (calculate-sum (cdr lst) (+ result (car lst)))))

(define (show-sum lst)
  (print-number (* 2 (calculate-sum lst 0))))

(show-sum '(1 2 3))

A naive compilation to C would look something like this (brutally simplified):

void entry_point() {
  toplevel();
  exit(0); /* Assume exit(1) is explicitly called elsewhere in case of error. */
}

void toplevel() {
  /* SCM_make_list() & SCM_fx() allocate memory.  "fx" stands for "fixnum". */
  SCM_obj *lst = SCM_make_list(3, SCM_fx(1), SCM_fx(2), SCM_fx(3));
  show_sum(lst);
}

SCM_obj* show_sum(SCM_obj *lst) {
  SCM_obj *result = calculate_sum(lst, SCM_fx(0));
  /* SCM_fx_times() allocates memory. */
  return SCM_print_number(SCM_fx_times(SCM_fx(2), result));
}

SCM_obj* calculate_sum(SCM_obj *lst, SCM_obj *result) {
  if (lst == SCM_NIL) { /* Optimised */
    return result;
  } else {
    /* SCM_fx_plus() allocates memory. */
    SCM_obj *tmp = SCM_cdr(lst);
    SCM_obj *tmp2 = SCM_fx_plus(result, SCM_car(lst));
    return calculate_sum(tmp, tmp2); /* Recur */
  }
}

SCM_obj *SCM_print_number(SCM_obj *data) {
  printf("%d\n", SCM_fx_to_integer(data));
  return SCM_VOID;
}

This particular implementation probably can't use a copying garbage collector like CHICKEN uses, because the SCM_obj pointers which store the Scheme objects' locations would all become invalid. But let's ignore that for now.

Due to the recursive call in calculate_sum(), the stack just keeps growing, and eventually we'll get a stack overflow if the list is too long. Steele argued that this is a silly limitation which results in the proliferation of special-purpose "iteration" constructs found in most languages. Also, he was convinced that this just cramps the programmer's style: we shouldn't have to think about implementation details like the stack size. In his time people often used goto instead of function calls as a performance hack. This annoyed him enough to write a rant about it, which should be required reading for all would-be language designers!

Anyway, a compiler can transparently convert our Scheme program into CPS, which would look something like this after translation to C:

/* Set up initial continuation & toplevel call. */
void entry_point() {
  SCM_cont *cont = SCM_make_cont(1, &toplevel, SCM_exit_continuation);
  SCM_call *call = SCM_make_call(0, cont);
  SCM_driver_loop(call);
}

void SCM_driver_loop(SCM_call *call) {
  /* The trampoline to which every function returns its continuation. */
  while(true)
    call = SCM_perform_continuation_call(call);
}

SCM_call *toplevel(SCM_cont *cont) {
  SCM_cont *next = SCM_make_cont(1, &show_sum, cont);
  SCM_obj *lst = SCM_make_list(3, SCM_fx(1), SCM_fx(2), SCM_fx(3));
  return SCM_make_call(1, next, lst);
}

SCM_call *show_sum(SCM_cont *cont, SCM_obj *lst) {
  SCM_cont *next = SCM_make_cont(1, &show_sum_continued, cont);
  SCM_cont *now = SCM_make_cont(2, &calculate_sum, next);
  return SCM_make_call(2, now, lst, SCM_fx(0));
}

SCM_call *calculate_sum(SCM_cont *cont, SCM_obj *lst, SCM_obj *result) {
  if (lst == SCM_NIL) { /* Optimised */
    return SCM_make_call(1, cont, result);
  } else {
    SCM_obj *tmp = SCM_cdr(lst);
    SCM_obj *tmp2 = SCM_fx_plus(result, SCM_car(lst));
    SCM_cont *now = SCM_make_cont(2, &calculate_sum, cont);
    return SCM_make_call(2, now, tmp, tmp2); /* "Recur" */
  }
}

SCM_call *show_sum_continued(SCM_cont *cont, SCM_obj *result) {
  SCM_cont *now = SCM_make_cont(1, &SCM_print_number, cont);
  SCM_obj *tmp = SCM_fx_times(SCM_fx(2), result);
  return SCM_make_call(1, now, tmp);
}

SCM_call *SCM_print_number(SCM_cont *cont, SCM_obj *data) {
  printf("%d\n", SCM_fx_to_integer(data));
  return SCM_make_call(1, cont, SCM_VOID);
}

In the above code, there are two new data types: SCM_cont and SCM_call.

An SCM_cont represents a Scheme continuation as a C function's address, the number of arguments which it expects (minus one) and another continuation, which indicates where to continue after the C function has finished. This sounds recursive, but as you can see the very first continuation created by entry_point() is a specially prepared one which will cause the process to exit.

An SCM_call is returned to the driver loop by every generated C function: this holds a continuation and the arguments with which to invoke it. SCM_perform_continuation_call() extracts the SCM_cont from the SCM_call and invokes its C function with its continuation and the arguments from the SCM_call. We won't dwell on the details of its implementation now, but assume this is some magic which just works.

You'll also note that the primitives SCM_car(), SCM_cdr(), SCM_fx_plus() and SCM_fx_times() do not accept a continuation. This is a typical optimisation: some primitives can be inlined by the compiler. However, this is not required: you can make them accept a continuation as well, at the cost of further splintering the C functions into small sections; the calculate_sum() function would be split up into 4 separate functions if we did that.

Anyway, going back to the big picture we can see that this continuation-based approach consumes a more or less constant amount of stack space, because each function returns to driver_loop. Baker's fundamental insight was that the stack is there anyway (and it will be used by C), and if we don't need it for tracking function call nesting, why not use it for something else? He proposed to allocate all newly created objects on the stack. Because the stack would hopefully fit the CPU's cache in its entirety, this could give quite a performance benefit.

Generational collection

To understand why keeping new data together on the stack can be faster, it's important to know that most objects are quite short-lived. Most algorithms involve intermediate values, which are accessed quite a bit during a calculation but are no longer needed afterwards. These values need to be stored somewhere in memory. Normally you would store them together with all other objects in the main heap, which may cause fragmentation of said heap. Fragmentation means that related memory references may cross page boundaries. This is slow, because it evicts useful data from the CPU's cache and may even require swapping pages in from disk if the machine is low on memory.

On top of that, generating a lot of intermediate values means generating a lot of garbage, which will trigger many GCs during which a lot of these temporary objects will be cleaned up. However, during these GCs, the remaining longer-lived objects must also be analysed before it can be decided they can stick around.

This is rather wasteful, and it turns out we can avoid doing so much work by categorising objects by age. Objects that have just been created belong to the first generation and are stored in their own space (called the nursery - I'm not kidding!), while those that have survived several GC events belong to older generations, which each have their own space reserved for them. By keeping different generations separated, you do not have to examine long-lived objects of older generations (which are unlikely to be collected) when collecting garbage in a younger generation. This can save us a lot of wasted time.

Managing data on the stack

The Cheney on the M.T.A. algorithm as used by CHICKEN involves only two generations; one generation consisting of newly created objects and the other generation consisting of older objects. In this algorithm, new objects get immediately promoted (or tenured) to the old generation after a GC of the nursery (or stack). Such a GC is called a minor GC, whereas a GC of the heap is called a major GC.

This minor GC is where the novelty lies: objects are allocated on the stack. You might wonder how that can possibly work, considering the lengths I just went through to explain how CPS conversion gets rid of the stack. Besides, by returning to the trampoline function whenever a new continuation is invoked, anything you'd store on the stack would need to get purged (that's how the C calling convention works).

That's right! The way to make this work is pretty counter-intuitive: we go all the way back to the first Scheme to C conversion I showed you and make it even worse. Whenever we want to invoke a continuation, we just call its function. That means that the example program we started out with would compile to this:

/* Same as before */
void entry_point() {
  SCM_cont *cont = SCM_make_cont(1, &toplevel, SCM_exit_continuation);
  SCM_call *call = SCM_make_call(0, cont);
  SCM_driver_loop(call);
}

SCM_call *saved_cont_call; /* Set by SCM_save_call, read by driver_loop */
jmp_buf empty_stack_state; /* Set by driver_loop, read by minor_GC */

void SCM_driver_loop(SCM_call *call) {
  /* Save registers (including stack depth and address in this function) */
  if (setjmp(empty_stack_state))
    call = saved_cont_call; /* Got here via longjmp()? Use stored call */

  SCM_perform_continuation_call(call);
}

void SCM_minor_GC() {
  /* ...
     Copy live data from stack to heap, which is a minor GC.  Described later.
     ... */
  longjmp(empty_stack_state, 1); /* Restore registers (jump back to driver_loop) */
}

void toplevel(SCM_cont *cont) {
  if (!fits_on_stack(SCM_CONT_SIZE(0) + SCM_CALL_SIZE(1) +
                     SCM_FIXNUM_SIZE * 3 + SCM_PAIR_SIZE * 3)) {
    SCM_save_call(0, &toplevel, cont); /* Mutates saved_cont_call */
    SCM_minor_GC(); /* Will re-invoke this function from the start */
  } else {
    /* The below stuff will all fit on the stack, as calculated in the if() */
    SCM_cont *next = SCM_make_cont(1, &show_sum, cont);
    SCM_obj *lst = SCM_make_list(3, SCM_fx(1), SCM_fx(2), SCM_fx(3));
    SCM_call *call = SCM_make_call(1, next, lst);
    SCM_perform_continuation_call(call);
  }
}

void show_sum(SCM_cont *cont, SCM_obj *lst) {
  if (!fits_on_stack(SCM_CONT_SIZE(0) * 2 +
                     SCM_CALL_SIZE(2) + SCM_FIXNUM_SIZE)) {
    SCM_save_call(1, &show_sum, cont, lst);
    SCM_minor_GC();
  } else {
    SCM_cont *next = SCM_make_cont(1, &show_sum_continued, cont);
    SCM_cont *now = SCM_make_cont(2, &calculate_sum, next);
    SCM_call *call = SCM_make_call(2, now, lst, SCM_fx(0));
    SCM_perform_continuation_call(call);
  }
}

void calculate_sum(SCM_cont *cont, SCM_obj *lst, SCM_obj *result) {
  /* This calculation is overly pessimistic as it counts both arms
     of the if(), but this is acceptable */
  if (!fits_on_stack(SCM_CALL_SIZE(1) + SCM_FIXNUM_SIZE +
                     SCM_CONT_SIZE(1) + SCM_CALL_SIZE(2))) {
    SCM_save_call(2, &calculate_sum, cont, lst, result);
    SCM_minor_GC();
  } else {
    if (lst == SCM_NIL) {
      SCM_call *call = SCM_make_call(1, cont, result);
      SCM_perform_continuation_call(call);
    } else {
      SCM_obj *tmp = SCM_cdr(lst);
      SCM_obj *tmp2 = SCM_fx_plus(result, SCM_car(lst));
      SCM_cont *now = SCM_make_cont(2, &calculate_sum, cont);
      SCM_call *call = SCM_make_call(2, now, tmp, tmp2);
      SCM_perform_continuation_call(call); /* "Recur" */
    }
  }
}

void show_sum_continued(SCM_cont *cont, SCM_obj *result) {
  if (!fits_on_stack(SCM_CONT_SIZE(1) + SCM_CALL_SIZE(1) + SCM_FIXNUM_SIZE)) {
    SCM_save_call(1, &show_sum_continued, cont, result);
    SCM_minor_GC();
  } else {
    SCM_cont *now = SCM_make_cont(1, &SCM_print_number, cont);
    SCM_obj *tmp = SCM_fx_times(SCM_fx(2), result);
    SCM_call *call = SCM_make_call(1, now, tmp);
    SCM_perform_continuation_call(call);
  }
}

void SCM_print_number(SCM_cont *cont, SCM_obj *data) {
  if (!fits_on_stack(SCM_CALL_SIZE(1))) {
    SCM_save_call(1, &SCM_print_number, cont, data);
    SCM_minor_GC();
  } else {
    printf("%d\n", SCM_fx_to_integer(data));
    SCM_call *call = SCM_make_call(1, cont, SCM_VOID);
    SCM_perform_continuation_call(call);
  }
}

Whew! This program is quite a bit longer, but it isn't that different from the second program I showed you. The main change is that none of the continuation functions return anything. In fact, these functions, like Charlie in the M.T.A. song, never return. In the earlier version every function ended with a return statement; now they all end with an invocation of SCM_perform_continuation_call().

To make things worse, the allocating functions (SCM_make_cont(), SCM_make_call() and friends) now use alloca() to place objects on the stack. That means that the stack just keeps filling up like in the first compilation I showed you, so we're back to where we started! However, this program has one important addition: at the start of each continuation function we first check whether there's enough space left on the stack to accommodate the objects this function will allocate.

If there's not enough space, we re-create the SCM_call with which this continuation function was invoked using SCM_save_call(). This differs from SCM_make_call() in that it will not allocate on the stack, but will use a separate area to set aside the call object. The pointer to that area is stored in saved_cont_call.

SCM_save_call() can't allocate on the stack for a few reasons: The first and most obvious reason is that the saved call wouldn't fit on the stack because we just concluded it is already full. Second, the arguments to the call must be kept around even when the stack is blown away after the GC has finished. Third, this stored call contains the "tip" of the iceberg of live data from which the GC will start its trace. This is described in the next section.

After the minor GC has finished, we can jump back to the trampoline again. We use the setjmp() and longjmp() functions for that. When the first call to SCM_driver_loop() is made, it will call setjmp() to save all the CPU's registers to a buffer. This includes the stack and instruction pointers. Then, when the minor GC finishes, it calls longjmp() to restore those registers. Because the stack and instruction pointer are restored, this means execution "restarts" at the place in driver_loop() where setjmp() was invoked. The setjmp() then returns again, but now with a nonzero value (it was zero the first time). The return value is checked and the call is fetched from the special save area to get back to where we were just before we performed the GC.

This is half the magic, so please make sure you understand this part!

The minor GC

The long story above served to set up all the context you need to know to dive into the GC itself, so let's take a closer look at it.

Picking the "live" data from the stack

As we've seen, the GC is invoked when the stack has completely filled up. At this point, the stack is a complete mess: it has many stack frames from all the function calls that happened between the previous GC and now. These stack frames consist of return addresses for the C function calls (which we're not even using), stack-allocated C data (which we don't need) and somewhere among that mess there are some Scheme objects. These objects themselves also belong to two separate categories: the "garbage" and the data that's still being used and needs to be kept around (the so-called live data). How on earth are we going to pick only the interesting bits from that mess?

Like I said before, the saved call contains the "tip of the iceberg" of live data. It turns out this is all we need to get at every single object which is reachable to the program. All you need to do is follow the pointers to the arguments and the continuation stored in the call. For each of these objects, you copy them to the heap and if they are compound objects you follow all the pointers to the objects stored within them, and so on. Let's take a look at a graphical representation of how this works. In the picture below I show the situation where a GC is triggered just after the second invocation of calculate-sum (i.e., the first recursive call of itself, with the list '(2 3)):

After the initial shock from seeing this cosmic horror has worn off, let's take a closer look. It's like a box of tangled cords: if you take the time to carefully untangle them, it's easy, but if you try to do it all at once, it'll leave you overwhelmed. Luckily, I'm going to talk you through it. (by the way: this is an SVG so you can zoom in on details as far as you like using your browser's zooming functionality).

Let's start with the big picture: On the left you see the stack, on the right the heap after copying and in the bottom centre there's a small area of statically allocated objects, which are not subject to GC. To get your bearings, check the left margin of the diagram. I have attempted to visualise C stack frames by writing each function's name above a line leading to the bottom of its frame.

Let's look at the most recently called function, at the top of the stack. This is the function which initiated the minor GC: the second call to calculate_sum(). The shaded area shows the pointers set aside by SCM_save_call(), which form the tip of the iceberg of live data. More on that later.

The next frame belongs to the first call to calculate_sum(). It has allocated a few things on the stack. The topmost element on the stack is the last thing that's allocated due to the way the stack grows upwards in this picture. This is a pointer to an SCM_call object, marked with "[call]", which is the name of the variable which is stored there. If you go back to the implementation of calculate_sum(), you can see that the last thing it does is allocate an SCM_call, and store its pointer in call. The object itself just precedes the variable on the stack, and is marked with a thick white border to group together the machine words from which it is composed. From bottom to top, these are:

  • A tag which indicates that this is a call containing 2 arguments,
  • a pointer to an SCM_cont object (taken from the now variable),
  • a pointer to an SCM_obj object (the cdr of lst, taken from tmp) and
  • a pointer to an SCM_obj object (a fixnum, taken from tmp2).

Other compound objects are indicated in the same way.

You'll also have noticed the green, white and dashed arcs with arrow tips. These connect pointers to their target addresses. The dashed ones on the right hand side of the stack indicate pointers that are used for local variables in C functions or SCM_call objects. These pointers are unimportant to the garbage collector. The ones on the left hand side of the stack are pointers from Scheme objects to other Scheme objects. These are important to the GC. The topmost pointer inside the call object we just looked at has a big dashed curve all the way down to the cdr of lst, and the one below it points at the value of result, which is the fixnum 1.

If you look further down the stack, you'll see the show_sum procedure which doesn't really allocate much: an SCM_call, the initial intermediate result (fixnum 0), and two continuations (next and now in the C code). The bulk of the allocation happens in toplevel, which contains the call to show_sum and allocates a list structure. This is on the stack in reverse order: first the pair X = (3 . ()), then the pair Y = (2 . <X>) and the pair Z = (1 . <Y>). The () is stored as SCM_NIL in the static area, to which the cdr of the bottom-most pair object on the stack points, which is represented by a long green line which swoops down to the static area.

Copying the live data to the heap

The green lines represent links from the saved call to the live data which we need to copy. You can consider the colour green "contagious": imagine everything is white initially, except for the saved call. Then, each line starting at the pointers of the call are painted green. The target object to which a line leads is also painted green. Then, we recursively follow lines from pointers in that object and paint those green, etc. The objects that were already in the heap or the static area are not traversed, so they stay white.

When an object is painted green, it is also copied to the heap, which is represented by a yellow line. The object is then overwritten by a special object which indicates that this object has been moved to the heap. This special object contains a forwarding pointer which indicates the new location of the object. This is useful when you have two objects which point to the same other object, like for example in this code:

(let* ((a (list 3 2 1))
       (b (cons 4 a))
       (c (cons 4 a)))
  ...)

Here you have two lists (4 3 2 1) which share a common tail. If both lists are live at the moment of GC, we don't want to copy this tail twice, because that would result in it being split into two distinct objects. Then, a set-car! on a might only be reflected in b but not c, for example. The forwarding pointers prevent this from happening by simply adjusting a copied object's constituent objects to point to their new locations. Finally, after all data has been copied, all the newly copied objects are checked again for references to objects which may have been relocated after the object was copied.

The precise algorithm that performs this operation is very clever. It requires only two pointers and a while loop, but it still handles cyclic data structures correctly. The idea is that you do the copying I described above in a breadth-first way: you only copy the objects stored in the saved call (without touching their pointers). Next, you loop from the start of the heap to the end, looking at each object in turn (initially, those are the objects we just copied). For these objects, you check their components, and see whether they exist in the heap or in the stack. If they exist in the stack, you copy them over to the end of the heap (again, without touching their pointers). Because they are appended to the heap, the end pointer gets moved to the end of the last object, so the while loop will also take the newly copied objects into consideration. When you reach the end of the heap, you're done. In C, that would look something like this:

SCM_obj *obj, *slot;
int i, bytes_copied;
char *scan_start = heap_start;

for(i = 0; i < saved_object_count(saved_call); ++i) {
  obj = get_saved_object(saved_call, i);
  /* copy_object() is called "mark()" in CHICKEN.
     It also sets up a forwarding pointer at the original location. */
  bytes_copied = copy_object(obj, heap_end);
  heap_end += bytes_copied;
}

while(scan_start < heap_end) {
  obj = (SCM_obj *)scan_start;
  for(i = 0; i < object_slot_count(obj); ++i) {
    slot = get_slot(obj, i);
    /* Nothing needs to be done if it's in the heap or static area */
    if (exists_in_stack(slot)) {
      if (is_forwarding_ptr(slot)) {
        set_slot(obj, i, forwarding_ptr_target(slot));
      } else {
        bytes_copied = copy_object(slot, heap_end);
        set_slot(obj, i, heap_end);
        heap_end += bytes_copied;
      }
    }
  }
  scan_start += object_size(obj); /* object_size() is in bytes */
}

This algorithm is the heart of our garbage collector. You can find it in runtime.c in the CHICKEN sources in C_reclaim(), under the rescan: label. The algorithm was invented in 1970 by C.J. Cheney, and is still used in the most "state of the art" implementations. Now you know why Henry Baker's paper is called "Cheney on the M.T.A." :)

After the data has been copied to the heap, the longjmp() in minor_GC() causes everything on the stack to be blown away. Then, the top stack frame is recreated from the saved call. This is illustrated below:

Everything in the shaded red area below the stack frame for driver_loop() is now unreachable because there are no more pointers from live data pointing into this region of the stack. Any live Scheme objects allocated here have been copied to the heap, and all pointers which pointed there have been redirected to the new copies. Unfortunately, this stale copy of the data will permanently stick around on the stack, which means this data is forever irreclaimable. This means it is important that the entry point consumes as little stack space as possible.

The major GC

You might be wondering how garbage on the heap is collected. That's what the major GC is for. CHICKEN initially only allocates a small heap area. The heap consists of two halves: a fromspace and a tospace. The fromspace is the heap as we've seen it so far: in normal usage, this is the part that's used. The tospace is always empty.

When a minor GC is copying data from the stack to the fromspace, it may cause the fromspace to fill up. That's when a major GC is triggered: the data in the fromspace is copied to the tospace using Cheney's algorithm. Afterwards, the areas are flipped: the old fromspace is now called tospace and the old tospace is now called fromspace.

During a major GC, we have a slightly larger set of live data. It is not just the data from the saved call, because that's only the stuff directly used by the currently running continuation. We also need to consider global variables and literal objects compiled into the program, for example. These sorts of objects are also considered live data. Aside from this, a major collection is performed the same way as a minor collection.

The smart reader might have noticed a small problem here: what if the amount of garbage cleaned up is less than the data on the stack? Then, the stack data can't be copied to the new heap because it simply is too small. Well, this is when a third GC mode is triggered: a reallocating GC. This causes a new heap to be allocated, twice as big as the current heap. This is also split in from- and tospace. Then, Cheney's algorithm is performed on the old heap's fromspace, using one half of the new heap as tospace. When it's finished, the new tospace is called fromspace, and the other half of the new heap is called tospace. Then, the old heap is de-allocated.

Some practical notes

The above situation is a pretty rough sketch of the way the GC works in CHICKEN. Many details have been omitted, and the actual implementation is extremely hairy. Below I'll briefly mention how a few important things are implemented.

Object representation

You might have noticed that the stack grows pretty quickly in the CPS-converted C code I showed you. That's because the SCM_obj representation requires allocating every object and storing a pointer to it, so that's a minimum of two machine words per object.

CHICKEN avoids this overhead for small, often-used objects like characters, booleans and fixnums. It ensures all allocated objects are word-aligned, so all pointers to objects have their lower bits set to zero. This means you can easily see whether something is a pointer to an object or something else.

All objects in CHICKEN are represented by a C_word type, which is the size of a machine word. So-called immediate values are stored directly inside the machine word, with nonzero lower bits. Non-immediate values are cast to a pointer type to a C structure which contains the type tag and bits like I did in the example.

Calls are not represented by objects in CHICKEN. Instead, the C function is simply invoked directly by the continuation's caller. Continuations are represented like any other object. For didactic reasons I used a separate C type to distinguish them from SCM_obj, but in Scheme, continuations can be reified as first-class objects, so they shouldn't be represented in a fundamentally different way.

Closures

You might be wondering how closures are implemented, because this hasn't been discussed at all. The answer is pretty simple: in the example code, an SCM_call object stored a plain C function's address. We could store a closure instead: this is a new type of object which holds the C function plus its closed-over local variables. Each C function receives this closure as an extra argument (in the CHICKEN sources it is called self). When it needs a closed-over value, it can fetch it from the closure object.

Mutations

Another major oversight is the assumption that objects can only point from the stack into the heap. If Scheme were a purely functional language, this would be entirely accurate: new objects can refer to old objects, but there is no way a preexisting object can be made to refer to a newly created one. For that, you need to support mutation.

But Scheme does support mutation! So what happens when you use vector-set! to store a newly created, stack-allocated value in an old, heap-allocated vector? If we used the above algorithm, the newly created element would either be part of the live set and get copied without the vector's pointer being updated, or it wouldn't be part of the live set and the object would be lost when the stack is reset.

The answer to this problem is also pretty simple: we add a so-called write barrier. Whenever a value is written to an object, it is remembered. Then, when performing a GC, these remembered values are considered to be part of the live set, just like the addresses in the saved call. This is also the reason CHICKEN always shows the number of mutations when you're asking for GC statistics: mutation may slow down a program because GCs might take longer.

Stack size

How does CHICKEN know when the stack is filled up? It turns out that there is no portable way to detect how big the stack is, or whether it has a limit at all!

CHICKEN works around this simply by limiting its stack to a predetermined size. On 64-bit systems, this is 1MB, on 32-bit systems it's 256KB. There is also no portable way of obtaining the address of the stack itself. On some systems, it uses a small bit of assembly code to check the stack pointer. On other systems, it falls back on alloca(), allocating a trivial amount of data. The address of the allocated data is the current value of the stack pointer.

When initialising the runtime, just before the entry point is called, the stack's address is taken to determine the stack's bottom address. The top address is checked in the continuation functions, and the difference between the two is the amount of stack currently in use.
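The check itself can be sketched in a few lines of Python. The numbers and names here are made up; the real check lives in compiled C code, with the stack pointer obtained via assembly or the alloca() trick described above:

```python
# Abstract sketch of CHICKEN's stack check.

STACK_LIMIT = 1024 * 1024  # 1MB, as on 64-bit systems

def stack_check(stack_bottom, current_sp):
    # On most platforms the stack grows downwards; the amount in use
    # is the distance between the bottom address (taken at startup)
    # and the current stack pointer.
    used = abs(stack_bottom - current_sp)
    return used >= STACK_LIMIT  # True: stack is full, time for a minor GC
```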

A small rant

While doing the background research for this post, I wanted to read Cheney's original paper. It was very frustrating to find so many references to it, all of which lead to a paywall on the ACM website.

I think it's absurd that the ACM charges $15 for a paper which is over forty years old, and only two measly pages long. What sane person would plunk down 15 bucks to read 2 pages, especially if it is possibly outdated, or not even the information they're looking for?

The ACM's motto is "Advancing Computing as a Science & Profession", but I don't see how putting essential papers behind a paywall is advancing the profession, especially considering how many innovations now happen as unpaid efforts in the open source/free software corner of the world. Putting such papers behind a paywall robs the industry from a sorely-needed historical perspective, and it stifles innovation by forcing us to keep reinventing the wheel.

Some might argue that the ACM needs to charge money to be able to host high-quality papers and maintain its high quality standard, but I don't buy it. You only need to look at USENIX, which is a similar association. They provide complete and perpetual access to all conference proceedings, and the authors maintain full rights to their work. The ACM, instead, has now come up with a new "protection" racket, requiring authors to give full control of their rights to the ACM, or pay for the privilege of keeping the rights on their own work, which is between $1,100 and $1,700 per article.

On a more positive note, authors are given permission to post drafts of their papers on their own website or through the ACM's "Author-izer" service. Unfortunately, this service only works when the link is followed directly from the domain on which the author's website is located (it checks the Referer header). This is not how the web works: it breaks links posted in e-mail as well as links found through search engines.

Secondly, the ACM also allows its special interest groups to provide full access to the papers of the most recent conference. However, this doesn't seem to be encouraged in any way, and only a few SIGs seem to do it.

Luckily, I found a copy of the Cheney paper on some course website. So do yourself a favour and get it before it's taken down :(

Update: If you are also concerned about this, please take a small moment to add your name to this petition.

Further reading

Garbage collection is a fascinating subject whose roots can be traced back all the way to the origins of LISP. There's plenty of information to be found:


Today I'd like to talk about how CHICKEN Scheme handles distribution of language extensions (which we call "eggs"). There are some unique features of our setup that might appeal to users of other languages as well, and I think the way backwards compatibility was preserved is rather interesting.

In the beginning

First, a little bit of history, so you know where we're coming from. CHICKEN was initially released in the year 2000, and the core system was available as a tarball on the website. In 2002 it was moved into CVS and in 2004 to Darcs (yes, there were good open source DVCSes before Git).

Throughout this period, eggs were simply stored as tarballs (curiously bearing a ".egg" extension) in some well-known directory on the CHICKEN website. The egg installation tool had this location built in. For example, the egg named foo would be fetched from http://www.call-with-current-continuation.org/eggs/foo.egg.

To contribute (or update!) an extension, you simply sent a tarball to Felix and he would upload it to the site. This was a very centralised way of working, creating a lot of work for Felix. So in 2005, he asked authors to put all eggs in a version control system: Subversion. At the time, every contributor was given write access to the entire repo! These were simpler times, when we had only a handful of contributors.

The switch to Subversion allowed for a neat trick: whenever an egg was modified, it triggered a "post-commit hook" which tarred up the egg and uploaded it to the website. This was a very simple addition which automated the work done by Felix, while ensuring the existing tools did not have to be modified. Egg authors now had the freedom to modify their code as they liked, and new releases would appear for download within seconds.

If an author used the conventional trunk/tags/branches layout, the post-commit hook automatically detected this and would upload the latest tag. In other words, we reached a level of automation where "making a release" was simply tagging your code!

Documentation for eggs originally lived on the same website as the eggs did, but this was eventually moved into svnwiki, one of the first wikis to use Subversion as a backing store. To make things even simpler, the core system was also moved into Subversion. Now everything was in one system, for everyone to hack on, using the same credentials everywhere. Life was good!

Start of the DVCS wars

This worked great for years, and the number of contributors steadily increased. Meanwhile, distributed version control systems were gaining mainstream popularity, and contributors started experimenting with Git, Mercurial, Bazaar and Fossil. People grumbled that CHICKEN was still on Subversion.

The next major release, CHICKEN 4.0, provided for a "clean slate", with the opportunity to rewrite the distribution system. This simplified things, replacing the brittle post-commit script with a CGI program called "henrietta", which would serve the eggs via HTTP. The download location for eggs was put into a configuration file, which allowed users to host their own mirror. This is useful if for example a company wants to set up a private deployment-server containing proprietary eggs. We also gained a mirror for general use, graciously provided by Alaric.

The difference was that now there was no static tarball, but when you downloaded an egg, its files would be served straight from either svn, a local directory tree or a website. If we ever decided to migrate the egg repository to a completely different version control system, we could simply add a new back-end to Henrietta. Nothing would have to be modified on the client.

The new system

In 2009, CHICKEN core was moved into a Git repository, as it looked like Git was winning the DVCS wars. New users were often complaining about having to use crusty old Subversion. By this time, people even used DVCSes exclusively, only synchronising to the svn repo. This meant it was no longer the "canonical" repository for all eggs. It was becoming nothing but a hassle for those who preferred other VCSes.

Another problem was that we still had a maintenance burden: commit access on the svn repo is centrally managed, through one big mod_authz_svn configuration file, listing which users have access to which "sub-repositories". If someone wants to grant commit access to another developer, this has to be requested via the mailing list or the server's maintainer.

Requirements

To solve these problems, we started to consider new ways to allow users to develop their eggs using their favorite VCS. The new system had a few strict requirements:

  • It had to be completely backwards-compatible. No changes should be made to CHICKEN core. New eggs published through this system should be available to older CHICKENs, too.
  • It had to be completely VCS-independent. We want to avoid extra work when the next VCS fad comes along. Furthermore, it should work with all popular code hosters, for maximum freedom of choice. Self-hosting should explicitly be an option.
  • The existing workflow of egg authors should not fundamentally change; especially the release procedure of making a tag should stay.
  • There should be a way to avoid broken links if someone takes down their repo.
  • Most of all, the system had to be simple.

A simple solution

The simplest way to make the distribution system VCS-independent is to ignore VCSes altogether! Instead, we download source files over HTTP and mirror them from the CHICKEN server.

This idea was rather natural: our Subversion setup had always allowed direct access to plain files over HTTP through mod_dav_svn. Most popular code hosting sites (Github, Bitbucket, Google Code etc) also allow this, either directly or via some web repo viewer's "download raw file" link, which can be constructed from a VCS tag and file name. Also, Henrietta already supported serving eggs from a local directory tree which meant we had to make almost no modifications to our existing tool chain.

To make this work, all that's needed is:

  • Some daemon which periodically fetches new eggs.
  • A "master list" of where each egg is hosted.
  • For each egg, a list of released versions for that egg.
  • A "base URI" where the files for that release can be downloaded.
  • A list of files for that release (or the name of a tarball, which is equivalent).

We already had a so-called ".meta-file", which contains info about the egg (author, license, name, category etc). In an earlier incarnation of the post-commit hook this file also contained a list of the files that the egg consisted of, so it made sense to re-use this facility.

We only needed to take care of the daemon, the master egg list and a way to communicate the base URI. This was simple, and I wrote the daemon (dubbed "henrietta-cache") over a weekend during a hackathon. It really is simple and consists of only 300+ lines of (rather ugly) Scheme code. At the hackathon, Moritz helped out by moving the existing eggs to this new scheme, and testing with various hosting providers.
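To give an impression of what such a daemon does, here's a drastically simplified Python sketch. The release-info structure and URL scheme shown here are invented placeholders, not the actual formats henrietta-cache uses:

```python
# A drastically simplified sketch of a cache daemon like henrietta-cache.

def sync_eggs(master_list, fetch):
    """master_list maps egg name -> location of its release info;
    fetch(url) returns the resource at that URL."""
    mirror = {}
    for egg, info_url in master_list.items():
        info = fetch(info_url)  # released versions, base URI, file list
        for version in info["releases"]:
            base = info["base-uri"].format(egg=egg, version=version)
            for fname in info["files"]:
                # Mirror every file of every release locally, so eggs
                # stay available even if the upstream repo goes away.
                mirror[(egg, version, fname)] = fetch(base + "/" + fname)
    return mirror
```

Run periodically, a loop like this is all that's needed to keep the server's copy in sync with wherever the authors happen to host their code.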

But not the simplest solution

The clever reader has probably already noted that the setup could be simplified by putting the henrietta-cache logic into the client program. We chose not to do this because it would break two requirements: that of backwards compatibility and that of avoiding broken links.

Strictly speaking, the backwards compatibility problem could be solved by embedding the functionality into chicken-install and eventually removing henrietta-cache from the server.

Broken links are a bigger problem, though. Currently, if a repo becomes unavailable, this is no problem; we still have a cached copy on our servers. Even if the repo goes offline forever and nobody has a copy of it anymore, we can still import the cached files into a fresh repo and take over maintenance from there.

Some incremental improvements

Unfortunately, the new system made it easier for Github and Bitbucket users than for CHICKEN Subversion users to maintain their eggs, because these sites allow tarball downloads, while the Subversion users had to list each file in their egg in the meta file. Under the old system this was not required, because it simply offered the entire svn egg directory for download.

After some people complained about having to do this extra step, I wrote another simple "helper" egg with the tongue-in-cheek name "pseudo-meta-egg-info". This is a small (80 lines) Spiffy web application which can generate "pseudo" meta files containing a full list of all the files in a Subversion subdirectory, and a list of all the tags available. This all happens on-the-fly, which means that egg authors could now revert to their old workflow of simply tagging their egg to make a release!

Technically, this helper webapp can be extended and deployed for any hosting site, so if you decide to host your own repository it could generate the list of tags and files for that, too. CHICKEN isn't big enough to ask Google, Github or Bitbucket to run this on their servers, of course, so some helper plug-ins and shell scripts for svn, hg and git were made as well. These will generate the list of tags and file names and put them in the meta- and release-info files.

Current status

The new system has been in use for over two years (since March 29th, 2011) and it has been doing a good job, requiring only very little maintenance and few modifications after the initial release. We've already reaped the benefits of our setup: Github and Bitbucket both had several periods of downtime, during which eggs were still available, even if they were hosted there.

The following graph shows the number of available CHICKEN eggs, starting with the "release 4" branch (requires an SVG-capable browser). There's a small skew because the script I used to generate the graph only checked for existence, not whether the egg was released.

As you can see, Mercurial (hg) and Git took off almost simultaneously, but where git is still steadily increasing in popularity, hg mostly stagnated. Subversion (svn) saw a few drops from eggs that were moved into hg/git. You'd guess that most git users would use Github, but it turns out that Bitbucket is reasonably popular among Chicken users too. We also have three authors who have opted to host their own repositories. You can see this in the breakdown of today's eggs by host:

Hosting site                VCSes     # of eggs
code.call-cc.org            svn             454
github.com                  git              85
bitbucket.org               hg, git          41
gitorious.org               git               5
chust.org                   fossil            5
kitten-technologies.co.uk   fossil            3
code.stapelberg.de          git               1

Finally, the graph shows that people are still releasing new eggs from svn, but most new development takes place in git. And yes, there are a few eggs in Fossil, too! Bazaar is currently not listed. One possible explanation is that Loggerhead (its web viewer) does not allow easy construction of stable URLs to raw files for a particular tag (or zip file/tarball), so serving up eggs straight from a repo is not possible. Another reason could be that bzr simply isn't that popular among CHICKEN users. If you're a bzr user and would like to use this distribution scheme, please have a look at Loggerhead issues #473691 and #739022. If you know a way around this, please share your knowledge on our release instructions page.

Things to improve

Needless to say, I'm rather happy that the system satisfied all the requirements we set for it, and that it saw such uptake. The majority of newly released eggs are using one of the new systems (too bad it's Git, but I guess that's inevitable).

However, as always, there is room for improvement. The current system has a few flaws, all of which are due to the fact that henrietta-cache simply copies code off an external site:

  • There's no "stable" tarball per egg release. This is required for OS package managers, which usually verify with a checksum that the source package has not changed. Recently, Mario improved on this situation by providing tarballs, but these are merely tarballs of the henrietta-cache mirror on that particular server, so they can't truly be guaranteed to be stable...
  • If an egg author moves tags around, nobody will know. Different henrietta-cache mirrors may then have an inconsistent view of the distributed repository. We have two egg mirrors, and so far this has happened once or twice. This requires some manual intervention: just blow away all the cached files and wait for it to re-synch, or trigger the synch manually.
  • Egg authors cannot sign their eggs; each egg is downloaded from a source that may not be trustworthy. This is tricky, especially because most people don't want to mess around with PGP keys anyway. CHICKEN core releases aren't signed either, so this isn't very high on our priority list.

I think some of these problems are a result of "going distributed", similar to the problem that you should not rewrite history that has already been pushed.


Recently there was a small flame war on the Chicken-hackers mailing list. A user new to Scheme asked an innocuous question that drew some heated responses:

 Is there a good reason for this behavior?
 # perl -e 'print substr("ciao",0,10);'
 ciao 
 # ruby -e 'puts "ciao"[0..10]'
 ciao
 # python -c 'print "ciao"[0:10];'
 ciao
 # csi -e '(print (substring "ciao" 0 10))'
 Error: (substring) out of range 0 10

Some popular dynamic languages have a generic "slice" operator which allows the user to supply an end index that's beyond the end of the object, and it'll return from the start position up until the end. Instead, Chicken (and most other Schemes) will raise an error.

On the list, I argued that taking characters 0 through 10 from a 3-character string makes no bloody sense, which is why it's signalling an error. For the record: this can be caught by an exception handler, which makes it a controlled error situation, not a "crash".

Our new user retorted that it's perfectly sane to define the substring procedure as:

 Return a string consisting of the characters between the start position
 and the end position, or the end of the string, whichever comes first.

I think this is a needlessly complex definition. It breaks the rule "do one thing and do it well", from which UNIX derives its power: Conceptually crisp components ease composition.

One of the most valuable things a programming language can offer is the ability to reason about code with a minimum of extra information. This is also why most Schemers prefer a functional programming style; it's easier to reason about referentially transparent code. Let's see what useful facts we can infer from a single (substring s 0 10) call:

  • The variable s is a string.
  • The string s is at least 10 characters long.
  • The returned value is a string.
  • The returned string is exactly 10 characters long.

If either of the preconditions doesn't hold, it's an error situation and the code following the substring call will not be executed. The above guarantees also mean, for example, that if later you see (string-ref s 8) this will always return a character. In "sloppier" languages, you lose several of these important footholds. This means you can't reason so well about your code's correctness anymore, except by reading back and dragging in more context.

Finally, it is also harder to build the "simple" version of substring on top of the complex one than it is to build the complex one as a "convenience" layer on top of the simpler one. On our list it was quickly shown that it's trivial to do so:

(define (substring/n s start n)
  (let* ((start (min start (string-length s)))
         (end (min (string-length s) (+ start n))))
    (substring s start end)))

;; Easy to use and re-use:
(substring/n "ciao" 1 10) => "iao"

There's even an egg for Chicken called slice which provides a generic procedure which behaves like the ranged index operator in Python/Ruby.
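The same layering argument can be sketched in Python: the strict version checks its preconditions, and the lax version is a thin wrapper that clamps the range first. Going the other way, recovering the strict version from a lax one, would mean re-doing all the bounds checks by hand. The names here are my own; only the strict behaviour corresponds to CHICKEN's substring:

```python
# A strict substring with explicit preconditions, plus a lax
# "convenience" wrapper built on top of it.

def substring(s, start, end):
    # Preconditions: both indices in range, start not past end.
    if not (0 <= start <= end <= len(s)):
        raise IndexError("substring out of range: %d %d" % (start, end))
    return s[start:end]

def substring_n(s, start, n):
    # The lax version: clamp the range to the string's bounds first.
    start = min(start, len(s))
    end = min(len(s), start + n)
    return substring(s, start, end)
```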

A tangential rant on the hidden costs of sloppiness

The difference in behaviour between these languages is not a coincidence: it's a result of deep cultural differences. The Scheme culture (and in some respects the broader Lisp culture) is one that tends to prefer correctness and precision. This appears in many forms, from Shivers' "100% correct solution" manifesto to Gabriel's Worse Is Better essay and all the verbiage dedicated to correct "hygienic" treatment of macros.

In contrast, some cultures prefer lax and "do what I mean" over rigid and predictable behaviour. This may be more convenient for writing quick one-off scripts, but in my opinion this is just asking for trouble when writing serious programs.

Let's investigate some examples of the consequences of this "lax" attitude. You're probably aware of the recent discovery of several vulnerabilities in Ruby on Rails. Two of these allowed remote code execution simply by submitting a POST request to any Rails application. As this post explains, the parser for XML requests was "enhanced" to automatically parse embedded YAML documents (which can contain arbitrary code). My position is that YAML has absolutely nothing to do with XML (or JSON), which means that if a program wants to parse YAML embedded in XML it must do that itself, or at least explicitly specify it wants automatic type conversions in XML/JSON documents. The Rails developers allowed misplaced convenience and sloppiness to trump precision and correctness, to the point that nobody knew what their code really did.

Another example would be the way PHP, Javascript, and several other languages implicitly coerce types. You can see the hilarious results of the confusion this can cause in the brilliant talk titled "Wat". There are also people filing bug reports for PHP's == operator. Its implicit type conversion is documented, intended, behaviour but it results in a lot of confusion and, again, security issues, as pointed out by PHP Sadness #47. If you allow the programmer to be sloppy and leave important details unspecified, an attacker will gladly fill in those details for you.

Some more fun can be had by looking at the MySQL database and how it mangles data. The PostgreSQL culture also strongly prefers correctness and precision, whereas MySQL's culture is more lax. The clash between these two cultures can be seen in a thread on the PostgreSQL mailinglist where someone posted a video of a comparison between PostgreSQL and MySQL's behaviour. These cultural differences run deep, as you can tell by the responses of shock. And again, the lax behaviour of MySQL has security implications. The Rails folks have discovered that common practices might allow attackers to abuse MySQL's type coercion. Because Rails supports passing typed data in queries, it's possible to force an integer in a condition that expects a string. MySQL will silently coerce non-numerical strings to zero:

SELECT * FROM `users` WHERE `login_token` = 0 LIMIT 1;

This will match the first record (which usually just happens to be the administrative account). Just as with the innocent little substring behaviour we started our journey with, it is possible to work around this, but things would be a lot easier if the software behaved more rigidly and strictly, so that this kind of conversion would only be done at the explicit request of the programmer.
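One way to make such conversions explicit is to check each parameter against the type its column expects before any SQL is built at all. Here's a Python sketch of that idea; the EXPECTED table and check_params are invented for illustration and have nothing to do with how Rails actually addressed this:

```python
# Sketch of refusing silent coercion: validate each parameter against
# the type the column expects before building any query.

EXPECTED = {"login_token": str}  # hypothetical schema knowledge

def check_params(params):
    for column, value in params.items():
        want = EXPECTED[column]
        if not isinstance(value, want):
            # Fail loudly instead of letting the database coerce 0
            # into matching every non-numeric string.
            raise TypeError("column %s expects %s, got %s" %
                            (column, want.__name__, type(value).__name__))
    return params
```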

Incidentally, it is possible to put MySQL into a stricter "SQL mode":

SET sql_mode='TRADITIONAL';

This is rarely done, probably because most software somehow implicitly relies on this broken behaviour. By the way, does anyone else think it's funny that this mode is called "traditional"? As if it were somehow old-fashioned to expect precise and correct behaviour!

Take back control

It is high time people realised that implicit behaviour and unclear specifications are a recipe for disaster. Computers are by nature rigid and exact. This is a feature we should embrace. Processes in the "real world" are often fuzzy and poorly defined, usually because they are poorly understood. As programmers, it's our job to keep digging until we have enough information to describe the task to a computer. Making APIs fuzzier is the wrong response to this problem, and a sign of weakness. Do you prefer to know exactly what your program will do, or would you rather admit defeat and allow fuzziness to creep into your programs?

In case you're wondering, this rant didn't come out of the blue. One of three reasons this blog is called more magic is as a wry reference to the trend of putting more "magic" into APIs, which makes them hard to control. This is a recurring frustration of mine and I would like to see a move towards less magic of this kind. Yeah, I'm a cynical bastard ;)


I've finally decided to get a proper domain name: http://www.more-magic.net. Please update your bookmarks and feed readers!

I used to run this blog on a hostname from the good folks at DynDNS, which I registered in my college days. DynDNS had the benefit of being 100% free (great for poor college students!), but the disadvantage of having to run a tool called ddclient. This tool is intended to update DNS entries for hosts with dynamically assigned IP address, and if you don't run it, your hostname will expire.

Occasionally ddclient gets "stuck", not performing updates anymore. This happens unnoticeably, until you get an e-mail from DynDNS stating that your domain will expire in 5 days unless you click the reactivation link and restart ddclient. The hassle of this, the risk of ddclient getting stuck at a bad time, and the unprofessional look of running under a domain that's obviously not your own (and harder to remember) finally got me to consider paying for a proper domain. So there you have it: more-magic.net :)


Last time I explained how sloppy representations can cause various vulnerabilities. While doing some research for that post I stumbled across NUL byte injection bugs in two projects. Because both have been fixed now, I feel like I can freely talk about them with a clear conscience.

These projects are Chicken Scheme and the C implementation of Ruby. The difference in the way these systems deal with NUL bytes clearly shows the importance of handling security issues in a structural way. We'll also see the importance of truly grokking the problem when implementing a fix.

A quick recap

Remember that C uses NUL bytes to delimit strings. Many other languages store the length of the string instead. In these languages, NUL bytes can occur inside strings. This can cause unintended reinterpretation when strings cross the language border into C.
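The mismatch can be shown in miniature with Python, whose strings are length-delimited. The c_view helper below is invented; it simulates what a C function like strlen() effectively sees by cutting the string at the first NUL byte:

```python
# Python strings are length-delimited, so they may contain NUL bytes.
# Once such a string crosses the border into C, everything after the
# first NUL byte silently disappears.

def c_view(s):
    # What a NUL-terminated consumer of this string effectively sees.
    return s.split("\0", 1)[0]

s = "hello\0there"  # 11 characters to the host language...
# ...but only "hello" (5 characters) to C.
```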

In my previous post I already pointed out how Chicken automatically prevents this reinterpretation in its foreign function interface (FFI). You just describe to Scheme that your C function accepts a string, and it will take care of the rest:

(define my-length (foreign-lambda int "strlen" c-string))

;; Prints 12:
(print (my-length "hello, there"))

;; Raises an exception, showing the following message:
;; Error: (##sys#make-c-string) cannot represent string with NUL
;;   bytes as C string: "hello\x00there"
(print (my-length "hello\x00there"))

The FFI's feature of automatically checking for NUL bytes in strings before passing them on to C was only added in late 2010 (Chicken 4.6.0). However, because everything uses this interface, this mismatch could easily be fixed, in a central location, securing all existing programs in one fell swoop.

Now, you may be thinking "well, that's nothing special; it's good engineering practice that there must be a single point of truth, and that you Don't Repeat Yourself". And you'd be right! In fact, this is a key insight: solid engineering is a prerequisite to secure engineering. It can prevent security bugs from happening, and help to fix them quickly once they are discovered. A core tenet of "structural security" is that without structure, there can be no security.

When smugness backfires

To drive home the point, let's take a look at what I discovered while writing my previous blog post. After describing Chicken's Right Way solution and feeling all smug about it, I noticed an embarrassing problem: for various reasons (some good, others less so), there are places in Chicken where C functions are called without going through the FFI. Some of these contained hand-rolled string conversions!

It turns out that we overlooked these places when first introducing the NUL byte checks, and as a consequence several critical procedures (standard R5RS ones like with-input-from-file) were left vulnerable to exactly this bug:

;; This program outputs "yes" twice in Chickens < 4.8.0
(with-output-to-file "foo\x00bar" (lambda () (print "hai")))
(print (if (file-exists? "foo") "yes" "no"))
(print (if (file-exists? "foo\x00bar") "yes" "no"))

To me, this just validates the importance of approaching security measures in a structural rather than an ad-hoc way; the bug was only in those parts of the code that didn't use the FFI. Deviation from a rule is where bugs are often found!

You can also see that we fixed it as thoroughly as possible, especially given the at times awkward structure of the Chicken code. We commented every special situation extensively, assigned a new error type C_ASCIIZ_REPRESENTATION_ERROR for this particular error, and added regression tests for at least each class of functionality (string to number conversion, file port creation, process creation, environment access, and low-level messaging functionality). There's definitely room for improvement here, and I hope to one day reduce the special cases to the bare minimum. By documenting special cases it's easy to avoid introducing new problems. It also makes them easier to find when refactoring. The tests help there too, of course.

When you run the above program in a Chicken version with the fix, it behaves as expected:

 Error: cannot represent string with NUL bytes as C string: "foo\x00bar"

Another approach

The Ruby situation is a little more complicated. It has no FFI but a C API, so it works the other way around: you write C to interface "up" into Ruby. It has a StringValueCStr() macro, which is documented as follows (sic):

 You can also use the macro named StringValueCStr(). This is just
 like StringValuePtr(), but always add nul character at the end of
 the result. If the result contains nul character, this macro causes
 the ArgumentError exception.

However, this isn't consistently used in Ruby's own standard library:

File.open("foo\0bar", "w") { |f| f.puts "hai" }
puts File.exists?("foo")
puts File.exists?("foo\0bar")

In Ruby 1.9.3p194 and earlier, this shows the following output, indicating it's vulnerable:

 true
 test.rb:4:in `exists?': string contains null byte (ArgumentError)
         from test.rb:4:in `<main>'

It turns out that internally, Ruby strings are stored with a length, but also get a NUL byte tacked onto the end, to prevent copying when calling C functions. This performance hack undermines the safety of Ruby to C string conversions, and is the direct cause of these inconsistencies. True, there is a safe function that extracts the string while checking for NUL bytes, but there are also various ways to bypass this, and if you accidentally use the wrong macro to extract the (raw) string, your code won't break. Of course, this is only true for benign inputs...

The complexity of Ruby's implementation makes it hard to ensure that it's safe everywhere. Indeed, the various places where strings are passed to C all do it differently. For example, the ENV hash for manipulating the POSIX environment has its own hand-rolled test for NUL, which you can easily verify; it produces a different error message than the one exists? gave us earlier:

irb(main):001:0> ENV["foo\0bar"] = "test"
ArgumentError: bad environment variable name

There is no reason this couldn't just use StringValueCStr(). So, even though Ruby has this safe macro, which provides a mechanism to check for poisoned NUL bytes in strings, it's rarely used by Ruby's own internals. This could be fixed just like Chicken; here too, the best way to do that would be to generalize and eliminate all special cases. Simpler code is easier to secure.

A fundamental misunderstanding

When I reported the bug in the File class to the Ruby project, they quickly had a fix, but unfortunately they seemed uninterested in going through Ruby's entire code to fix all string conversions (quoting from private e-mail conversation):

 > I agree that this looks like a good place to fix the File/IO
 > class, but there are many other places where strings are passed to C.
 > Are all of those secured?
 All path names should be converted with "to_path" method if possible.
 If any methods don't obey the rule, it is another bug.  Please let us
 know if you find such case.

In retrospect, there is the possibility that I didn't quite make myself clear enough. Perhaps this person thought I was referring to other path strings in the code. However, to me it sounds a lot like they made the same conceptual mistake that the PHP team made when they "fixed" NUL injections.

The PHP solution was to add a special "p" flag for converting path strings. This happens for all PHP functions declared in C (via zend_parse_parameters()). By the way, notice how this is a new flag. There probably are tons of PHP extensions out there which aren't using this flag yet. Also, who can verify that they managed to find all the strings in PHP which represent paths?

The PHP team completely missed the point here. The fix means that path arguments aren't allowed to have embedded NUL bytes, but other string arguments are not checked. They missed the fact that this isn't just a path issue. Rather, as I described before, it's a fundamental mismatch at the language boundary, where strings are translated from the host language to C. However, there seems to be a widespread belief that this can only be exploited through path strings.
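The mismatch itself is easy to demonstrate from Ruby using the standard Fiddle library. This is a sketch; it assumes a C strlen() is reachable in the current process, which holds on typical Linux and macOS systems:

```ruby
require "fiddle"

libc   = Fiddle.dlopen(nil)  # handle to the current process (includes libc)
strlen = Fiddle::Function.new(libc["strlen"],
                              [Fiddle::TYPE_VOIDP],
                              Fiddle::TYPE_SIZE_T)

s = "visible\0hidden"
puts s.bytesize      # Ruby sees all 14 bytes, NUL included
puts strlen.call(s)  # C stops at the NUL byte and reports only 7
```

The same byte string is two different strings depending on which side of the language boundary you ask, and that disagreement is exactly what a NUL injection exploits, whether the string is a path or not.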

I'm not entirely sure why this is, but I can guess. First off, "poisoned NUL byte" attacks have been popularized by a 1999 Phrack article. This article shows a few attacks, but only the path examples are really convincing. Of course, another reason is that injecting NUL bytes in path strings really is the most obvious and practical way to exploit web scripts.

Recently, however, different NUL byte attacks have been documented. For example, they can be used to truncate LDAP and SQL queries and to bypass regular expression filters on SQL input, but you could argue these are all examples of failure to escape correctly. I found a more convincing example in the (excellent!) book The Tangled Web: it contains a one-sentence warning about using HTML sanitization C libraries from other languages. Also, NUL bytes can sometimes be used to hide attacks from log files.

However, the most impressive recent exploit is without a doubt this common vulnerability in SSL certificate verification systems. In this attack, an embedded NUL byte causes a certificate to be accepted for "www.paypal.com", while the CN (Common Name) section (that is, the server's hostname) actually contains the value "www.paypal.com\0.thoughtcrime.org". Certificate authorities generally just accepted this as a valid subdomain of "thoughtcrime.org", ignoring the NUL byte. Client programs (like web browsers) tended to use C string comparison functions, which stop at the NUL byte. Luckily, this was widely reported, and has been fixed in most programs.
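The effect of such a forged CN can be sketched in a few lines of Ruby (the hostnames are the ones from the published attack; the "C view" is simulated by cutting the string at the first NUL, which is what a strcmp()-style comparison effectively sees):

```ruby
cn       = "www.paypal.com\0.thoughtcrime.org"  # CN from the forged certificate
expected = "www.paypal.com"                     # hostname the client connected to

# A C string comparison only sees the bytes up to the first NUL,
# so the forged certificate is wrongly accepted:
c_view = cn[/\A[^\0]*/]
puts c_view == expected   # true

# A comparison over the full byte string catches the forgery:
puts cn == expected       # false
```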

I believe that NUL byte mishandling represents a big and mostly untapped source of vulnerabilities. High-level languages are gaining popularity over C for client-side programs, but many crucial libraries are still written in C. This combination means that the problem will grow unless this is structurally fixed in language implementations.


Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 3.0 License. All code fragments on this site are hereby put in the public domain.