
I just submitted a patch to add a statistical profiler to CHICKEN. In this post, I'll explain how it works. It's easier than you'd think!

Instrumentation-based profiling

There are two major ways to profile a program. The first way has been supported by CHICKEN for as long as I can remember: You add instrumentation to each procedure. This counts how often the procedure is called, and how much time is spent in it.

You've probably done this by hand: You check the clock before and after calling a procedure, and print the difference. This can be useful when whittling down a specific procedure's run time. But when you want to know where the bottlenecks are in the first place, it's less practical. You don't want to manually add this stuff to all your procedures!

In order to easily instrument each procedure, you'll need language support, either in the compiler or in the run time. Unfortunately, the instrumentation itself will cause your program to slow down: all this tracking takes some time! That's why CHICKEN's profiler is part of the compiler: Instrumentation is emitted only when you compile with -profile. This option adds wrappers around each procedure, which look like this:

;; Original source code:
(define (foo a b c)
  (print (+ a b) c))

;; Instrumented version created by -profile:
(define foo
  (lambda args
    (dynamic-wind  ; Explained later
      (lambda () (##sys#profile-entry 0 profile-info-1234.4567))
      (lambda () (apply (lambda (a b c) (print (+ a b) c)) args))
      (lambda () (##sys#profile-exit 0 profile-info-1234.4567)) ) ) )

In the above code, ##sys#profile-entry starts the clock and increments the call count for this procedure, and ##sys#profile-exit stops the clock. The profile-info-1234.4567 is a global vector in which each procedure of this compilation unit is assigned a unique position.

To create the vector and assign the positions, a prelude is added to the compilation unit. This defines the vector and registers a position for each procedure:

;; Prelude for entire program:
(define profile-info-1234.4567 (##sys#register-profile-info 1 #t))
(##sys#set-profile-info-vector! profile-info-1234.4567 0 'foo)

;; Not exactly true, but let's pretend because it's close enough:
(on-exit ##sys#finish-profile)

This simply creates a vector of size one, and assigns the foo procedure to position zero. Then it requests ##sys#finish-profile to run when the program exits. This will write profile information to disk on exit.

We need dynamic-wind, but it creates problems

If you're not familiar with it, dynamic-wind is a sort of "guard" or "try/finally". Whenever the second lambda is entered, the first lambda ("before") is invoked, and when it is left, the third lambda ("after") is invoked. To understand why we need a dynamic-wind in the code presented earlier, consider the naive, incorrect implementation:

;; INCORRECT expansion:
(define foo
  (lambda args
    (begin
      (##sys#profile-entry 0 profile-info-1234.4567)
      (apply (lambda (a b c) (print (+ a b) c)) args)
      ;; Not reached if + throws an exception.  This
      ;; will happen when a or b are not numbers!
      (##sys#profile-exit 0 profile-info-1234.4567) ) ) )

Here, the "after" part will be skipped if the procedure raises an exception. Furthermore, a continuation might be captured or called in the procedure, even multiple times. This would cause the code to jump in and out of the procedure without neatly going over the before/after bits every time. Because of this, the profiler would miss these exits and re-entries, and hence it would not be able to accurately keep track of the time actually spent in this procedure.

The code with dynamic-wind will take care of non-local exits, stopping the clock whenever we jump out of the procedure, and starting it when we jump in again via a captured continuation.

While dynamic-wind is necessary, it also implies quite a heavy hit on performance: dynamic-wind isn't cheap. Even worse is the fact that it prevents the compiler from inlining small procedures in larger ones. Furthermore, the use of apply implies we'll need to cons up an argument list. Normally, arguments aren't put into a Scheme list, because doing so results in more garbage being created. This means that the performance shape of the profiled application can be quite different from the original, non-profiled application!

Statistical profiling

I've always wanted to look into fixing the profiler, but never had the energy to do so. Now that Felix wrote an entire graphical debugger for CHICKEN, I thought maybe I could lean on the debugger's infrastructure to make a better profiler. But it turned out I didn't have to!

First, I should explain statistical profiling. The basic idea is that the process is periodically sampled while it's running. These samples are taken by inspecting the instruction pointer and mapping it to a procedure. If you do this often enough (every 10 ms or so), you can get a pretty good idea of where the program is spending most of its time.

CHICKEN's trace buffer

Looking at the instruction pointer or C function is not very useful in CHICKEN, unless you like to pore over endless piles of machine-generated C code and mentally map it back to Scheme. It can be educational and even fun, just like it can be fun and educational to read the assembly output of a C compiler, but it is generally unproductive and gets frustrating quickly.

So how can we take a snapshot of what a Scheme program is doing at any given time? It turns out that CHICKEN already does this: when a Scheme program raises an exception, the interpreter will show you a call trace. This is a bit like a stack trace in a "traditional" language, only it shows a trace of the execution flow that led to the error. Let's look at a contrived example:

(define (fib n)
  (if (< n 2)
      n
      (+ (fib (- n 1)) (fib (- n 2))) ) )

(define (run-fib)
  (let* ((n (string->number (car (command-line-arguments))))
         (result (fib n)))
    ;; This line is wrong: It tries to append a number to a string
    (print (string-append "Result: " result)) ) )

(run-fib)

When you compile the program and run it, you'll see the output:

 $ csc -O3 fib.scm
 $ ./fib 3
 
 Error: (string-append) bad argument type - not a string: 2
 
         Call history:
         
         fib.scm:12: run-fib
         fib.scm:7: command-line-arguments
         fib.scm:8: fib
         fib.scm:4: fib
         fib.scm:4: fib
         fib.scm:4: fib
         fib.scm:4: fib
         fib.scm:10: string-append               <--

In a stack trace you would see only two lines, mentioning run-fib and string-append. Here we can see the trace of execution through the program, where it entered run-fib, then called command-line-arguments, proceeded to invoke fib five times in total, and finally it invoked string-append with the result.

There are advantages and disadvantages to either approach, but a trace makes more sense in languages with full continuations. It is the natural thing to do in a compiler that converts all programs to continuation-passing style. It does tend to confuse beginners, though!

So, these trace points are already inserted into every program by default. This happens even in programs optimised with -O3, because trace points have very little overhead: It's just a pointer into a ring buffer that gets updated to point to the procedure's name. You can choose to omit traces completely via -no-trace or -d0, but that's not the default.

Trace points are a good fit for our profiler: when taking a sample, we can simply take a look at the newest entry in the trace buffer, which will always reflect the procedure that's currently running!
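
To make this a bit more concrete, here is a hypothetical sketch of what a trace point and the "newest entry" lookup could look like. The buffer layout and the names are made up for illustration; only get_topmost_trace_entry() corresponds to the (equally simplified) helper used in the signal handler shown further down.

#define TRACE_BUFFER_SIZE 16

static char *trace_buffer[TRACE_BUFFER_SIZE]; /* ring of procedure-name pointers */
static int trace_buffer_top = 0;              /* index of the newest entry */

/* A trace point emitted at the start of a procedure boils down to this: */
void add_trace_entry(char *procedure_name)
{
  trace_buffer_top = (trace_buffer_top + 1) % TRACE_BUFFER_SIZE;
  trace_buffer[trace_buffer_top] = procedure_name;
}

/* The profiler only ever needs to peek at the newest entry: */
char *get_topmost_trace_entry(void)
{
  return trace_buffer[trace_buffer_top];
}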

Setting up a sampler

So how can we interrupt the program at a point in time, without messing with the program's state? This is simple: we ask for a signal to be delivered periodically. There's even a dedicated signal reserved for this very task: SIGPROF.

We'll use setitimer() to set up the timer which causes the signal to be delivered, even though POSIX says it's obsolete. It is much more convenient and more widely supported than the alternative, timer_create() plus timer_settime(). We can always switch when setitimer() is removed from an actual POSIX implementation.

The following setup code is simplified a bit (most importantly, error handling is omitted):

C_word CHICKEN_run(void *toplevel) { /* Initialisation code for CHICKEN */
  /* ... a lot more code ... */

  if (profiling) {
    struct itimerval itv;
    time_t freq = 10000;                   /* 10 msec (in usec) */

    itv.it_value.tv_sec = freq / 1000000;
    itv.it_value.tv_usec = freq % 1000000;
    itv.it_interval.tv_sec = itv.it_value.tv_sec;
    itv.it_interval.tv_usec = itv.it_value.tv_usec;

    setitimer(ITIMER_PROF, &itv, NULL);
    signal(SIGPROF, profile_signal_handler);
  }

  /* ... a lot more code ... */
}

This sets up a profile timer. When such a timer expires, the kernel will send SIGPROF to the process. The code also registers a signal handler that will be invoked when this happens. It looks like this, much simplified:

struct profile_item           /* Item in our profiling hash table */
{
  char *key;                  /* Procedure name, taken from trace buffer */
  unsigned int sample_count;  /* Times this procedure was seen */
  struct profile_item *next;  /* Next bucket chain */
};

void profile_signal_handler(int signum)
{
  struct profile_item *pi;
  char *procedure_name;

  procedure_name = get_topmost_trace_entry();

  pi = profile_table_lookup(procedure_name);

  if (pi == NULL) {
    pi = profile_table_insert(procedure_name);
    pi->sample_count = 1;
  } else {
    pi->sample_count++;
  }
}
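
The profile_table_lookup() and profile_table_insert() helpers are not shown above. One way they could be implemented is with a plain fixed-size bucket array; this is just a sketch, not the code from runtime.c:

#include <stdlib.h>
#include <string.h>

#define PROFILE_TABLE_SIZE 3001

static struct profile_item *profile_table[PROFILE_TABLE_SIZE];

static unsigned int profile_hash(char *key)
{
  unsigned int h = 0;
  while(*key) h = h * 33 + (unsigned char)*key++;
  return h % PROFILE_TABLE_SIZE;
}

struct profile_item *profile_table_lookup(char *key)
{
  struct profile_item *pi = profile_table[profile_hash(key)];

  while(pi != NULL && strcmp(pi->key, key) != 0) pi = pi->next;
  return pi;
}

struct profile_item *profile_table_insert(char *key)
{
  unsigned int bucket = profile_hash(key);
  struct profile_item *pi = malloc(sizeof(struct profile_item));

  pi->key = strdup(key);   /* glosses over async-signal-safety of malloc() */
  pi->sample_count = 0;    /* the handler bumps this to 1 right away */
  pi->next = profile_table[bucket];
  profile_table[bucket] = pi;
  return pi;
}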

Maybe you are wondering why we're doing this in C rather than Scheme. After all, Scheme code is more readable and easier to maintain. There are a few reasons for that:

  • Scheme signal handlers are blocked in some critical sections of the run time library. This would delay profiling until the program is back in user code, skewing the results.
  • Core is compiled with -no-trace by default, but this can be turned off, which is useful when profiling core procedures. That wouldn't work with a signal handler written in Scheme: it would see its own code in the trace buffer instead of the code we want to trace!
  • The profiling code should be as low-overhead as possible, to avoid affecting the results. Remember, this is one of the main problems with the instrumenting profiler! While CHICKEN produces fast code, it is faster if we do it directly in C, and it will trigger no garbage collections.

After the program has completed its run, we must write the profile to a file. I won't show it here, but the CHICKEN implementation simply writes each key in the hash table with its call count and the time spent in that procedure. The time is estimated by multiplying the sampling interval by the number of samples.
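
In code, that dump is little more than a loop over the table. The sketch below reuses the hypothetical table from the previous sketch and a made-up output format; the real .PROFILE format differs, but the bookkeeping is the same:

#include <stdio.h>

void write_profile(char *filename, unsigned int sample_period_usec)
{
  FILE *fp = fopen(filename, "w");
  unsigned int i;
  struct profile_item *pi;

  for(i = 0; i < PROFILE_TABLE_SIZE; ++i) {
    for(pi = profile_table[i]; pi != NULL; pi = pi->next) {
      /* Estimated time = number of samples times the sampling interval */
      double seconds = pi->sample_count * (sample_period_usec / 1000000.0);
      fprintf(fp, "%s %u %f\n", pi->key, pi->sample_count, seconds);
    }
  }

  fclose(fp);
}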

This means we'll miss some calls, and therefore we'll under-represent the time taken by those procedures. On the other hand, some procedures are over-represented: if a sample is taken for a very fast procedure, we'll assign 10ms to it, even if it runs in a fraction of that. This is the essence of the statistical approach: if a program runs long enough, these measurement errors should balance out.

Comparing the two profilers

It's hard to come up with a small but representative example which is self-contained, so I'll use a few benchmarks from Mario's collection.

Low-level vs high-level

The first benchmark we'll profile is the kernwyk-array benchmark. It's taken from a historical set of benchmarks by Kernighan and Van Wyk, designed to compare the performance of various "scripting languages". This particular benchmark creates an array containing a million numbers by destructively initialising it. After that, it creates a second array into which the first is copied. This is repeated 100 times.

If we compile this with csc -O3 -profile and run it, the original profiler gives us the following breakdown:

procedure  calls  seconds  average  percent
my-try       100    4.008    0.040  100.000
go             1    4.008    4.008  100.000
create-y     100    2.311    0.023   57.684
create-x     100    1.696    0.016   42.315

If we compile this with csc -O3 and run it with -:p under the new profiler, we'll get a radically different result, even though the total run time did not change much:

procedure                          calls  seconds  average  percent
kernwyk-array.scm:12:make-vector     100    2.400    0.024   58.679
kernwyk-array.scm:5:make-vector      100    1.690    0.016   41.320

The statistical profiler is a bit more "low-level": It tells you exactly which procedure call, on which line, is taking the most time. The instrumentation-based profiler, on the other hand, shows a breakdown of which procedures the most time is spent in. The percentages and the "seconds" column mean something different, too: the original profiler reports the cumulative time each procedure takes up, so a program's main entry point will always sit at the top with 100%.

But the most significant difference is in what this tells us about where the time is spent: the original profiler tells us that create-y is a little slower than create-x. Reading such output would lead me to think that vector-ref and vector-set! probably take the most time. Yet if we remove all calls to those, the program still takes 2.6 seconds and the profiler output looks more or less the same, so they're not the biggest contributors to the total run time. Instead, make-vector is: it allocates, and that allocation causes garbage collections. And garbage collections are the real time consumers in this benchmark!

Precision of the two profilers

The next benchmark we'll look at is the nfa benchmark. I'm not sure about the origins of this benchmark. It emulates a depth-first NFA search for the regular expression ((a | c)* b c d) | (a* b c). This is matched against a string containing 133 "a" characters followed by "bc".

The output of the instrumenting profiler is completely useless, because only toplevel procedures are instrumented:

procedure      calls   seconds  average  percent
recursive-nfa  150000    8.071    0.000  100.000

This is another reason for wanting to "fix" the profiler: it doesn't give an inside view of where large procedures or closures are spending their time. You can manually tweak the programs by lifting all the inner procedures up to the toplevel. If these procedures close over some variables, you must turn those into extra arguments and pass them along when calling these procedures. It's a bit tedious, but if we do this for the benchmark, we'll get output that is more useful:

procedure         calls  seconds     average  percent
recursive-nfa    150000   30.687  0.00020458  100.000
state0           150000   30.559  0.00020373   99.582
state1         20100000   23.372  0.00000116   76.160
state2         20100000    8.423  0.00000041   27.450
state3         20100000    7.080  0.00000035   23.070
state4           150000    0.132  0.00000088    0.430

Now, if we take the original, untweaked program, and run it through the statistical profiler, we'll immediately get useful output:

procedure                        calls  seconds  average  percent
nfa.scm:14:state1                  146    5.830    0.039   72.602
nfa.scm:31:state3                  141    2.160    0.015   26.899
nfa.scm:9:state1                     2    0.020    0.010    0.249
nfa.scm:44:##sys#display-times       1    0.010    0.010    0.124
nfa.scm:9:state3                     1    0.010    0.010    0.124

This also shows a disadvantage of the statistical profiler: the call counts are all wrong! That's because the state procedures are extremely fast: in the original profiler you can see that they run 20 million times in 8 seconds or so. Because they're so fast, the average time per call is close to zero. This results in the timer being too slow for sampling each procedure while it is running. It's so slow we only see an extremely small fraction of all calls!

Nevertheless, we can clearly tell that most time is spent in state1 and state3. The calls for state2 are not even registering, because this state will return much sooner: there are almost no b, c or d characters in the input pattern, so it will just quickly "fall through" this procedure without a match. The reason it shows up in the original profiler is because the instrumentation itself is interfering with an accurate reading of time spent in the procedure.

The total run time of the instrumented and tweaked version is almost 31 seconds, while the total run time of the version with statistical profiling is less than 8 seconds on my laptop! Let's take a closer look at that overhead.

Instrumentation overhead

Having two distinct ways of gathering profile data opens up a really cool opportunity. We can measure the overhead introduced by the instrumentation profiler, by running it under the statistical profiler!

Let's do that on the "tweaked" NFA benchmark again:

 $ csc -O3 -profile nfa.scm
 $ ./nfa -:p
 30.88s CPU time, 2.404s GC time (major),
 163050000/149475510 mutations (total/tracked),
 5972/113664 GCs (major/minor)

And let's look at the result:

procedure                 calls  seconds  average  percent
##sys#profile-entry         701   17.000    0.024   55.051
##sys#profile-exit          633   10.710    0.016   34.682
##sys#dynamic-wind          145    1.660    0.011    5.375
nfa.scm:15:state2            61    0.670    0.010    2.169
nfa.scm:29:state3            41    0.440    0.010    1.424
nfa.scm:12:state1            31    0.340    0.010    1.101
nfa.scm:41:state0             3    0.040    0.013    0.129
nfa.scm:49:recursive-nfa      2    0.020    0.010    0.064

This clearly shows that the instrumentation machinery completely dominates the profile: A stunning 27 seconds are being soaked up by ##sys#profile-entry and ##sys#profile-exit!

Of course, this is just a contrived example of profiling a benchmark. An experienced Chickeneer would have been able to tell just from comparing the benchmark's output with and without profiling:

 $ csc -O3 nfa.scm
 $ ./nfa
 8.156s CPU time, 0.012s GC time (major), 32/6849 GCs (major/minor)
 $ csc -O3 -profile nfa.scm
 $ ./nfa
 30.628s CPU time, 2.316s GC time (major),
 163050000/149475510 mutations (total/tracked),
 5970/113666 GCs (major/minor)

Aside from the obvious increase in CPU time, you can see that the number of mutations went from zero to more than a hundred million. The number of garbage collections is also many times higher. This would spell certain doom if you saw it in a real program!

Conclusion

It's too early to be sure, but it looks like a statistical profiler is a useful alternative to instrumentation. On the other hand, for some programs the situation is reversed:

  • Programs blocking SIGPROF offer fewer sampling opportunities, resulting in incomplete profiles.
  • If you need the exact number of procedure calls, instrumentation is your only real option.
  • When there are lots of smallish procedures called by a few toplevel procedures, the noisy low-level output of the statistical profiler can drown out the useful information.

I think if we can reduce the overhead caused by the instrumentation, it'll become a lot more useful again. This won't be easy, because some of the overhead is fundamental to its use of dynamic-wind and the way inlining is prevented by it. In the meantime, please test the statistical profiler and let me know how it works for you!


Maybe you've noticed the little Flattr icons that I've added at the bottom of each post, and now you're wondering if, and why, you should support my blogging efforts with a donation.

Let me state this first and foremost: I enjoy writing posts about subjects that interest me and I learn a lot from the research I put into my posts. So, even without donations I'd continue doing so. But, it's also a fact that many of my posts demand a lot of effort. Researching, writing and illustrating an in-depth post can take weeks, depending on my energy levels and other activities.

I hope you now understand why this blog is updated so slowly. I'm convinced that a few informative, in-depth posts are better than frequent 100-word status updates about my projects. Those would be a waste of your time! But because it takes so much time and effort, I sometimes struggle to find the motivation for writing new posts.

This is where Flattr comes in: a small donation is a wonderful way to express how much you liked a post, and it can really lift my spirits and motivate me to write more!

As an added advantage, by seeing which posts get Flattr'd the most I can get a better idea of what you, the readership of my blog, like and what you don't like. This will help me choose which topics to write about more.

Why I refuse to put ads on my blog

Of course, I know that placing ads on my blog would be the easiest way to get some returns from the hard work that I put into my posts. This would easily repay my hosting costs. However, in my opinion, the advertising industry is the most toxic influence on the web today. It is completely absurd that it's considered normal for companies (and individuals with blogs...) to be rewarded for filling up our screens with obnoxious garbage, and for eroding our privacy through various tracking mechanisms. They don't care that all this digital pollution is only going to be acted on by a tiny fraction of all visitors (if any). It is downright disrespectful; personally I think it's as bad as spam e-mail.

In my opinion, Flattr is a much more benign way of receiving money for my efforts. The company was created by a founder of The Pirate Bay as a better and more direct way of rewarding creators than the commercial exploitation of copyright law by middlemen. It's also unobtrusive: just a small icon in each post. The integration that I'm using is a simple hyperlink, so there are no nasty tracking cookies or web beacons; I'm even hosting the icon image myself.

Besides that, Flattr is completely voluntary on the visitor's part. This means that any rewards are directly linked to people's enjoyment of my posts. This encourages informative posts of substance rather than click-bait titles with disappointing content which would draw more traffic, just to get more ad impressions.

These are my views on the matter. If you're using ads on your blog, I hope I've managed to convince you to switch away from ads to visitor-supported donations. And if you like my posts, I hope you'll consider making a small donation.

Integrating Flattr into Hyde

To make this post at least a little bit on-topic, I'd like to share how I added the Flattr button to my Hyde configuration. I added the following definition to hyde.scm (slightly modified):

(use hyde uri-common)

(define (flattr-button #!optional (page (current-page)))
  (let* ((base-uri (uri-reference "https://flattr.com/submit/auto"))
         (attrs `((user_id . "YOUR FLATTR ACCOUNT NAME")
                  ;; A bit ugly, but Hyde does it like this, too
                  (url . ,(string-append ($ 'base-uri page) (page-path page)))
                  (category . "text")
                  (language . ,(or ($ 'lang page) "en_GB"))
                  (title . ,($ 'title page))))
         ;; Flattr doesn't grok semicolon as query param separator...
         (flattr-page-uri
          (parameterize ((form-urlencoded-separator "&"))
            (uri->string (update-uri base-uri query: attrs)))))

    `(a (@ (class "flattr-button")
           (href ,flattr-page-uri))
        (img (@ (src "/pics/flattr-button.svg")
                (alt "Flattr this!")
                (title "Flattr this post!"))))))

Then, in my SXML definition for article layouts, which I call (surprise, surprise!) article.sxml:

;; -*- Scheme -*-
()

`(div (@ (class "article"))

        ,@(prev/next-navigation)

        (h1 (@ (class "article-title"))
            ,($ 'title) " "
            (small (@ (class "date"))
                   "Posted on " ,(format-seconds (page-updated))))

        (div (@ (class "article-body"))
             (inject ,contents)
             ,(flattr-button))

        ,@(prev/next-navigation))

The button immediately follows the contents of the blog post, which makes sense: only after reading a post will you know whether it was worth something to you.


In my earlier post about the garbage collector, I lied a little bit about the data representation that CHICKEN uses. At the end of the post I briefly mentioned how CHICKEN really stores objects. If you want to fully understand the way CHICKEN works, it is important to have a good grasp on how it stores data internally.

Basic idea

CHICKEN attempts to store data in the most "native" way it can. Even though it's written in C, it tries hard to use machine words everywhere. So on a 32-bit machine, the native code that's eventually generated will use 32-bit wide integers and pointers. On a 64-bit machine it will use 64-bit wide integers and pointers.

This machine word type is known as C_word, which is usually defined as an int or a long, depending on the platform. By the way, the C_ prefix stands for CHICKEN, not the C language. Every Scheme value is represented as a C_word internally. To understand how this can work, you need to know that there are roughly two kinds of objects.

Immediate values

First, there are the immediate values. These are the typical "atomic" values that come up a lot in computations. It is important to represent these as efficiently as possible, so they are packed directly in a C_word. This includes booleans, the empty list, small integers (these are called fixnums), characters and a few other special values.

Because these values are represented directly by a C_word, they can be compared in one instruction: eq? in Scheme. These values do not need to be heap-allocated: they fit directly in a register, and can be passed around "by value" in C. This also means they don't need to be tracked by the garbage collector!

At a high enough level, these values simply look like this:

This doesn't really show anything, does it? Well, bear with me...

Block objects

The other kind of value is the block object. This is a value that is represented as a pointer to a structure that contains a header and a variable-length data block.

The data block itself can conceptually be one of two kinds. In the case of a string or an srfi-4 object, the data block is simply an opaque "blob" or byte vector. In most other cases, the block is a compound value consisting of other Scheme objects. Typical examples are pairs, vectors and records.

Because these values are heap-allocated, two distinct objects are not stored at the same memory address, even if they store the same value. That's why comparing their values is a complex operation. This operation is either equal? for deep structural comparison, or eqv? for value comparisons of numbers and symbols.

The R5RS specification explains that the difference between eq? and eqv? is not necessarily the same across Scheme implementations. For example, in CHICKEN, eq? can be used to compare characters and fixnums, because they are stored as immediate values. Portable programs should not rely on that. If you use eq? on block objects, their pointers will be compared. That means it checks whether they are one and the same object. This can be a useful operation in its own right.

Objects represented by data blocks also have to be tracked by the garbage collector: if there are still references to the block, its data must be copied (recursively) to keep it alive across GC events.

Here are some "high-level" examples of block objects:

This picture should look somewhat familiar to students of SICP: it is reminiscent of the box-and-pointer notation used to illustrate the structure of lists. The boxes containing green text represent the object headers. The header indicates the type of object and the object's size. It also determines whether the object's data block is a byte block or a block containing Scheme objects: if it contains Scheme objects, the header tells us how many slots (locations for storing Scheme objects) the object has. Byte blocks, on the other hand, are opaque and can contain any data. Their size is stored as a byte count.

From top to bottom, left to right, these represent the following values:

  • (#\a . #\b) is a pair containing the character "a" in its car and "b" in its cdr.
  • #(#f 123 456 #f 42) is a regular Scheme vector containing fixnums and false values.
  • "hello" is a string consisting of 5 characters (strings are treated as byte vectors in CHICKEN).
  • 12.5 is an inexact representation of the number twelve and a half (a "flonum"). This is a byte block storing the raw byte value of a C double.
  • ("hello" . (12.5 . ())) is the first pair of a proper list which contains a string and a flonum.
  • (12.5 . ()) is the cdr of that list; a pair containing a number and the end-of-list marker.

The final two pair objects show that slots (like any C_word) can hold not only immediate values, but also pointers to block objects. This leads us to the question: how to differentiate between a pointer to an object and an immediate object?

Bit fiddling

Most platforms require pointers to words to be aligned on a word boundary. Thus, on a 32-bit machine, memory addresses will always have zero in the lower 2 bits, because we can only point to multiples of 4 bytes. On a 64-bit machine, word addresses will have zero in the lower 3 bits, because words are multiples of 8 bytes.

Because the lower two bits of a word pointer are never used, we can perform a simple trick: any value that has either of the lower two bits set cannot be a pointer, so we ensure that immediate objects always have at least one of those bits set. It may feel like a gross hack to people who are used to working with "clean", high-level C code, but it is a technique which goes back a long way: Orbit, one of the earliest optimising compilers for Scheme, did exactly the same thing. Other modern Schemes like Larceny and Gambit do the same thing. Even Scheme48, which is probably the cleanest Scheme implementation, uses tagged words. Other Lisps use this representation as well; see Steel Bank Common Lisp, for example.

Many other dynamic languages don't use a packed data representation like this. Many prefer the simpler but bulkier struct representation. At the other end of the spectrum, we have statically typed, non-garbage collected languages. They generally don't need to store the type of a value along with it. Instead, they can directly store the "un-boxed" value in memory. This, and the relation to garbage collection, is explained rather well in Appel's 1989 paper "Runtime Tags Aren't Necessary" (sorry, this is in PostScript format).

Representation of objects

We've learned how CHICKEN distinguishes between pointers to (block) objects and immediate values. Now we will look into the nitty-gritty details of the object representation.

We can make the following breakdown of bit patterns (assuming a 32-bit platform):

This shows that the lower two bits can be used to distinguish between block objects (zero) and immediate objects (nonzero). For immediate objects, the low bit can be used to distinguish between fixnum objects and other kinds of immediate objects. The colouring indicates which bits are used for tagging objects of that kind. The uncoloured bits are used for representing the object being stored.
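
The corresponding checks are a single bit-wise AND each. The macros below follow chicken.h in spirit; treat the exact spellings as illustrative:

#define C_IMMEDIATE_MARK_BITS  0x00000003
#define C_FIXNUM_BIT           0x00000001

#define C_immediatep(x)  ((x) & C_IMMEDIATE_MARK_BITS)        /* nonzero: immediate */
#define C_blockp(x)      (((x) & C_IMMEDIATE_MARK_BITS) == 0) /* pointer to a block */
#define C_fixnump(x)     ((x) & C_FIXNUM_BIT)                 /* low bit set: fixnum */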

Fixnums are distinguished from "other immediate" values because fixnums are so incredibly common: they are used for indexing into strings, loop counters and many calculations. These have to be represented as efficiently as possible while storing the widest possible range of values. Run time type checking for fixnums should use as few CPU instructions as possible.

The "other immediate" types are further differentiated through the top two bits of the lower nibble:

The unused "other immediate" type of 0010 is reserved for future use. To get a good feel for the representation of immediates, let us look at a few example bit patterns. I'll also show you how to construct them in C.

Bit patterns of immediate values

Fixnums

These small integer values are stored in regular old two's complement representation, like the CPU uses. The lowest bit is always 1, due to the fixnum tag bit. The highest bit is used to determine the sign of the number.

The C_fix() macro shifts its argument one bit to the left, and sets the lower bit through a bit-wise OR with 1. To convert a Scheme fixnum back to a C integer, you can use the C_unfix() macro. This shifts its argument one bit to the right.
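
Simplified, the two macros boil down to a shift and a bit-wise OR:

#define C_fix(n)    (((C_word)(n) << 1) | 1)   /* simplified; chicken.h is a bit more careful */
#define C_unfix(x)  ((x) >> 1)

/* Example: C_fix(3) is binary ...0111, i.e. the C integer 7,
 * and C_unfix(7) shifts the tag bit back out, yielding 3 again. */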

You might wonder what happens when you calculate or enter a very large integer. In CHICKEN 4, it will be coerced to a flonum. In CHICKEN 5, it will be stored as a bignum. Bignums are block objects, not immediates, because they may be arbitrarily large.

Booleans

That's a very large bit space for only two values. However, reserving a special type tag just for booleans simplifies type detection code: we only have to compare the lower four bits with 0110 to check whether an object is a boolean.
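
Concretely, the two values and the type check look like this (32-bit patterns shown):

#define C_SCHEME_FALSE  ((C_word)0x00000006)   /* ...0000 0110 */
#define C_SCHEME_TRUE   ((C_word)0x00000016)   /* ...0001 0110 */

/* "Is this a boolean?" really is one mask and one compare: */
#define C_booleanp(x)   (((x) & 0x0000000f) == C_BOOLEAN_BITS)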

Characters

Characters do not make full use of the available bits, because the lower byte's high nibble is always 0000. This means that only 24 bits are available for representing the character on 32-bit platforms. Luckily, this is enough for representing the full Unicode range. If Unicode ever starts using up a bigger code space, we can always sneak in 4 more bits.
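
Construction and extraction simply shift the code point past the tag byte (simplified from the chicken.h macros, which also mask off the top bits):

#define C_CHAR_SHIFT         8
#define C_make_character(c)  (((C_word)(c) << C_CHAR_SHIFT) | C_CHARACTER_BITS)
#define C_character_code(x)  ((x) >> C_CHAR_SHIFT)

/* Example: C_make_character('A') is (65 << 8) | 0xa, i.e. 0x0000410a. */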

Special objects

This list is exhaustive: currently there are only four special objects. There is a lot of room for adding other special objects, if that ever becomes necessary.

The "unbound variable" representation cannot be captured by a program: when it is evaluated, it immediately raises an exception. This is its intended function.

A closer look at block objects

Now that we know all about immediate values, let's turn to block objects. These are represented by a pointer to a C structure with a header and a data block. Slightly simplified, it looks like this:

#define C_uword  unsigned C_word
#define C_header C_uword

typedef struct
{
  C_header header;
  C_word data[];    /* Variable-length array: header determines length */
} C_SCHEME_BLOCK;

The header's bit pattern is broken up into three parts:

The bottom 24 bits encode the size of the object. On 64-bit machines, the bottom 56 bits are used for the size. The middle 4 bits encode the type of the object. The top 4 bits encode special properties to make the garbage collector's work easier:

  • C_GC_FORWARDING_BIT indicates this object has been forwarded elsewhere. To find the object at its new location, the entire header is shifted to the left (which shifts out this bit). Then, the value is reinterpreted as a pointer. Remember, the lowest two bits of word pointers are always zero, so we can do this with impunity!
  • C_BYTEBLOCK_BIT indicates this is a byte blob (size bits are interpreted in bytes, not words).
  • C_SPECIALBLOCK_BIT indicates that the first slot is special and should be skipped by the GC.
  • C_8ALIGN_BIT indicates that for this object, alignment must be maintained at an 8-byte boundary.
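
Picking the header apart is again just bit masking, roughly as in chicken.h (masks shown for a 32-bit build):

#define C_HEADER_SIZE_MASK  0x00ffffff
#define C_HEADER_TYPE_BITS  0x0f000000
#define C_HEADER_BITS_MASK  0xff000000   /* type bits plus the four GC bits */

#define C_block_header(b)  (((C_SCHEME_BLOCK *)(b))->header)
#define C_header_size(b)   (C_block_header(b) & C_HEADER_SIZE_MASK)  /* used by the GC code below */
#define C_header_bits(b)   (C_block_header(b) & C_HEADER_BITS_MASK)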

The type bits are assigned incrementally. There is room for 16 types, only 2 of which are currently unused. Let's look at the definitions, which should also help to explain the practical use of the latter 3 GC bits:

#define C_SYMBOL_TYPE            (0x01000000L)
#define C_STRING_TYPE            (0x02000000L | C_BYTEBLOCK_BIT)
#define C_PAIR_TYPE              (0x03000000L)
#define C_CLOSURE_TYPE           (0x04000000L | C_SPECIALBLOCK_BIT)
#define C_FLONUM_TYPE            (0x05000000L | C_BYTEBLOCK_BIT | C_8ALIGN_BIT)
/*      unused                   (0x06000000L ...) */
#define C_PORT_TYPE              (0x07000000L | C_SPECIALBLOCK_BIT)
#define C_STRUCTURE_TYPE         (0x08000000L)
#define C_POINTER_TYPE           (0x09000000L | C_SPECIALBLOCK_BIT)
#define C_LOCATIVE_TYPE          (0x0a000000L | C_SPECIALBLOCK_BIT)
#define C_TAGGED_POINTER_TYPE    (0x0b000000L | C_SPECIALBLOCK_BIT)
#define C_SWIG_POINTER_TYPE      (0x0c000000L | C_SPECIALBLOCK_BIT)
#define C_LAMBDA_INFO_TYPE       (0x0d000000L | C_BYTEBLOCK_BIT)
/*      unused                   (0x0e000000L ...) */
#define C_BUCKET_TYPE            (0x0f000000L)

Most of the types should be self-explanatory to a seasoned Schemer, but a few things deserve further explanation.

You'll note that in the STRING type tag, C_BYTEBLOCK_BIT is also set, for obvious reasons: strings do not consist of slots containing Scheme values, but of bytes, which are opaque. Because the header's size bits store the length in bytes instead of in words, we can spot a very important limitation: CHICKEN strings can only hold 16 MiB of data on a 32-bit machine (on a 64-bit machine, strings are "limited" to 65536 TiB).

The CLOSURE type uses C_SPECIALBLOCK_BIT. This indicates to the garbage collector that the first slot contains a raw non-Scheme value. In the case of a closure, it contains a pointer to a C function. The other slots contain free variables that were closed over ("captured") by the lambda, which are normal Scheme objects. The compiled C function "knows" which variable lives in which slot.

The FLONUM type uses C_BYTEBLOCK_BIT, because an un-boxed C double value is not a Scheme object: we want to treat the data as an opaque blob. On a 32-bit system, the double will take up two machine words, so we can't use C_SPECIALBLOCK_BIT. The header will therefore hold the value 8 as its size. It also has another GC bit: C_8ALIGN_BIT. This ensures that the 64-bit double is aligned on a 8-byte boundary, to avoid unaligned access on 32-bit systems. This adds some complexity to garbage collection and memory allocation.

The STRUCTURE type refers to a SRFI-9 type of record object. Its slots hold the record's fields, and the accessors and constructors "know" which field is stored at which index.

The POINTER type holds a raw C pointer inside a Scheme object. Again, because C pointers are not Scheme objects, the object's first (and only) slot is treated specially, via C_SPECIALBLOCK_BIT.

The LOCATIVE type represents a rather complicated object. It acts a bit like a pointer into a slab of memory. You can use it as a single value which represents a location inside another block object. This can then be used as an argument to a foreign function that expects a pointer. Its first slot holds a raw pointer. The other slots hold the offset, the type of pointer (encoded as fixnum) and the original object, unless it is a weak reference.

The TAGGED_POINTER type is exactly like POINTER, but it has an extra user-defined tag. This can make it easier for code to identify the pointer's type. The tag is a Scheme value held in its second slot.

The SWIG_POINTER type was used for compatibility with SWIG and has been removed in CHICKEN 5. It is basically the same as POINTER, with additional SWIG data added to it.

The LAMBDA_INFO type stores procedure introspection information (mostly for debugging).

The BUCKET type is a special internal pair-like object which is used in the linked list of symbols under a hash table bucket in the symbol table. It does not count as a reference, so that symbols can be garbage collected when only the symbol table still refers to them.

So far, the only numeric types we've seen are fixnums and flonums. What about the other numeric types? After all, CHICKEN 5 will (finally) have a full numeric tower!

In CHICKEN 5, rational and complex numbers are viewed as two simpler numbers stuck together. They're stored as records with a special tag, which the run-time system recognises. Bignums are a different story altogether. When I first implemented them, they used one of the two unused header types in the list above. For various reasons I won't go into now, they are now also represented as a record with a special tag and a slot that refers to the byte blob containing the actual bignum value. Perhaps this is something for a later blog post.

Putting it all together in the garbage collector

So far, all of this perhaps sounds rather arbitrary and complex. The data representation is finely tuned to fit the garbage collector, and vice versa, so it may help to see how this simplifies the garbage collector.

The way the data representation is set up, the garbage collector only has to perform a few very basic checks. It does not need to know about any of the data types at all, it only needs to look at the special GC bits, and the size of an object!

Now we're finally ready to understand the heart of the garbage collector, which scans the live data and marks nested objects. This part of CHICKEN implements the Cheney algorithm. It's only 22 lines of code, without any simplifications. This is taken directly from runtime.c, with comments added for exposition:

/* Mark nested values in already moved (marked) blocks
   in breadth-first manner: */
while(heap_scan_top < (gc_mode == GC_MINOR ? C_fromspace_top : tospace_top)) {
  bp = (C_SCHEME_BLOCK *)heap_scan_top; /* Get next object from queue */

  /* If this word is an alignment hole marker, skip it */
  if(*((C_word *)bp) == ALIGNMENT_HOLE_MARKER)
    bp = (C_SCHEME_BLOCK *)((C_word *)bp + 1);

  n = C_header_size(bp);  /* Extract size bits from header */
  h = bp->header;         /* Remember header for masking other bits */
  bytes = (h & C_BYTEBLOCK_BIT) ? n : n * sizeof(C_word);  /* Size in bytes */
  p = bp->data;           /* Data block (first slot) */

  if(n > 0 && (h & C_BYTEBLOCK_BIT) == 0) { /* Contains slots, not bytes? */
    if(h & C_SPECIALBLOCK_BIT) { /* Skip first word (not a Scheme object) */
      --n;
      ++p;
    }

    while(n--) mark(p++); /* Mark Scheme objects in data slots */
  }

  /* Advance onto next word just after object */
  heap_scan_top = (C_byte *)bp + C_align(bytes) + sizeof(C_word);
}

The comment at the start refers to the fact that the "tip of the iceberg" of live data has already been copied; this code scans that set for nested objects referred to by those live objects. See my post about the garbage collector for more about how the GC and Cheney's algorithm work.

If we're in a minor GC, this code scans over the fromspace, which is the memory area into which the nursery objects will be copied. If we're in a major GC, we're scanning over tospace, which is the other half of the heap, to which the fromspace will be copied.

The code above simply advances the heap_scan_top pointer over the objects we need to look at until we hit the end of this space. It then checks for an ALIGNMENT_HOLE_MARKER, which is a magic value that gets used as a placeholder to indicate that this machine word should be skipped. This placeholder may get inserted when allocating a C_8ALIGN_BIT object, to avoid unaligned access.

Next, the size (in bytes) of the object is determined, based on the C_BYTEBLOCK_BIT. Finally, if it's a data block (C_BYTEBLOCK_BIT is not set), we loop over the data slots. The first word is skipped if it's indicated as "special" via C_SPECIALBLOCK_BIT.

The mark() call hides the hairy part. It performs the following steps:

  • Check that the word contains a block object. Otherwise, return because it's an immediate value.
  • Check that the word points to memory that's being moved, otherwise return. This avoids copying already copied or evicted data.
  • If the object has the C_GC_FORWARDING_BIT set, just update the marked slot with the new location the object was forwarded to, and return.
  • If we're on a 32-bit machine, the object to be copied has the C_8ALIGN_BIT set, and the current top of the target heap area is not aligned, insert an ALIGNMENT_HOLE_MARKER.
  • In case the target area is too small to hold the object, interrupt the current GC and trigger the "next" GC type. This will be a major collection if we're currently doing a minor collection, or a heap reallocating major collection if we're in a regular major collection.
  • Finally, copy the object via a simple memcpy().
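
Put together as code, those steps look roughly like the sketch below. This is not CHICKEN's actual mark(): the two extern helpers are hypothetical stand-ins for the real bookkeeping in runtime.c, and the alignment-hole and "target area full" handling is only hinted at.

#include <string.h>   /* for memcpy() */

extern int lies_in_area_being_collected(C_SCHEME_BLOCK *obj);     /* hypothetical */
extern C_SCHEME_BLOCK *allocate_in_target_space(C_uword bytes);   /* hypothetical */

static void mark_sketch(C_word *slot)
{
  C_word val = *slot;
  C_SCHEME_BLOCK *obj, *copy;
  C_uword n, bytes;

  if((val & C_IMMEDIATE_MARK_BITS) != 0) return;  /* immediate value: nothing to copy */

  obj = (C_SCHEME_BLOCK *)val;
  if(!lies_in_area_being_collected(obj)) return;  /* already evacuated, or static data */

  if(obj->header & C_GC_FORWARDING_BIT) {         /* copied earlier: follow the pointer */
    *slot = (C_word)(obj->header << 1);           /* shifting left drops the forwarding bit */
    return;
  }

  /* (C_8ALIGN_BIT hole insertion and triggering the "next" GC type omitted) */

  n = C_header_size(obj);
  bytes = sizeof(C_header) +
          C_align((obj->header & C_BYTEBLOCK_BIT) ? n : n * sizeof(C_word));

  copy = allocate_in_target_space(bytes);
  memcpy(copy, obj, bytes);

  /* Leave a forwarding pointer behind: the new address, shifted right by one
   * so that the forwarding bit fits in the header's top bit. */
  obj->header = ((C_uword)copy >> 1) | C_GC_FORWARDING_BIT;

  *slot = (C_word)copy;
}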

Because this is done by mark() and not by the scanning code shown above, all this is only performed if the object in question is a block object which needs to be copied (the mark() macro inlines the first check). Just scanning the live data is extremely fast. We can thank the data representation's simplicity for that speed!

Further reading

There is a lot of information about this stuff, but it can be a little hard to find. Here are a few links I found during my research for this blog post:

