Let's add a statistical profiler to CHICKEN!

I just submitted a patch to add a statistical profiler to CHICKEN. In this post, I'll explain how it works. It's easier than you'd think!

Instrumentation-based profiling

There are two major ways to profile a program. The first way has been supported by CHICKEN as long as I can remember: You add instrumentation to each procedure. This counts how often the procedure is called, and how much time is spent in it.

You've probably done this by hand: You check the clock before and after calling a procedure, and print the difference. This can be useful when whittling down a specific procedure's run time. But when you want to know where the bottlenecks are in the first place, it's less practical. You don't want to manually add this stuff to all your procedures!

In order to easily instrument each procedure, you'll need language support, either in the compiler or in the run time. Unfortunately, the instrumentation itself will cause your program to slow down: all this tracking takes some time! That's why CHICKEN's profiler is part of the compiler: Instrumentation is emitted only when you compile with -profile. This option adds wrappers around each procedure, which look like this:

;; Original source code:
(define (foo a b c)
  (print (+ a b) c))

;; Instrumented version created by -profile:
(define foo
  (lambda args
    (dynamic-wind  ; Explained later
      (lambda () (##sys#profile-entry 0 profile-info-1234.4567))
      (lambda () (apply (lambda (a b c) (print (+ a b) c)) args))
      (lambda () (##sys#profile-exit 0 profile-info-1234.4567)) ) ) )

In the above code, ##sys#profile-entry starts the clock and increments the call count for this procedure, and ##sys#profile-exit stops the clock. The profile-info-1234.4567 is a global vector in which each procedure of this compilation unit is assigned a unique position.

To create the vector and assign the positions, a prelude is added to the compilation unit. This defines the vector and registers a position for each procedure:

;; Prelude for entire program:
(define profile-info-1234.4567 (##sys#register-profile-info 1 #t))
(##sys#set-profile-info-vector! profile-info-1234.4567 0 'foo)

;; Not exactly true, but let's pretend because it's close enough:
(on-exit ##sys#finish-profile)

This simply creates a vector of size one, and assigns the foo procedure to position zero. Then it requests ##sys#finish-profile to run when the program exits. This will write profile information to disk on exit.

We need `dynamic-wind`, but it creates problems

If you're not familiar with it, dynamic-wind is a sort of "guard" or "try/finally". Whenever the second lambda is entered, the first lambda ("before") is invoked, and when it is left, the third lambda ("after") is invoked. To understand why we need a dynamic-wind in the code presented earlier, consider the naive, incorrect implementation:

;; INCORRECT expansion:
(define foo
  (lambda args
    (begin
      (##sys#profile-entry 0 profile-info-1234.4567)
      (apply (lambda (a b c) (print (+ a b) c)) args)
      ;; Not reached if + throws an exception.  This
      ;; will happen when a or b are not numbers!
      (##sys#profile-exit 0 profile-info-1234.4567) ) ) )

Here, the "after" part will be skipped if the procedure raises an exception. Furthermore, a continuation might be captured or called in the procedure, even multiple times. This would cause the code to jump in and out of the procedure without neatly going over the before/after bits every time. Because of this, the profiler would miss these exits and re-entries, and hence it would not be able to accurately keep track of the time actually spent in this procedure.

The code with dynamic-wind will take care of non-local exits, stopping the clock whenever we jump out of the procedure, and starting it when we jump in again via a captured continuation.

While dynamic-wind is necessary, it also implies quite a heavy hit on performance: dynamic-wind isn't cheap. Even worse is the fact that it prevents the compiler from inlining small procedures in larger ones. Furthermore, the use of apply implies we'll need to cons up an argument list. Normally, arguments aren't put into a Scheme list, because doing so results in more garbage being created. This means that the performance shape of the profiled application can be quite different from the original, non-profiled application!

Statistical profiling

I've always wanted to look into fixing the profiler, but never had the energy to do so. Now that Felix wrote an entire graphical debugger for CHICKEN, I thought maybe I could lean on the debugger's infrastructure to make a better profiler. But it turned out I didn't have to!

First, I should explain statistical profiling. The basic idea is that the process is periodically sampled while it's running. These samples are taken by inspecting the instruction pointer and mapping it to a procedure. If you do this often enough (every 10 ms or so), you can get a pretty good idea of where the program is spending most of its time.

CHICKEN's trace buffer

Looking at the instruction pointer or C function is not very useful in CHICKEN, unless you like to pore over endless piles of machine-generated C code and to mentally map it back to Scheme. It can be educational and even fun, just like it can be fun and educational to read the assembly output of a C compiler, but it is generally unproductive and gets frustrating quickly.

So how can we take a snapshot of what a Scheme program is doing at any given time? It turns out that CHICKEN already does this: when a Scheme program raises an exception, the interpreter will show you a call trace. This is a bit like a stack trace in a "traditional" language, only it shows a trace of the execution flow that led to the error. Let's look at a contrived example:

(define (fib n)
  (if (< n 2)
      n
      (+ (fib (- n 1)) (fib (- n 2))) ) )

(define (run-fib)
  (let* ((n (string->number (car (command-line-arguments))))
         (result (fib n)))
    ;; This line is wrong: It tries to append a number to a string
    (print (string-append "Result: " result)) ) )

(run-fib)

When you compile the program and run it, you'll see the output:

 $ csc -O3 fib.scm
 $ ./fib 3
 
 Error: (string-append) bad argument type - not a string: 2
 
         Call history:
         
         fib.scm:12: run-fib
         fib.scm:7: command-line-arguments
         fib.scm:8: fib
         fib.scm:4: fib
         fib.scm:4: fib
         fib.scm:4: fib
         fib.scm:4: fib
         fib.scm:10: string-append               <--

In a stack trace you would see only two lines, mentioning run-fib and string-append. Here we can see the trace of execution through the program, where it entered run-fib, then called command-line-arguments, proceeded to invoke fib five times in total, and finally it invoked string-append with the result.

There are advantages and disadvantages to either approach, but a trace makes more sense in languages with full continuations. It is the natural thing to do in a compiler that converts all programs to continuation-passing style. It does tend to confuse beginners, though!

So, these trace points are already inserted into every program by default. This happens even in programs optimised with -O3, because trace points have very little overhead: It's just a pointer into a ring buffer that gets updated to point to the procedure's name. You can choose to omit traces completely via -no-trace or -d0, but that's not the default.

Trace points are a good fit for our profiler: when taking a sample, we can simply take a look at the newest entry in the trace buffer, which will always reflect the procedure that's currently running!

Setting up a sampler

So how can we interrupt the program at a point in time, without messing with the program's state? This is simple: we ask for a signal to be delivered periodically. There's even a dedicated signal reserved for this very task: SIGPROF.

We'll use setitimer() to set up the timer which causes the signal to be delivered, even though POSIX says it's obsolete. It is much more convenient and more widely supported than the alternative, timer_create() plus timer_settime(). We can always switch when setitimer() is removed from an actual POSIX implementation.

The following setup code is simplified a bit (most importantly, error handling is omitted):

C_word CHICKEN_run(void *toplevel) { /* Initialisation code for CHICKEN */
  /* ... a lot more code ... */

  if (profiling) {
    struct itimerval itv;
    time_t freq = 10000;                   /* 10 msec (in usec) */

    itv.it_value.tv_sec = freq / 1000000;
    itv.it_value.tv_usec = freq % 1000000;
    itv.it_interval.tv_sec = itv.it_value.tv_sec;
    itv.it_interval.tv_usec = itv.it_value.tv_usec;

    setitimer(ITIMER_PROF, &itv, NULL);
    signal(SIGPROF, profile_signal_handler);
  }

  /* ... a lot more code ... */
}

This sets up a profile timer. When such a timer expires, the kernel will send SIGPROF to the process. The code also registers a signal handler that will be invoked when this happens. It looks like this, much simplified:

struct profile_item           /* Item in our profiling hash table */
{
  char *key;                  /* Procedure name, taken from trace buffer */
  unsigned int sample_count;  /* Times this procedure was seen */
  struct profile_item *next;  /* Next bucket chain */
};

void profile_signal_handler(int signum)
{
  struct profile_item *pi;
  char *procedure_name;

  procedure_name = get_topmost_trace_entry();

  pi = profile_table_lookup(procedure_name);

  if (pi == NULL) {
    pi = profile_table_insert(procedure_name);
    pi->sample_count = 1;
  } else {
    pi->sample_count++;
  }
}

Maybe you are wondering why we're doing this in C rather than Scheme. After all, Scheme code is more readable and easier to maintain. There are a few reasons for that:

Scheme signal handlers are blocked in some critical sections of the run time library. This would delay profiling until the program is back in user code, skewing the results.
Core is compiled with -no-trace by default, but this can be turned off. Doing so can be useful when profiling core procedures, but not with a signal handler in Scheme. It would see its own code in the trace buffer, instead of what we want to trace!
The profiling code should be as low-overhead as possible, to avoid affecting the results. Remember, this is one of the main problems with the instrumenting profiler! While CHICKEN produces fast code, it is faster if we do it directly in C, and it will trigger no garbage collections.

After the program has completed its run, we must write the profile to a file. I won't show it here, but the CHICKEN implementation simply writes each key in the hash table with its call count and the time spent in that procedure. The time is estimated by multiplying the sampling frequency by the call count.

This means we'll miss some calls, and therefore we'll under-represent the time taken by those procedures. On the other hand, some procedures are over-represented: if a sample is taken for a very fast procedure, we'll assign 10ms to it, even if it runs in a fraction of that. This is the essence of the statistical approach: if a program runs long enough, these measurement errors should balance out.

Comparing the two profilers

It's hard to come up with a small but representative example which is self-contained, so I'll use a few benchmarks from Mario's collection.

Low-level vs high-level

The first benchmark we'll profile is the kernwyk-array benchmark. It's taken from a historical set of benchmarks by Kernighan and Van Wyk, designed to compare the performance of various "scripting languages". This particular benchmark creates an array containing a million numbers by destructively initialising it. After that, it creates a second array into which the first is copied. This is repeated 100 times.

If we compile this with csc -O3 -profile and run it, the original profiler gives us the following breakdown:

procedure	calls	seconds	average	percent
`my-try`	100	4.008	0.040	100.000
`go`	1	4.008	4.008	100.000
`create-y`	100	2.311	0.023	57.684
`create-x`	100	1.696	0.016	42.315

If we compile this with csc -O3 and run it with -:p under the new profiler, we'll get a radically different result, even though the total run time did not change much:

procedure	calls	seconds	average	percent
`kernwyk-array.scm:12:make-vector`	100	2.400	0.024	58.679
`kernwyk-array.scm:5:make-vector`	100	1.690	0.016	41.320

The statistical profiler is a bit more "low-level": It tells you exactly the procedure call and line that is taking the most time. On the other hand, the instrumentation-based profiler shows us a breakdown in which procedure the most time is spent. The percentages and "seconds" column are also different: the original profiler shows us the cumulative time each procedure takes up. Thus, a main entry point will always be at 100% at the top.

But the most significant difference is in what this tells us about where the time is spent: the original profiler tells us that create-y is a little slower than create-x. Reading such output would lead me to think that probably vector-ref and vector-set! take the most time. If we remove all calls to those, the program takes 2.6 seconds, and the profiler output looks more or less the same, so they're not the biggest contributor to the total run time. Instead, make-vector is, due to the fact that it allocates, which will cause garbage collections. And garbage collections are the real time consumers in this benchmark!

Precision of the two profilers

The next benchmark we'll look at is the nfa benchmark. I'm not sure about the origins of this benchmark. It emulates a depth-first NFA search for the regular expression ((a | c)* b c d) | (a* b c). This is matched against a string containing 133 "a" characters followed by "bc".

The output of the instrumenting profiler is completely useless, because only toplevel procedures are instrumented:

procedure	calls	seconds	average	percent
`recursive-nfa`	150000	8.071	0.000	100.000

This is another reason for wanting to "fix" the profiler: it doesn't give an inside view of where large procedures or closures are spending their time. You can manually tweak the programs by lifting all the inner procedures up to the toplevel. If these procedures close over some variables, you must turn those into extra arguments and pass them along when calling these procedures. It's a bit tedious, but if we do this for the benchmark, we'll get output that is more useful:

procedure	calls	seconds	average	percent
`recursive-nfa`	150000	30.687	0.00020458	100.000
`state0`	150000	30.559	0.00020373	99.582
`state1`	20100000	23.372	0.00000116	76.160
`state2`	20100000	8.423	0.00000041	27.450
`state3`	20100000	7.080	0.00000035	23.070
`state4`	150000	0.132	0.00000088	0.430

Now, if we take the original, untweaked program, and run it through the statistical profiler, we'll immediately get useful output:

procedure	calls	seconds	average	percent
`nfa.scm:14:state1`	146	5.830	0.039	72.602
`nfa.scm:31:state3`	141	2.160	0.015	26.899
`nfa.scm:9:state1`	2	0.020	0.010	0.249
`nfa.scm:44:##sys#display-times`	1	0.010	0.010	0.124
`nfa.scm:9:state3`	1	0.010	0.010	0.124

This also shows a disadvantage of the statistical profiler: the call counts are all wrong! That's because the state procedures are extremely fast: in the original profiler you can see that they run 20 million times in 8 seconds or so. Because they're so fast, the average time per call is close to zero. This results in the timer being too slow for sampling each procedure while it is running. It's so slow we only see an extremely small fraction of all calls!

Nevertheless, we can clearly tell that most time is spent in state1 and state3. The calls for state2 are not even registering, because this state will return much sooner: there are almost no b, c or d characters in the input pattern, so it will just quickly "fall through" this procedure without a match. The reason it shows up in the original profiler is because the instrumentation itself is interfering with an accurate reading of time spent in the procedure.

The total run time of the instrumented and tweaked version is almost 31 seconds, while the total run time of the version with statistical profiling is less than 8 seconds on my laptop! Let's take a closer look at that overhead.

Instrumentation overhead

Having two distinct ways of gathering profile data opens up a really cool opportunity. We can measure the overhead introduced by the instrumentation profiler, by running it under the statistical profiler!

Let's do that on the "tweaked" NFA benchmark again:

 $ csc -O3 -profile nfa.scm
 $ ./nfa -:p
 30.88s CPU time, 2.404s GC time (major),
 163050000/149475510 mutations (total/tracked),
 5972/113664 GCs (major/minor)

And let's look at the result:

procedure	calls	seconds	average	percent
`##sys#profile-entry`	701	17.000	0.024	55.051
`##sys#profile-exit`	633	10.710	0.016	34.682
`##sys#dynamic-wind`	145	1.660	0.011	5.375
`nfa.scm:15:state2`	61	0.670	0.010	2.169
`nfa.scm:29:state3`	41	0.440	0.010	1.424
`nfa.scm:12:state1`	31	0.340	0.010	1.101
`nfa.scm:41:state0`	3	0.040	0.013	0.129
`nfa.scm:49:recursive-nfa`	2	0.020	0.010	0.064

This clearly shows that the instrumentation machinery completely dominates the profile: A stunning 27 seconds are being soaked up by ##sys#profile-entry and ##sys#profile-exit!

Of course, this is just a contrived example of profiling a benchmark. An experienced Chickeneer would have been able to just tell from the output of the benchmark with and without profiling:

 $ csc -O3 nfa.scm
 $ ./nfa
 8.156s CPU time, 0.012s GC time (major), 32/6849 GCs (major/minor)
 $ csc -O3 -profile nfa.scm
 $ ./nfa
 30.628s CPU time, 2.316s GC time (major),
 163050000/149475510 mutations (total/tracked),
 5970/113666 GCs (major/minor)

Aside from the obvious increase in CPU time, you can see that the number of mutations went from zero to more than a hundred million. The number of garbage collections is also many times higher. This would spell certain doom if you saw it in a real program!

Conclusion

It's too early to be sure, but it looks like a statistical profiler is a useful alternative to instrumentation. On the other hand, for some programs the situation is reversed:

Programs blocking SIGPROF offer fewer sampling opportunities, resulting in incomplete profiles.
If you need the exact number of procedure calls, instrumentation is your only real option.
When there are lots of smallish procedures called by a few toplevel procedures, the noisy low-level output of the statistical profiler can drown out the useful information.

I think if we can push down the performance bottleneck caused by the instrumentation, it'll become a lot more useful again. This won't be easy, because some of the overhead is fundamental to its use of dynamic-wind and the way inlining is prevented by it. In the mean time, please test the statistical profiler and let me know how it works for you!

More magic

Cautionary tales from a programmer

About this blog

Let's add a statistical profiler to CHICKEN! Posted on 2016-01-04

Instrumentation-based profiling

We need `dynamic-wind`, but it creates problems

Statistical profiling

CHICKEN's trace buffer

Setting up a sampler

Comparing the two profilers

Low-level vs high-level

Precision of the two profilers

Instrumentation overhead

Conclusion

More magic

Cautionary tales from a programmer

About this blog

Let's add a statistical profiler to CHICKEN! Posted on 2016-01-04

Instrumentation-based profiling

We need dynamic-wind, but it creates problems

Statistical profiling

CHICKEN's trace buffer

Setting up a sampler

Comparing the two profilers

Low-level vs high-level

Precision of the two profilers

Instrumentation overhead

Conclusion

We need `dynamic-wind`, but it creates problems