About this blog

Hi, I'm Peter Bex, a Scheme and free software enthusiast from the Netherlands. See my user page on the CHICKEN wiki or my git server for some of my projects.


Are you in need of support for CHICKEN Scheme? Or maybe you want clear tech writing like on this very blog? Then you're in luck! I am now available as a freelance consultant!


We're getting close to a CHICKEN 5 release, so let's take a look at the cool new stuff!

Overhaul of built-in modules

The biggest change you'll notice when you fire up CHICKEN and start to use it is that the modules that come shipped with core are completely different from CHICKEN 4. The functionality is mostly the same, but we moved things around (a lot!) to make things more logical.

This is also the main reason we decided to bump the major version number: the modules have different names, procedures have been renamed, merged or dropped.

You can take a look at the complete list in the CHICKEN 5 manual. We've taken the module layout from R7RS small as inspiration, but since CHICKEN is still an R5RS Scheme first (with r7rs being an optional extension) we had to make some changes.

So, we define a scheme module which contains the entire R5RS language. For everything that is a CHICKEN-specific extension to standard R5RS Scheme, we put it under a (chicken ...) name, which tries to follow the R7RS naming conventions.

For example, R7RS defines a (scheme process-context) module with the following procedures:

  • command-line
  • exit
  • emergency-exit
  • get-environment-variable
  • get-environment-variables

Likewise, CHICKEN defines a (chicken process-context) module, which is a superset of the corresponding R7RS module. Take a look at its manual page; you can see that it defines many more procedures, but it includes all the standard ones too.

By using the R7RS names but with scheme replaced by chicken, the new modules should be easy to remember for anyone used to R7RS. Of course, you can still write portable standard R7RS programs via the r7rs egg, which defines a 100% compatible (scheme process-context) module with only the R7RS identifiers.
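
For instance, here's a quick sketch of what this looks like in practice (a sketch only; check the (chicken process-context) manual page for the exact export list — command-line-arguments is one of the CHICKEN-specific additions):

(import (chicken base)             ; for print
        (chicken process-context))

(print "HOME = " (get-environment-variable "HOME"))
(print "args = " (command-line-arguments))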

There is one important caveat: because our scheme module exports everything from R5RS Scheme, we don't provide, say, a (chicken cxr) module for all the cadadr, caddar and so on, because those are all in scheme. This also means that the (chicken load) module does not export load; that's already in scheme. Instead, it defines various non-standard CHICKEN extensions like load-relative and such.

Saner module imports

Speaking of modules, we've improved the way modules are linked into user code. In CHICKEN 4, there was a very strict distinction between modules and (compilation) units. This was an endless source of confusion for beginners. For example, why did (import foo) give an error when you tried to actually refer to an identifier from the foo module? That's because import didn't actually load the code, just the import library. To actually load the code and import the library, you needed (use foo). You could also load the code without importing it via (require-library foo). This distinction was supposed to help with cross-compilation: the idea was that you would only need to load the import library on the host and compile the library itself for the target, but in practice you needed to compile the library twice anyway (once on the host, once for the target).

We got rid of this mess: now the canonical way to import the foo library is simply (import foo). For more info, see this post by Felix outlining how to improve imports.
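
So, assuming a library named foo, the difference boils down to this:

(use foo)      ; CHICKEN 4: load the foo library *and* import its identifiers
(import foo)   ; CHICKEN 5: the one canonical form, which now does both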

Full numeric tower

Of course, support for the full numeric tower is a personal favorite of mine, having spent a lot of time perfecting this stuff!

Most importantly, this means you no longer need to worry about integer computations over- or underflowing into a flonum and all the weird floating-point problems that entails. Bignums are also a necessity when dealing with 64-bit numeric C types in the FFI. For example, we finally support the size_t type correctly. To me, complex numbers and exact fractions (aka rational numbers) are a nice added bonus, as you could already get them before with the numbers egg. However, by having these types built-in, they're more efficient and you don't have to worry about passing these numbers to code that can't handle them because support happened not to be compiled in.
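
To give a feel for it, a few illustrative interactions (a sketch of REPL behaviour, not verbatim transcripts):

(expt 2 100)   ; => 1267650600228229401496703205376, an exact bignum
(/ 1 3)        ; => 1/3, an exact ratnum instead of an inexact 0.333...
(* 3 1/3)      ; => 1, exactness is preserved
(sqrt -1)      ; => a complex number, without needing the numbers egg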

Take some time to read my blog series about the numeric tower if you're interested in the details.

Declarative egg description language

The chicken-install program to install eggs was rewritten along with all the surrounding tools. The main reason to do this was to make the life of package maintainers easier.

The old version of chicken-install would download, build, install and (optionally) run the unit tests as part of one command. If any dependencies were missing, it would also recursively download, build, install and run tests for those as well. The new version cleanly separates these steps, by generating shell scripts (batch files on Windows) that can do the necessary actions to build and install.

To make this easier, we also had to re-think the egg "language". In CHICKEN 4, a .setup file was simply a Scheme program in which a few helper procedures were available for calling the compiler. This made it impossible to generate a simple shell script that separates the build and install steps. That's why we now have a separate, declarative file which describes the components of an egg. See the .egg file documentation for a concrete example.
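
To give a rough idea of the shape of such a file, here's a hypothetical, minimal sketch (the property and component names should be checked against the linked documentation):

;; my-egg.egg -- hypothetical minimal egg description
((synopsis "An example egg")
 (author "Some Author")
 (license "BSD")
 (dependencies srfi-1)
 (components (extension my-egg)))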

The rewritten chicken-install will now also cache eggs to avoid re-downloading the same eggs again and again. By default the cache is stored in a dot-directory under the user's home directory. This can be overridden with the CHICKEN_EGG_CACHE environment variable, which might also help package maintainers take the distributed files from another location.

See these design notes for more information about the goals and motivations behind the rewrite.

Improved support for static compilation

In principle, CHICKEN 4 had good support for static compilation. In practice, egg authors would not include the necessary commands for building their libraries statically. Most people don't have a real need for static linking, which means they tend not to make an effort to support it just in case someone else might need it.

The upshot of this was that you could only really compile programs statically when they didn't use any eggs, or if you created a custom build script that would compile the eggs manually with the required -static option. With the new chicken-install, you get static compilation support automatically, for free.

Note that in CHICKEN 4, you could also build eggs and programs using the so-called deployment mode. This allowed shipping a program with all its libraries in one directory. This worked quite well if your target platform supported it, but not all platforms did. Static compilation covers all the use cases that deployment supported and works reliably on all platforms, so we decided to drop deployment mode with all the complexity it brings.

Other noteworthy things

But wait, there's more!

  • Code generation is now fully deterministic, making builds reproducible. This allows you to verify that any given file of generated C code corresponds to the Scheme source code by recompiling it with the same CHICKEN version, both for user code and for CHICKEN core itself. As an added bonus, because the generated C output is deterministic, ccache can be used to get much faster builds (before, it would invalidate the cache as each file would be different).
  • We've improved how symbols are garbage collected, which was optional and somewhat broken in CHICKEN 4. This will speed up code that generates many symbols, and stops symbol table stuffing attacks from being a threat.
  • We have removed quite a bit of bloat: the srfi-1, srfi-13, srfi-14, srfi-69 and srfi-18 libraries have been removed from core! Not to worry though; they are now available as eggs. This will both allow faster development and encourage innovation and competition from alternatives to these non-essential libraries (R7RS-large in particular seems geared towards renewing some of these). We've also moved several non-SRFI procedures out of core: object-evict, compile-file, binary-search, procedures for dealing with queues, scan-input-lines and POSIX group-information have all been moved to eggs. Support for SWIG has been removed, as it was bit-rotting and nobody seemed to be using it anyway.
  • Ports can now be bi-directional, so there's no more unnecessary distinction between input ports and output ports. This maps more cleanly onto file descriptors, which can also be opened for both reading and writing.
  • Random number generation has been completely replaced. Before, we used libc's rand(), which produces very low quality random numbers. CHICKEN 5 uses the WELL512 PRNG to generate random integers, and it provides access to the system entropy pool for generating cryptographically secure streams of random bytes (using /dev/urandom on *nix, and RtlGenRandom on Windows).

Conclusion

There's a lot to like about the new CHICKEN, so go ahead and give it a spin! Release candidate 1 was made available today for you to try. The full list of changes can of course be found in the NEWS file. If you're already a happy CHICKEN 4 user, we've created a porting guide for you, to make it easier to make the transition from 4 to 5. If you need more help, you can of course contact the always friendly CHICKEN community.


Now that we have covered the most important algorithms, it's time to take a look at the internal data representation of extended numerals. This will be the final part in this series of posts.

Ratnums and cplxnums

Recall from my data representations article that fixnums are represented as immediate values, directly within a machine word. Flonums are represented in boxed form, by putting them in an opaque bytevector-like structure.

The data representations of complex numbers and rational numbers are pretty simple. Each has its own type tag, and both contain two slots: the numerator and denominator in case of rational numbers, and the real and imaginary parts in case of complex numbers.

As you can see in the above diagram, the representations of ratnums and cplxnums are very similar. In the example, the slots contain just fixnums. Rational numbers are the simplest here: they can only contain integers (bignums or fixnums). Complex numbers can consist of any number type except other complex numbers, but the exactness of the real and imaginary components must match. This means you can't have 1.5+2/3i, for example.

In its most complex (haha!) form, a complex number contains a rational number in both the real and the imaginary parts, and these rational numbers both contain bignums as their numerator and denominator. In this situation, the entire complex number takes up a whopping minimum of 29 machine words: 3 words for the wrapper complex number, 2 times 3 words for each of the ratnums, and 4 times 5 words for the bignums inside the ratnums.
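
Spelled out, that worst case breaks down roughly like this:

  1 cplxnum                     3 words (header + real + imaginary slots)
  2 ratnums                 2 × 3 words (header + numerator + denominator slots)
  4 bignums                 4 × 5 words (see below)
  total                        29 words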

We'll now look into why bignums require at least 5 words.

Bignums

Initially I tried to represent bignums as a kind of opaque bytevector, much like how flonums are stored. Memory-wise this is the best representation as it has no unnecessary overhead: only 2 extra words, one for the header and one for the sign. On a 32-bit machine it would look like this:
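
Roughly sketched, with one cell per machine word and the least significant limb first:

  +----------------------+------+---------+---------+-----
  | header (type + size) | sign | digit 0 | digit 1 | ...
  +----------------------+------+---------+---------+-----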

This representation is somewhat wasteful, because it uses a full machine word to represent the sign, which is only one bit of information! This is done to keep the bignum digits word-aligned, which is important for performance. The sign could be shoved into the header if we really wanted to be frugal on memory, but doing so would also complicate type detection. Alternatively, we could store the bignum's digits in two's complement form so the sign is simply the high bit of the top digit, but that complicates several algorithms.

Regarding the "bytevector" part: because the limbs are word-aligned, it makes more sense to represent the size in words rather than bytes. Unfortunately, there's no way to do this with the current data representation of CHICKEN. This was the direct cause of the following bug: someone tried to represent the largest known prime number in CHICKEN, and it failed to behave correctly because we didn't have enough header bits to represent its size. This was just for fun, so no harm was done, but if someone ever actually needs such numbers in practice, they're out of luck. One of these days we're going to have to tackle this problem...

Performance takes a hit

When I first integrated the "numbers" egg into CHICKEN 5, I also did some benchmarking. It turned out that my initial version made some benchmarks up to 8 times slower, though on average it would slow things down by a factor of 2. As pointed out by Alex Shinn and Felix Winkelmann, the reason it impacts some benchmarks so badly has to do with allocation.

Let's compile a loop going from zero to n, like so:

;; Very silly code, calculates 100 * 100 in a stupid way
(let lp ((i 0))
  (if (= i 100)
      (* i i)
      (lp (add1 i))))

Originally, in CHICKEN 4 without the full numeric tower, the compiled C code looked like this:

/* lp in k207 in k204 in k201 */
static void C_fcall f_220(C_word t0,C_word t1,C_word t2){
  C_word tmp;
  C_word t3;
  C_word t4;
  C_word t5;
  C_word t6;
  C_word *a;
loop:
  C_check_for_interrupt;
  if(!C_demand(C_calculate_demand(4, 0, 2))) {
    C_save_and_reclaim_args((void *)trf_220, 3, t0, t1, t2);
  }
  a=C_alloc(4); /* Allocate flonum for overflow situation */
  if(C_truep(C_i_nequalp(t2, C_fix(100)))) {
    t3=t1;
    {
      C_word av2[2];
      av2[0] = t3;
      av2[1] = C_a_i_times(&a, 2, t2, t2);
      ((C_proc)(void*)(*((C_word*)t3+1)))(2, av2);
    }
  } else {
    t3 = C_a_i_plus(&a, 2, t2, C_fix(1));
    C_trace("test.scm:4: lp");
    t5=t1;
    t6=t3;
    t1=t5;
    t2=t6;
    goto loop;
  }
}

It's not much to look at, but this is very close to optimal code: It's a C loop, which allocates a fixed size of memory from the stack/nursery into which it can write the result of + or *, in case they would overflow.

The compiler knows how to compile + and * to "inlineable" C functions. Many of the most performance-critical functions are built into the compiler like that. But because the compiler (currently) doesn't perform range analysis, it's not smart enough to figure out that none of the operators in this example can cause an overflow. This bites us especially hard when introducing bignums: because we need to assume that any operator may overflow, we must be able to allocate a bignum. And since the result of these operators may be a bignum, the next iteration of the loop has to assume its argument may be a bignum too. Adding two bignums of unknown sizes together results in another bignum of unknown size.

Because of the above, we can't pre-allocate in a tight C loop. Instead, we must split our loop in two. This is needed to allow the garbage collector to kick in: if you'll recall from the garbage collector post, we need a continuation both for liveness analysis and as a place to jump back to after GC.

One part of our lp calls an allocating procedure, wrapping up the other part in a continuation:

/* k251 in lp in k223 in k220 in k217 in k214 (first part of our "lp") */
static void C_ccall f_253(C_word c, C_word *av) {
  C_word tmp;
  C_word t0 = av[0];
  C_word t1 = av[1];
  C_word t2;
  C_word *a;
  C_check_for_interrupt;
  if(!C_demand(C_calculate_demand(0, c, 2))) {
    C_save_and_reclaim((void *)f_253, 2, av);
  }
  C_trace("test.scm:6: lp");
  t2 = ((C_word*)((C_word*)t0)[2])[1];
  f_236(t2, ((C_word*)t0)[3], t1);
}

/* lp in k223 in k220 in k217 in k214 (second part of our "lp") */
static void C_fcall f_236(C_word t0, C_word t1, C_word t2){
  C_word tmp;
  C_word t3;
  C_word *a;
  C_check_for_interrupt;
  if(!C_demand(C_calculate_demand(4, 0, 3))) {
    C_save_and_reclaim_args((void *)trf_236, 3, t0, t1, t2);
  }
  a=C_alloc(4);
  if(C_truep(C_i_nequalp(t2, C_fix(100)))) {
    C_trace("test.scm:5: *");
    {
      C_word av2[4];
      av2[0] = C_SCHEME_UNDEFINED;
      av2[1] = t1;
      av2[2] = t2;
      av2[3] = t2;
      C_2_basic_times(4,av2);
    }
  } else {
    /* Allocate continuation of (add1 i), which is the first part (f_253) */
    t3=(*a = C_CLOSURE_TYPE|3, a[1] = (C_word)f_253,
        a[2] = ((C_word*)t0)[2], a[3] = t1, tmp = (C_word)a, a += 4, tmp);
    C_trace("test.scm:6: add1");
    {
      C_word av2[4];
      av2[0] = C_SCHEME_UNDEFINED;
      av2[1] = t3;
      av2[2] = t2;
      av2[3] = C_fix(1);
      C_2_basic_plus(4,av2);
    }
  }
}

As you can imagine, allocating a continuation on the stack every time is pretty heavy, and function calling isn't as cheap as a goto loop either. The first part of the loop doesn't even do anything. It just acts as a continuation to be received by the plus call. You can probably imagine how terrible the code would look if we compiled something like (/ (* (+ a b) (+ c d)) 2). That's at least 4 continuations, instead of a few simple statements.

For this reason, my patch was rejected (and rightly so!). The message was clear: code that doesn't use bignums should never pay a performance penalty just because bignums exist.

In order to fix this situation, I had to come up with a radical change to how bignums worked, or face the possibility that a full numeric tower would not make it into CHICKEN 5.

Adding a new "scratch space" memory region

If we want to make the extended numeric operators as fast as the originals, we must be able to inline them. This prevents garbage collection, because we don't get the continuation for an inlined call. But what if they allocate some unknown quantity of memory? We can't allocate on the stack or heap, because that could cause either to fill up, requiring a GC.

So, the obvious solution is to allocate these objects elsewhere. A separate memory space in which bignums can be stored. But what if that space fills up? Don't we need to initiate a GC then? But this is where we're in luck: bignums are not compound objects! They are huge slabs of opaque data, much like strings. Because they can't refer to other objects, we are dealing with a simplified garbage collection problem: only the objects pointing to a bignum need to be updated.

Unfortunately, finding all the live objects that point to a bignum would be quite difficult. Luckily, like many problems in computer science, this can be easily solved by adding another level of indirection. While we're calling inline functions, we can allocate small objects on the stack, which will remain there, never moving until the next GC. We can use this to our advantage: whenever a bignum is needed, we allocate a fixed-size wrapper object on the stack. This object points into the scratch space, where the actual bignum data lives. See the following diagram:

In the diagram, we have a bignum representing the number 24386824307922, which we've put in a list and a vector, and we also have the rational number 1/24386824307922, which refers to the same bignum in its denominator. All these objects can be on the stack or on the heap. We have no control over them; the user can set any object slot to hold the bignum. We do have control over the wrapper object, and only the wrapper object directly points into scratch space. Because bignums are opaque objects in Scheme, the wrapper is invisible. Thus, user code is (in principle) unable to access the wrapper's data slot, so there will be no direct references to the bignum data portion. This means we're free to move it around without updating anything but the wrapper object's slot.

Note that in the scratch space, we also store a back-pointer to the wrapper object's slot. This allows us to update the wrapper object after moving its matching bignum data blob around. This way, we can reallocate the scratch space when more space is needed.
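
In rough ASCII form, the indirection described above looks something like this (a sketch, not an exact memory dump):

  stack / heap (moved by the GC)            scratch space (reallocated as needed)

  list, vector, ratnum slots, ...           +----------------------------------+
             |                              | back-pointer to wrapper's slot   |
             v                              +----------------------------------+
  +------------------------+   points to    | bignum data: header, sign,       |
  | wrapper: header | slot |--------------->| digits...                        |
  +------------------------+                +----------------------------------+

When the scratch space is reallocated and a data blob moves, the back-pointer tells us which wrapper slot to update, and nothing else needs to change.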

Some of the native functions like Karatsuba multiplication or Burnikel-Ziegler division generate many temporary values. All such hand-written code has been tuned to erase a bignum's back-pointer when that bignum is no longer needed. It makes the code quite a bit hairier, but it allows (limited) garbage collection to be done when reallocating the scratch space.

With this setup, all numeric operations only need to allocate memory to hold a bignum wrapper object. This is a fixed size, much like in CHICKEN 4, and it means numeric operations can once again be inlined!

Oh, and why does a bignum take up 5 words? Well, sometimes we know that a procedure receives 2 fixnums. In that case, we can pre-allocate a bignum for the case when it overflows. Because we know its maximum size in advance, there's no need for the scratch space; we can just allocate it in the nursery. For uniformity reasons, such a bignum still requires a wrapper object (2 words) and a bignum data blob (3 words: its header, the sign and one limb). This sounds complicated, but it shortens the specialised code for two fixnums, and allocating only in the nursery is also faster.

Some parting thoughts

Adding full numeric tower support has been extremely educational for me. I'm not really a math person, but having a clear goal like this motivated me to dive deep into the literature. Overall, I'm happy with how it turned out, but there is always room for improvement.

For example, instead of doing everything in C it would (of course!) be preferable to do it all in Scheme. Unfortunately, CHICKEN's design makes it hard to do this in an efficient way: it's currently impossible to export Scheme procedures that can be called inline (i.e., non-CPS calls) without copying their full code into the call site. If we can find a way, it would be possible to do 90% of the algorithms in Scheme. The principle on which this would be based can be found in an old paper about the Lucid Common Lisp implementation. Basically, you implement a handful of primitives natively, and everything else can be done in Lisp. For example, SBCL is implemented this way too.

As far as I can tell, of the more popular Scheme implementations, Gambit is the only one that actually does this. I've been very impressed with Gambit in general. Besides having pretty readable Scheme code for its bignum algorithms, Gambit has some superior bignum algorithms; most get close to (and in rare cases even surpass) GMP performance. This is mostly due to the hard work of Bradley Lucier, a numerical analyst who has also provided useful feedback on some of my work on the numbers egg, and on this series of blog posts. He really knows this stuff! Most other Scheme implementations do this in C and are still pretty slow due to the algorithms they use, unless of course they use GMP.

In CHICKEN, there is a lot of room for optimisations. But I also think we shouldn't try to implement every algorithm under the sun. Things should generally be fast enough to serve the most common cases. Typical code doesn't use bignums, and if it does it's only small bignums (for instance, when using 64-bit integers with the FFI), which is why I think we should optimise for these cases. For example, my implementations of Karatsuba and Burnikel-Ziegler aren't great, so if anyone feels like having a stab at improving these things we already have (or simply replacing them with a better algorithm), please do!


In this instalment of the blog series, we'll take a look at how string->number and number->string are implemented for bignums.

Performing base conversions

Performing calculations is all nice and useful, but you eventually want to print the results back to the user. And of course the user needs a way to enter such large numbers into the system. So, converting between numbers and strings is an essential operation. The Scheme standard requires support for conversion between bases 2, 8, 10 and 16, but many practical implementations support conversion between arbitrary bases, and why not? It doesn't really require more effort.

Converting strings to numbers

Let's start with the simpler of the two directions: string->number. The naive way of converting a string in base n to a number is to scan the string from left to right (high to low), multiplying the result by n and then adding each digit to it:

/* "result" is a pre-allocated, zeroed out bignum of the right size */
while (*str != '\0') /* Assuming NUL-terminated string */
{
  int digit = hex_char_to_digit((int)*str++);
  bignum_destructive_scale_up(result, radix);
  bignum_destructive_add(result, digit);
}

This is very simple and elegant, but also quite slow. The Scheme48 code also checks for invalid characters in this loop, while in CHICKEN, the reader performs this check. So, when converting a digit stream to a bignum, we already know the digit stream contains only valid digits.

A simple improvement can be made to this algorithm: we can avoid traversing the entire bignum for every string digit. Instead, we collect multiple digits in a register until we fill up a halfword. Then, we scale up the bignum, adding the collected halfword:

do {
  C_word big_digit = 0;  /* Collected digit */
  C_word factor = radix; /* Multiplication factor for bignum */

  /* Keep collecting from str while factor fits a half-digit */
  while (str < str_end && C_fitsinbignumhalfdigitp(factor)) {
    str_digit = hex_char_to_digit((int)*str++);
    factor *= radix;
    big_digit = radix * big_digit + str_digit;
  }

  /* Scaling up with carry avoids traversing bignum twice */
  big_digit = bignum_digits_destructive_scale_up_with_carry(
                digits, last_digit, factor / radix, big_digit);

  if (big_digit) { /* If there was a carry, increment size of bignum */
    (*last_digit++) = big_digit;
  }
} while (str < str_end);

Remember the previous post in this series? Now you know where the design behind bignum_digits_destructive_scale_up_with_carry comes from: we can scale up the bignum with an initial "carry" value that's our collected digit. The return value is the resulting carry (if any), so we know when to increase the bignum's size by moving the last_digit pointer. This pointer makes it easier to detect the final length of the bignum, which can be hard to predict precisely. We can't predict the exact size, but we can calculate the maximum size. Because we allocate this maximum size, this moving of the end pointer is safe.
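
To make the "maximum size" remark concrete, here's one way such a pessimistic estimate can work (an illustration, not necessarily the exact formula CHICKEN uses): a 100-character decimal string carries at most 100 × 4 = 400 bits if we round log₂10 up to 4 whole bits per character, so on a 32-bit machine at most ⌈400 / 32⌉ = 13 bignum digits are needed, even though the true answer is ⌈100 × log₂10 / 32⌉ = 11. We over-allocate a little, but never under-allocate.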

If the string's base is a power of two, we can perform an even better optimisation: We don't need to multiply or add to the bignum, we can just write straight to the bignum's digits!

int radix_shift = C_ilen(radix) - 1;  /* Integer length (the power of two) */
C_uword big_digit = 0;       /* The current bignum digit being constructed */
int n = 0;                /* The number of bits read so far into big_digit */

/* Read from least to most significant digit.  This is much easier! */
while (str_end > str_start) {
  str_digit = hex_char_to_digit((int)*--str_end);

  big_digit |= (C_uword)str_digit << n;
  n += radix_shift;                             /* Processed n bits so far */

  if (n >= C_BIGNUM_DIGIT_LENGTH) {             /* Filled up the digit? */
    n -= C_BIGNUM_DIGIT_LENGTH;                 /* Number of remainder bits */
    *digits++ = big_digit;
    big_digit = str_digit >> (radix_shift - n); /* Keep only the remainder */
  }
}
/* If radix isn't an exact divisor of digit length, write final remainder */
if (n > 0) *digits++ = big_digit;

From my benchmarks it looks like CHICKEN's string->number implementation is among the fastest of the popular Scheme implementations for power of two bases, due to this bit banging loop.

Converting numbers to strings

The naive way of converting a number to a string in base n is to do the opposite of converting a string in base n to a number: we repeatedly divide by the target base and prepend each remainder, as a digit character, to the string.

char *characters = "0123456789abcdef";

/* This fills the "buf" array *back to front*, so index starts at the
 * end of "buf".  counter represents # of characters written.
 */
do {
  digit = bignum_destructive_scale_down(working_copy, radix);
  *index-- = characters[digit];

  /* If we reached the current string's length, reallocate in
   * increments of BIGNUM_STR_BLOCK_SIZE.
   */
  if (++counter == len) {
   char *newbuf = C_malloc(len + BIGNUM_STR_BLOCK_SIZE);
   if (newbuf == NULL) return ERR;

   C_memcpy(newbuf + BIGNUM_STR_BLOCK_SIZE, buf, len);
   C_free(buf);
   buf = newbuf;
   index = newbuf + BIGNUM_STR_BLOCK_SIZE - 1;
   len += BIGNUM_STR_BLOCK_SIZE;
  }
} while(bignum_length(working_copy) > 0);

This is the original version as provided by Scheme48. Again, it's very short and clean. It operates on a copy of the bignum, which it destructively scales down by dividing it by radix. The remainder digit is written to the string after conversion to a character. Many implementations use this algorithm, but it can be improved pretty easily, in basically the same way we improved the reverse operation. Instead of dividing by the radix on every loop iteration, you chop off a big lump at once: the remainder of dividing the bignum by a large power of the radix. Then you divide that lump down in a small loop, emitting string digits until it hits zero, and repeat until the bignum itself is exhausted.

Another improvement over the Scheme48 code is that you can pre-calculate a (pessimistic) upper bound on the number of digits, so you can avoid the reallocation (which implies a memory copy). For powers of two, this can be done precisely. For other radixes you can shorten the buffer only once at the end, and only if it turns out to be necessary.
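
As an illustration of such a bound (again, not necessarily the exact formula used): printing a 512-bit bignum in base 10, every decimal character is worth at least 3 bits if we round log₂10 down, so at most ⌈512 / 3⌉ = 171 characters (plus one for a possible minus sign) can ever be produced, while the true length is ⌈512 / log₂10⌉ = 155. For a power-of-two radix the rounding disappears and the count is exact.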

This is really very simple, except for the finishing up part, where we shorten the buffer:

int steps;
C_uword base;         /* This is the "lump" we cut off every time (divisor) */
C_uword *scan = start + C_bignum_size(bignum); /* Start scanning at the end */

/* Calculate the largest power of radix (string base) that fits a halfdigit.
 * If radix is 10, steps = log10(2^halfdigit_bits), base = 10^steps
 */
for(steps = 0, base = radix; C_fitsinbignumhalfdigitp(base); base *= radix)
  steps++;

base /= radix; /* Back down: we always overshoot in the loop by one step */

while (scan > start) {
  /* Divide by base. This chops "steps" string digits off of the bignum */
  big_digit = bignum_digits_destructive_scale_down(start, scan, base);

  if (*(scan-1) == 0) scan--; /* Adjust if we exhausted the highest digit */

  for(i = 0; i < steps && index >= buf; ++i) {      /* Emit string digits */
    C_uword tmp = big_digit / radix;
    *index-- = characters[big_digit - (tmp*radix)]; /* big_digit % radix */
    big_digit = tmp;
  }
}

/* Move index onto first nonzero digit.  We're writing a bignum
   here: it can't consist of only zeroes. */
while(*++index == '0');

if (negp) *--index = '-';

/* Shorten with distance between start and index. */
if (buf != index) {
  i = C_header_size(string) - (index - buf);
  C_memmove(buf, index, i); /* Move start of number to beginning. */
  C_block_header(string) = C_STRING_TYPE | i; /* Mutate strlength. */
}

Finally, if the radix is a power of two, we can do a straight bit-to-bit extraction like we did with string->number:

int radix_shift = C_ilen(radix) - 1;    /* Integer length (the power of two) */
int radix_mask = radix - 1;              /* Bitmask of N-1 ones (radix = 2ᴺ) */

/* Again, we go from least significant to most significant digit */
while (scan < end) {
  C_uword big_digit = *scan++;
  int big_digit_len = C_BIGNUM_DIGIT_LENGTH;

  while(big_digit_len > 0 && index >= buf) {
    int radix_digit = big_digit & radix_mask;    /* Extract one string digit */
    *index-- = characters[radix_digit]; /* Write it (as character) to string */
    big_digit >>= radix_shift;                            /* Drop this digit */
    big_digit_len -= radix_shift;
  }
}

Unfortunately, there are some caveats that make this slightly trickier than you would guess. The above code is simplified, and only works for some radixes. If your radix is 2ⁿ and your bignum digits are 2ᵐ bits wide, this code only works when n evenly divides the digit length; otherwise, you'll need to take care of string digits that overlap two bignum digits. Octal numbers are the biggest problem here, because they're 3 bits per string digit and the bignum digit sizes of 32 or 64 bits don't divide cleanly by 3.

Getting this right complicates the algorithm enough to make it slightly too hairy to present here (there's some more shifting and if checks involved). If you're interested how to handle this, you can always study the CHICKEN sources.

Divide and conquer

Because of the many divisions, number->string is much slower than string->number. Luckily, we can speed up the former by relying once again on a recursive divide and conquer style algorithm.

This requires you to know the string's expected size in the target base; you then divide the number by the radix raised to half that size. For example, if you wish to convert the number 12345678 to a decimal string, you can decide to split it in two. If you have a perfect guess of the string's length (which is 8), you can split the expected string into two halves by dividing the number by 10⁴, or 10000, giving the quotient 1234 and the remainder 5678. These can recursively be converted to a string and finally appended together. Note that if you have a pessimistic upper limit of the final string length, it'll be slower, but will still produce a correct result. The code for this is quite straightforward:

(define (integer->string/recursive n base expected-string-size)
  (let*-values (((halfsize) (fxshr (fx+ expected-string-size 1) 1))
                ((b^M/2) (integer-power base halfsize))
                ((hi lo) (quotient&remainder n b^M/2))
                ((strhi) (number->string hi base))
                ((strlo) (number->string (abs lo) base)))
    (string-append strhi
                   ;; Fix up any leading zeroes that were stripped from strlo
                   (make-string (fx- halfsize (string-length strlo)) #\0)
                   strlo)))

Because the number 120034, when split in two, generates 120 and 34, we need the make-string call to add back the leading zeroes, otherwise we would get 12034 as the output. This can be omitted if you have a more low-level number->string implementation which doesn't truncate leading zeroes.
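
To make that concrete, calling the helper above on that example would behave roughly like this (a sketch, not a verbatim REPL session):

(integer->string/recursive 120034 10 6)
;; halfsize = 3, so we divide by 10³: hi = 120, lo = 34
;; => "120" ++ "0" ++ "34" = "120034" (without the padding we'd get "12034")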

While I was researching this, I found out about a technique called the "scaled remainder tree" algorithm. This algorithm is supposedly even faster than the simple recursive algorithm I just showed you. Unfortunately, I was unable to wrap my head around it. Maybe you will have better luck!

Reading list

There doesn't seem to be that much information on how to efficiently perform base conversion, even though there are quite a few clever implementation techniques out there. If you spend the time searching, you'll be sure to find some gems.

  • The y-cruncher page on radix conversion is full of ideas. Y-cruncher is a program to calculate digits of pi by efficiently using multiple CPU cores. This page has several algorithms, including recursive string splitting and scaled remainder tree. Unfortunately, the program itself is proprietary so you can't study its implementations of these algorithms.
  • Modern Computer Arithmetic, which has quickly become my favourite bignum textbook, has an interesting alternative string to number technique based on iterative multiplication of the input string. I still want to try that out some day. It also hints at scaled remainder tree for number to string conversion, but doesn't explain how it works.
  • Division-Free Binary-to-Decimal Conversion by Cyril Bouvier and Paul Zimmermann explains a number to string conversion based on the scaled remainder tree. Unfortunately, this paper only confused me.
  • It looks like Gambit has a recursive divide-and-conquer implementation of string->number that's slightly different from ours. Their recursive number->string implementation is rather interesting too.

Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 3.0 License. All code fragments on this site are hereby put in the public domain.