
Now that we have covered the most important algorithms, it's time to take a look at the internal data representation of extended numerals. This will be the final part in this series of posts.

Ratnums and cplxnums

Recall from my data representations article that fixnums are represented as immediate values, directly within a machine word. Flonums are represented in boxed form, by putting them in an opaque bytevector-like structure.

The data representations of complex numbers and rational numbers are pretty simple. Each has its own type tag, and both contain two slots: the numerator and denominator in the case of rational numbers, and the real and imaginary parts in the case of complex numbers.

As you can see in the above diagram, the representations of ratnums and cplxnums are very similar. In the example, the slots contain just fixnums. Rational numbers are the simplest here: they can only contain integers (bignums or fixnums). Complex numbers can consist of any number type except other complex numbers, but the exactness of the real and imaginary components must match. This means you can't have 1.5+2/3i, for example.
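
To make this concrete, here is a sketch of how these rules look from the REPL (illustrative; the exact printed output may vary by CHICKEN version):

2/3+1/2i     ; => 2/3+1/2i    valid: both parts exact
1.5+2.5i     ; => 1.5+2.5i    valid: both parts inexact
;; 1.5+2/3i is rejected by the reader: mixed exactness
(/ 1 3)      ; => 1/3         exact division yields a ratnum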

In its most complex (haha!) form, a complex number contains a rational number in both the real and the imaginary parts, and these rational numbers both contain bignums as their numerator and denominator. In this situation, the entire complex number takes up a whopping minimum of 29 machine words: 3 words for the wrapper complex number, 3 words for each of the two ratnums, and 5 words for each of the four bignums inside the ratnums.

We'll now look into why bignums require at least 5 words.

Bignums

Initially I tried to represent bignums as a kind of opaque bytevector, much like how flonums are stored. Memory-wise this is the best representation, as it carries no unnecessary overhead: only 2 extra words, one for the header and one for the sign. On a 32-bit machine it would look like this:

This representation is somewhat wasteful, because it uses a full machine word to represent the sign, which is only one bit of information! This is done to keep the bignum digits word-aligned, which is important for performance. The sign could be shoved into the header if we really wanted to be frugal with memory, but doing so would also complicate type detection. Alternatively, we could store the bignum's digits in two's complement form so the sign is simply the high bit of the top digit, but that complicates several algorithms.

Regarding the "bytevector" part: because the limbs are word-aligned, it makes more sense to represent the size in words rather than bytes. Unfortunately, there's no way to do this with the current data representation of CHICKEN. This was the direct cause of the following bug: someone tried to represent the largest known prime number in CHICKEN, and it failed to behave correctly because we didn't have enough header bits to represent its size. This was just for fun, so no harm was done, but if someone ever actually needs such numbers in practice, they're out of luck. One of these days we're going to have to tackle this problem...

Performance takes a hit

When I first integrated the "numbers" egg into CHICKEN 5, I also did some benchmarking. It turned out that my initial version made some benchmarks up to 8 times slower, though on average it would slow things down by a factor of 2. As pointed out by Alex Shinn and Felix Winkelmann, the reason it impacts some benchmarks so badly has to do with allocation.

Let's compile a loop going from zero to n, like so:

;; Very silly code, calculates 100 * 100 in a stupid way
(let lp ((i 0))
  (if (= i 100)
      (* i i)
      (lp (add1 i))))

Originally, in CHICKEN 4 without the full numeric tower, the compiled C code looked like this:

/* lp in k207 in k204 in k201 */
static void C_fcall f_220(C_word t0,C_word t1,C_word t2){
  C_word tmp;
  C_word t3;
  C_word t4;
  C_word t5;
  C_word t6;
  C_word *a;
loop:
  C_check_for_interrupt;
  if(!C_demand(C_calculate_demand(4, 0, 2))) {
    C_save_and_reclaim_args((void *)trf_220, 3, t0, t1, t2);
  }
  a=C_alloc(4); /* Allocate flonum for overflow situation */
  if(C_truep(C_i_nequalp(t2, C_fix(100)))) {
    t3=t1;
    {
      C_word av2[2];
      av2[0] = t3;
      av2[1] = C_a_i_times(&a, 2, t2, t2);
      ((C_proc)(void*)(*((C_word*)t3+1)))(2, av2);
    }
  } else {
    t3 = C_a_i_plus(&a, 2, t2, C_fix(1));
    C_trace("test.scm:4: lp");
    t5=t1;
    t6=t3;
    t1=t5;
    t2=t6;
    goto loop;
  }
}

It's not much to look at, but this is very close to optimal code: It's a C loop, which allocates a fixed size of memory from the stack/nursery into which it can write the result of + or *, in case they would overflow.

The compiler knows how to compile + and * to "inlineable" C functions. Many of the most performance-critical functions are built into the compiler like that. But because the compiler (currently) doesn't perform range analysis, it's not smart enough to figure out that none of the operators in this example can cause an overflow. This bites us especially hard when introducing bignums: because we need to assume that any operator may overflow, we must be able to allocate a bignum. And since the result of these operators may be a bignum, the next iteration of the loop may receive bignum arguments. Adding two bignums of unknown sizes together results in another bignum of unknown size.
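
To see this in action, consider what happens at the fixnum boundary. This is a sketch assuming a 64-bit build of CHICKEN 5, where fixnums are 63-bit signed values:

(import (chicken fixnum))   ; for most-positive-fixnum

(+ most-positive-fixnum 1)  ; => 4611686018427387904, now a bignum
(* most-positive-fixnum 2)  ; => 9223372036854775806, also a bignum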

Because of the above, we can't pre-allocate in a tight C loop. Instead, we must split our loop in two. This is needed to allow the garbage collector to kick in: if you'll recall from the garbage collector post, we need a continuation both for liveness analysis and as a place to jump back to after GC.

One part of our lp calls an allocating procedure, wrapping up the other part in a continuation:

/* k251 in lp in k223 in k220 in k217 in k214 (first part of our "lp") */
static void C_ccall f_253(C_word c, C_word *av) {
  C_word tmp;
  C_word t0 = av[0];
  C_word t1 = av[1];
  C_word t2;
  C_word *a;
  C_check_for_interrupt;
  if(!C_demand(C_calculate_demand(0, c, 2))) {
    C_save_and_reclaim((void *)f_253, 2, av);
  }
  C_trace("test.scm:6: lp");
  t2 = ((C_word*)((C_word*)t0)[2])[1];
  f_236(t2, ((C_word*)t0)[3], t1);
}

/* lp in k223 in k220 in k217 in k214 (second part of our "lp") */
static void C_fcall f_236(C_word t0, C_word t1, C_word t2){
  C_word tmp;
  C_word t3;
  C_word *a;
  C_check_for_interrupt;
  if(!C_demand(C_calculate_demand(4, 0, 3))) {
    C_save_and_reclaim_args((void *)trf_236, 3, t0, t1, t2);
  }
  a=C_alloc(4);
  if(C_truep(C_i_nequalp(t2, C_fix(100)))) {
    C_trace("test.scm:5: *");
    {
      C_word av2[4];
      av2[0] = C_SCHEME_UNDEFINED;
      av2[1] = t1;
      av2[2] = t2;
      av2[3] = t2;
      C_2_basic_times(4,av2);
    }
  } else {
    /* Allocate continuation of (add1 i), which is the first part (f_253) */
    t3=(*a = C_CLOSURE_TYPE|3, a[1] = (C_word)f_253,
        a[2] = ((C_word*)t0)[2], a[3] = t1, tmp = (C_word)a, a += 4, tmp);
    C_trace("test.scm:6: add1");
    {
      C_word av2[4];
      av2[0] = C_SCHEME_UNDEFINED;
      av2[1] = t3;
      av2[2] = t2;
      av2[3] = C_fix(1);
      C_2_basic_plus(4,av2);
    }
  }
}

As you can imagine, allocating a continuation on the stack every time is pretty heavy, and function calling isn't as cheap as a goto loop either. The first part of the loop doesn't even do anything. It just acts as a continuation to be received by the plus call. You can probably imagine how terrible the code would look if we compiled something like (/ (* (+ a b) (+ c d)) 2). That's at least 4 continuations, instead of a few simple statements.
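
To get a feel for this, here is roughly what the CPS conversion of that expression looks like, written as pseudo-Scheme (the +/k style operators are made up for illustration; each takes an explicit continuation, and k is the continuation of the whole expression):

(+/k a b (lambda (t1)
  (+/k c d (lambda (t2)
    (*/k t1 t2 (lambda (t3)
      (//k t3 2 k)))))))

Three freshly allocated lambdas plus the final call to k: four continuations where direct-style code would need none.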

For this reason, my patch was rejected (and rightly so!). The message was clear: code that doesn't use bignums should never pay a performance penalty just because bignums exist.

In order to fix this situation, I had to come up with a radical change to how bignums worked, or face the possibility that a full numeric tower would not make it into CHICKEN 5.

Adding a new "scratch space" memory region

If we want to make the extended numeric operators as fast as the originals, we must be able to inline them. This prevents garbage collection, because we don't get the continuation for an inlined call. But what if they allocate some unknown quantity of memory? We can't allocate on the stack or heap, because that could cause either to fill up, requiring a GC.

So, the obvious solution is to allocate these objects elsewhere. A separate memory space in which bignums can be stored. But what if that space fills up? Don't we need to initiate a GC then? But this is where we're in luck: bignums are not compound objects! They are huge slabs of opaque data, much like strings. Because they can't refer to other objects, we are dealing with a simplified garbage collection problem: only the objects pointing to a bignum need to be updated.

Unfortunately, finding all the live objects that point to a bignum would be quite difficult. Luckily, like many problems in computer science, this can be easily solved by adding another level of indirection. While we're calling inline functions, we can allocate small objects on the stack, which will remain there, never moving until the next GC. We can use this to our advantage: whenever a bignum is needed, we allocate a fixed-size wrapper object on the stack. This object points into the scratch space, where the actual bignum data lives. See the following diagram:

In the diagram, we have a bignum representing the number 24386824307922, which we've put in a list and a vector, and we also have the rational number 1/24386824307922, which refers to the same bignum in its denominator. All these objects can be on the stack or on the heap. We have no control over them; the user can set any object slot to hold the bignum. We do have control over the wrapper object, and only the wrapper object directly points into scratch space. Because bignums are opaque objects in Scheme, the wrapper is invisible. Thus, user code is (in principle) unable to access the wrapper's data slot, so there will be no direct references to the bignum data portion. This means we're free to move it around without updating anything but the wrapper object's slot.

Note that in the scratch space, we also store a back-pointer to the wrapper object's slot. This allows us to update the wrapper object after moving its matching bignum data blob around. This way, we can reallocate the scratch space when more space is needed.

Some of the native functions like Karatsuba multiplication or Burnikel-Ziegler division generate many temporary values. All such hand-written code has been tuned to erase a bignum's back-pointer when that bignum is no longer needed. It makes the code quite a bit hairier, but it allows (limited) garbage collection to be done when reallocating the scratch space.

With this setup, all numeric operations only need to allocate memory to hold a bignum wrapper object. This is a fixed size, much like in CHICKEN 4, and it means numeric operations can once again be inlined!

Oh, and why does a bignum take up at least 5 words? Well, sometimes we know that a procedure receives 2 fixnums. In that case, we can pre-allocate a bignum for the case when it overflows. Because we know its maximum size in advance, there's no need for the scratch space; we can just allocate it in the nursery. For uniformity reasons, such a bignum still requires a wrapper object (2 words) and a bignum data blob (3 words: its header, the sign and one limb). This sounds complicated, but it shortens the specialised code for two fixnums, and allocating only in the nursery is also faster.

Some parting thoughts

Adding full numeric tower support has been extremely educational for me. I'm not really a math person, but having a clear goal like this motivated me to dive deep into the literature. Overall, I'm happy with how it turned out, but there is always room for improvement.

For example, instead of doing everything in C it would (of course!) be preferable to do it all in Scheme. Unfortunately, CHICKEN's design makes it hard to do this in an efficient way: it's currently impossible to export Scheme procedures that can be called inline (i.e., non-CPS calls) without copying their full code into the call site. If we can find a way, it would be possible to do 90% of the algorithms in Scheme. The principle on which this would be based can be found in an old paper about the Lucid Common Lisp implementation. Basically, you implement a handful of primitives natively, and everything else can be done in Lisp. For example, SBCL is implemented this way too.

As far as I can tell, of the more popular Scheme implementations, Gambit is the only one that actually does this. I've been very impressed with Gambit in general. Besides having pretty readable Scheme code for its bignum algorithms, Gambit has some superior ones, most of which get close to (and in rare cases even surpass) GMP performance. This is mostly due to the hard work of Bradley Lucier, a numerical analyst who has also provided useful feedback on some of my work on the numbers egg, and on this series of blog posts. He really knows this stuff! Most other Scheme implementations are in C and still pretty slow due to the algorithms they use, unless of course they use GMP.

In CHICKEN, there is a lot of room for optimisations. But I also think we shouldn't try to implement every algorithm under the sun. Things should generally be fast enough to serve the most common cases. Typical code doesn't use bignums, and if it does it's only small bignums (for instance, when using 64-bit integers with the FFI), which is why I think we should optimise for these cases. For example, my implementations of Karatsuba and Burnikel-Ziegler aren't great, so if anyone feels like having a stab at improving these things we already have (or simply replacing them with a better algorithm), please do!


In this instalment of the blog series, we'll take a look at how string->number and number->string are implemented for bignums.

Performing base conversions

Performing calculations is all nice and useful, but you eventually want to print the results back to the user. And of course the user needs a way to enter such large numbers into the system. So, converting between numbers and strings is an essential operation. The Scheme standard requires support for conversion between bases 2, 8, 10 and 16, but many practical implementations support conversion between arbitrary bases, and why not? It doesn't really require more effort.
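
For example, using the bases required by the standard:

(string->number "ff" 16)       ; => 255
(string->number "11111111" 2)  ; => 255
(number->string 255 8)         ; => "377"
(number->string 255 10)        ; => "255"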

Converting strings to numbers

Let's start with the simpler of the two directions: string->number. The naive way of converting a string in base n to a number is to scan the string from left to right (high to low), multiplying the intermediate result by n and adding each digit to it:

/* "result" is a pre-allocated, zeroed out bignum of the right size */
while (*str != '\0') /* Assuming NUL-terminated string */
{
  int digit = hex_char_to_digit((int)*str++);
  bignum_destructive_scale_up(result, radix);
  bignum_destructive_add(result, digit);
}

This is very simple and elegant, but also quite slow. The Scheme48 code also checks for invalid characters in this loop, while in CHICKEN, the reader performs this check. So, when converting a digit stream to a bignum, we already know the digit stream contains only valid digits.

A simple improvement can be made to this algorithm: we can avoid traversing the entire bignum for every string digit. Instead, we collect multiple digits in a register until we fill up a halfword. Then, we scale up the bignum, adding the collected halfword:

C_word str_digit;        /* Current digit character's value */

/* "digits" is the bignum's digit array; "last_digit" tracks its end */
do {
  C_word big_digit = 0;  /* Collected digit */
  C_word factor = radix; /* Multiplication factor for bignum */

  /* Keep collecting from str while factor fits a half-digit */
  while (str < str_end && C_fitsinbignumhalfdigitp(factor)) {
    str_digit = hex_char_to_digit((int)*str++);
    factor *= radix;
    big_digit = radix * big_digit + str_digit;
  }

  /* Scaling up with carry avoids traversing bignum twice */
  big_digit = bignum_digits_destructive_scale_up_with_carry(
                digits, last_digit, factor / radix, big_digit);

  if (big_digit) { /* If there was a carry, increment size of bignum */
    (*last_digit++) = big_digit;
  }
} while (str < str_end);

Remember the previous post in this series? Now you know where the design behind bignum_digits_destructive_scale_up_with_carry comes from: we can scale up the bignum with an initial "carry" value that's our collected digit. The return value is the resulting carry (if any), so we know when to increase the bignum's size by moving the last_digit pointer. This pointer makes it easy to determine the bignum's final length, which is hard to predict precisely; we can only calculate the maximum size. Because we allocate for that maximum size, moving the end pointer is safe.

If the string's base is a power of two, we can perform an even better optimisation: we don't need to multiply or add to the bignum at all; we can just write the bits straight into the bignum's digits!

int radix_shift = C_ilen(radix) - 1;  /* Integer length (the power of two) */
C_uword big_digit = 0;       /* The current bignum digit being constructed */
int n = 0;                /* The number of bits read so far into big_digit */

/* Read from least to most significant digit.  This is much easier! */
while (str_end > str_start) {
  str_digit = hex_char_to_digit((int)*--str_end);

  big_digit |= (C_uword)str_digit << n;
  n += radix_shift;                             /* Processed n bits so far */

  if (n >= C_BIGNUM_DIGIT_LENGTH) {             /* Filled up the digit? */
    n -= C_BIGNUM_DIGIT_LENGTH;                 /* Number of remainder bits */
    *digits++ = big_digit;
    big_digit = str_digit >> (radix_shift - n); /* Keep only the remainder */
  }
}
/* If radix isn't an exact divisor of digit length, write final remainder */
if (n > 0) *digits++ = big_digit;

From my benchmarks it looks like CHICKEN's string->number implementation is among the fastest of the popular Scheme implementations for power of two bases, due to this bit banging loop.

Converting numbers to strings

The naive way of converting a number to a string in base n is to do the opposite of converting a string in base n to a number: we repeatedly divide by the target base and prepend each remainder digit to the string.

char *characters = "0123456789abcdef";

/* This fills the "buf" array *back to front*, so index starts at the
 * end of "buf".  counter represents # of characters written.
 */
do {
  digit = bignum_destructive_scale_down(working_copy, radix);
  *index-- = characters[digit];

  /* If we reached the current string's length, reallocate in
   * increments of BIGNUM_STR_BLOCK_SIZE.
   */
  if (++counter == len) {
   char *newbuf = C_malloc(len + BIGNUM_STR_BLOCK_SIZE);
   if (newbuf == NULL) return ERR;

   C_memcpy(newbuf + BIGNUM_STR_BLOCK_SIZE, buf, len);
   C_free(buf);
   buf = newbuf;
   index = newbuf + BIGNUM_STR_BLOCK_SIZE - 1;
   len += BIGNUM_STR_BLOCK_SIZE;
  }
} while(bignum_length(working_copy) > 0);

This is the original version as provided by Scheme48. Again, it's very short and clean. It operates on a copy of the bignum, which it destructively scales down by dividing it by radix. The remainder digit is written to the string after conversion to a character. Many implementations use this algorithm, but it can be improved pretty easily, in basically the same way we improved the reverse operation. Instead of dividing by the radix on every loop iteration, you can chop off a big lump: the remainder of dividing by a large power of the radix. Then, you divide this lump by the radix in a small loop, emitting string digits until you hit zero, and repeat until the bignum itself is zero.

Another improvement over the Scheme48 code is that you can pre-calculate a (pessimistic) upper bound on the number of digits, so you can avoid the reallocation (which implies a memory copy). For powers of two, this can be done precisely. For other radixes you can shorten the buffer only once at the end, and only if it turns out to be necessary.
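
Here is a sketch of how such a pessimistic bound can be computed; the procedure name is illustrative, not CHICKEN's actual routine:

;; A number of b bits needs at most ceiling(b / log2(radix)) digits.
;; Using floor(log2(radix)) in place of the exact logarithm keeps us
;; in fixnum arithmetic and only ever overestimates, which is fine
;; for sizing a buffer.
(define (digit-count-upper-bound n radix)
  (let ((bits (integer-length (abs n)))
        (bits/digit (fx- (integer-length radix) 1))) ; floor(log2(radix))
    (fx+ 1 (fx/ bits bits/digit))))                  ; +1 rounds up

;; (digit-count-upper-bound 12345678 10) => 9; the actual length is 8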

This is really very simple, except for the finishing up part, where we shorten the buffer:

int steps;
C_uword base;         /* This is the "lump" we cut off every time (divisor) */
C_uword *scan = start + C_bignum_size(bignum); /* Start scanning at the end */

/* Calculate the largest power of radix (string base) that fits a halfdigit.
 * If radix is 10, steps = log10(2^halfdigit_bits), base = 10^steps
 */
for(steps = 0, base = radix; C_fitsinbignumhalfdigitp(base); base *= radix)
  steps++;

base /= radix; /* Back down: we always overshoot in the loop by one step */

while (scan > start) {
  /* Divide by base. This chops "steps" string digits off of the bignum */
  big_digit = bignum_digits_destructive_scale_down(start, scan, base);

  if (*(scan-1) == 0) scan--; /* Adjust if we exhausted the highest digit */

  for(i = 0; i < steps && index >= buf; ++i) {      /* Emit string digits */
    C_uword tmp = big_digit / radix;
    *index-- = characters[big_digit - (tmp*radix)]; /* big_digit % radix */
    big_digit = tmp;
  }
}

/* Move index onto first nonzero digit.  We're writing a bignum
   here: it can't consist of only zeroes. */
while(*++index == '0');

if (negp) *--index = '-';

/* Shorten with distance between start and index. */
if (buf != index) {
  i = C_header_size(string) - (index - buf);
  C_memmove(buf, index, i); /* Move start of number to beginning. */
  C_block_header(string) = C_STRING_TYPE | i; /* Mutate strlength. */
}

Finally, if the radix is a power of two, we can do a straight bit-to-bit extraction like we did with string->number:

int radix_shift = C_ilen(radix) - 1;    /* Integer length (the power of two) */
int radix_mask = radix - 1;              /* Bitmask of N-1 ones (radix = 2ᴺ) */

/* Again, we go from least significant to most significant digit */
while (scan < end) {
  C_uword big_digit = *scan++;
  int big_digit_len = C_BIGNUM_DIGIT_LENGTH;

  while(big_digit_len > 0 && index >= buf) {
    int radix_digit = big_digit & radix_mask;    /* Extract one string digit */
    *index-- = characters[radix_digit]; /* Write it (as character) to string */
    big_digit >>= radix_shift;                            /* Drop this digit */
    big_digit_len -= radix_shift;
  }
}

Unfortunately, there are some caveats that make this slightly trickier than you would guess. The above code is simplified, and only works for some radixes. If your radix is 2ⁿ and your bignum digits are m bits wide, then this code works only if m is a multiple of n. Otherwise, you'll need to take care of overlaps. Octal numbers are the biggest problem here, because they're 3 bits per string digit and the bignum digit sizes of 32 or 64 bits don't divide cleanly by 3.

Getting this right complicates the algorithm enough to make it slightly too hairy to present here (there's some more shifting and if checks involved). If you're interested in how to handle this, you can always study the CHICKEN sources.

Divide and conquer

Because of the many divisions, number->string is much slower than string->number. Luckily, we can speed up the former by relying once again on a recursive divide and conquer style algorithm.

This requires you to know the expected size of the final string in the target base; you then divide the number by the base raised to half that size. For example, if you wish to convert the number 12345678 to a decimal string, you can decide to split it in two. If you had a perfect guess of the string's length (which is 8), you can split the expected string in two halves by dividing the number by 10⁴, or 10000, giving us the quotient 1234 and the remainder 5678. These can recursively be converted to a string and finally appended together. Note that if you have a pessimistic upper limit of the final string length, it'll be slower, but will still produce a correct result. The code for this is quite straightforward:

(define (integer->string/recursive n base expected-string-size)
  (let*-values (((halfsize) (fxshr (fx+ expected-string-size 1) 1))
                ((b^M/2) (integer-power base halfsize))
                ((hi lo) (quotient&remainder n b^M/2))
                ((strhi) (number->string hi base))
                ((strlo) (number->string (abs lo) base)))
    (string-append strhi
                   ;; Fix up any leading zeroes that were stripped from strlo
                   (make-string (fx- halfsize (string-length strlo)) #\0)
                   strlo)))

Because the number 120034, when split in two, generates 120 and 34, we need the make-string call to add back the leading zeroes, otherwise we would get 12034 as the output. This can be omitted if you have a more low-level number->string implementation which doesn't truncate leading zeroes.
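
For instance, with the numbers from the example above (and assuming the integer-power helper is defined):

(integer->string/recursive 12345678 10 8) ; => "12345678"
(integer->string/recursive 120034 10 6)   ; => "120034", not "12034"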

While I was researching this, I found out about a technique called the "scaled remainder tree" algorithm. This algorithm is supposedly even faster than the simple recursive algorithm I just showed you. Unfortunately, I was unable to wrap my head around it. Maybe you will have better luck!

Reading list

There doesn't seem to be that much information on how to efficiently perform base conversion, even though there are quite a few clever implementation techniques out there. If you spend the time searching, you'll be sure to find some gems.

  • The y-cruncher page on radix conversion is full of ideas. Y-cruncher is a program to calculate digits of pi by efficiently using multiple CPU cores. This page has several algorithms, including recursive string splitting and scaled remainder tree. Unfortunately, the program itself is proprietary so you can't study its implementations of these algorithms.
  • Modern Computer Arithmetic, which has quickly become my favourite bignum textbook, has an interesting alternative string to number technique based on iterative multiplication of the input string. I still want to try that out some day. It also hints at scaled remainder tree for number to string conversion, but doesn't explain how it works.
  • Division-Free Binary-to-Decimal Conversion by Cyril Bouvier and Paul Zimmermann explains a number to string conversion based on the scaled remainder tree. Unfortunately, this paper only confused me.
  • It looks like Gambit has a recursive divide-and-conquer implementation of string->number that's slightly different from ours. Their recursive number->string implementation is rather interesting too.

Now that you understand the basic bignum algorithms, let's look at various tricks to speed up these operations.

Faster multiplication

Like I mentioned in the previous part of this series, the primary school method for addition and subtraction is the fastest known, for bignums of any size. And you can't really get better than O(n). However, primary school multiplication is O(n²), and there are several better algorithms than that.

Multiplication by a fixnum

As you now know, multiplication is done by looping over the half-digits of the two bignum arguments in a nested loop. Nested loops are something to avoid as much as possible, because they imply a quadratic-time algorithm: one that performs O(n²) operations.

When multiplying a bignum by a fixnum, the naive implementation is to "promote" the fixnum into a bignum, and then perform a standard bignum multiplication. However, if the fixnum fits in a half-digit, you can avoid the nested loop. Complexity-wise, this isn't a great improvement, as the outer loop is only run once anyway. But, because the small value fits in a machine word, you only read from one bignum address instead of two on every iteration of the inner loop. Now that makes a big difference! In Scheme48, this improved algorithm looks like this:

static void
bignum_destructive_scale_up(bignum_type bignum, bignum_digit_type factor)
{
  bignum_digit_type carry = 0;
  bignum_digit_type * scan = (BIGNUM_START_PTR (bignum));
  bignum_digit_type two_digits;
  bignum_digit_type product_low;
#define product_high carry
  bignum_digit_type * end = (scan + (BIGNUM_LENGTH (bignum)));
  BIGNUM_ASSERT ((factor > 1) && (factor < BIGNUM_RADIX_ROOT));
  while (scan < end)
    {
      two_digits = (*scan);
      product_low = ((factor * (HD_LOW (two_digits))) + (HD_LOW (carry)));
      product_high =
        ((factor * (HD_HIGH (two_digits))) +
         (HD_HIGH (product_low)) +
         (HD_HIGH (carry)));
      (*scan++) = (HD_CONS ((HD_LOW (product_high)), (HD_LOW (product_low))));
      carry = (HD_HIGH (product_high));
    }
  /* A carry here would be an overflow, i.e. it would not fit.
     Hopefully the callers allocate enough space that this will
     never happen.
   */
  BIGNUM_ASSERT (carry == 0);
  return;
#undef product_high
}

The current version of this algorithm in CHICKEN is a bit shorter and slightly more versatile because it returns the carry if the result doesn't fit:

static C_uword bignum_digits_destructive_scale_up_with_carry(
    C_uword *start, C_uword *end, C_uword factor, C_uword carry)
{
  C_uword digit, p;

  assert(C_fitsinbignumhalfdigitp(carry));
  assert(C_fitsinbignumhalfdigitp(factor));

  while (start < end) {
    digit = (*start);

    p = factor * C_BIGNUM_DIGIT_LO_HALF(digit) + carry;
    carry = C_BIGNUM_DIGIT_LO_HALF(p);

    p = factor * C_BIGNUM_DIGIT_HI_HALF(digit) + C_BIGNUM_DIGIT_HI_HALF(p);
    (*start++) = C_BIGNUM_DIGIT_COMBINE(C_BIGNUM_DIGIT_LO_HALF(p), carry);
    carry = C_BIGNUM_DIGIT_HI_HALF(p);
  }
  return carry;
}

This is a destructive operation, which means it doesn't operate on an "empty" target bignum. Instead, you copy the original bignum, which is then mutated in-place. That makes it faster because you're only reading and writing to a single digit array, so it's much more localised in memory. In the next post, we'll see why returning the carry can be helpful.

Strictly speaking, you could do the same for addition and subtraction: if one of the arguments is a bignum and the other a fixnum, you could destructively add or subtract it. In fact, Scheme48 does this, but the CHICKEN implementation does not. If you'll recall, the CHICKEN bignum implementation already copies the larger of the two numbers into a new bignum and modifies it in-place while adding the smaller number. The effect of this is almost the same as what Scheme48 does, while also improving the default case of adding two bignums.

There's a second optimisation that can be done, which Scheme48 does not do at all. If the factor by which we're multiplying is a power of two, you can simply shift the bignum left by the 2log (base-2 logarithm) of the factor! The code for detecting this is pretty ugly, so I'm not going to show it. Just remember that this makes a huge difference in practice.
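
The core idea is simple enough to sketch, though. This is an illustrative fragment, not CHICKEN's actual detection code, and bignum-scale-up is a stand-in for the general algorithm shown above:

;; A positive power of two has exactly one bit set, so the familiar
;; bit trick of ANDing n with n-1 detects it.  The shift amount is
;; then the factor's bit length minus one, i.e. its 2log.
(define (power-of-two? n)
  (and (fx> n 0) (fx= 0 (fxand n (fx- n 1)))))

(define (scale-by-fixnum x factor)
  (if (power-of-two? factor)
      (arithmetic-shift x (fx- (integer-length factor) 1))
      (bignum-scale-up x factor)))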

Karatsuba's multiplication

Multiplying two bignums can be done faster, too. There are two important algorithms: Karatsuba multiplication and FFT-based multiplications like Schönhage-Strassen. I must confess that my understanding of the FFT is too weak to implement, let alone explain the Schönhage-Strassen algorithm, so if you want to read about that, check the reading list at the end of this post.

Luckily, the Karatsuba algorithm is easy to explain. It was named after the Russian mathematician Anatoly Karatsuba, who was the first to reject the idea that O(n²) is the best we can do. His approach is based on divide and conquer. It uses a simple algebraic trick to reduce work on each step.

Let's say we want to multiply two bignums of two limbs, x[1,2] and y[1,2] in base B. The result is of course xy. Rephrased as an equation:

 xy = (x₁·B + x₂) · (y₁·B + y₂)

This can be written out as:

 xy = (x₁·y₁)·B² + (x₁·y₂ + x₂·y₁)·B + x₂·y₂

If we call these three components a, b and c, then xy = a·B² + b·B + c.

Now, you can calculate all three components separately, but this requires exactly as many steps as the "school book" algorithm we already have, namely O(n²). However, the crucial insight that Karatsuba had is that you can derive b from a and c with a simpler multiplication. He actually did this by first making it more complex. This shows the man's genius: complicating things to simplify them isn't exactly intuitive! So we have a, b and c:

a = x₁·y₁
b = x₁·y₂ + x₂·y₁
c = x₂·y₂

We first complicate b by adding (a-a) and (c-c), then expand:

b = x₁·y₂ + x₂·y₁ + (a - a) + (c - c)
b = x₁·y₂ + x₂·y₁ + ((x₁·y₁) - (x₁·y₁)) + (x₂·y₂) - (x₂·y₂)

Next, we remove the parentheses and reorder by moving one a and one c to the left, and finally we can re-group the moved variables:

b = x₁·y₂ + x₂·y₁ + x₁·y₁ - x₁·y₁ + x₂·y₂ - x₂·y₂
b = x₁·y₁ + x₂·y₂ + x₂·y₁ + x₁·y₂ - x₁·y₁ - x₂·y₂
b = (x₁ + x₂)·(y₁ + y₂) - x₁·y₁ - x₂·y₂

And ta-da, as if by magic, we have:

b = (x₁ + x₂)·(y₁ + y₂) - a - c

If you compare this with the original definition of b, you'll notice that the new definition has one multiplication instead of two. The multiplication factors are slightly larger than in the original formula, but because there's only one multiplication, we end up doing less work. Of course, we'll still have to do all those additions and subtractions, but with big enough numbers, it's faster than multiplying twice. That's because the complexity of this algorithm has dropped from O(n²) to something less: the exact complexity turns out to be O(n^log₂3), which is roughly O(n^1.585).
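
A small worked example with B = 100 makes this concrete. To compute 1234 × 5678, split the numbers into x₁ = 12, x₂ = 34, y₁ = 56, y₂ = 78:

(define a (* 12 56))                       ; => 672
(define c (* 34 78))                       ; => 2652
(define b (- (* (+ 12 34) (+ 56 78)) a c)) ; => 6164 - 672 - 2652 = 2840
(+ (* a 100 100) (* b 100) c)              ; => 7006652
(* 1234 5678)                              ; => 7006652, the same result

Three small multiplications (plus shifts by powers of B) instead of the school book's four.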

Now, this is the asymptotic complexity. You can easily tell from the new formula of b that it uses several more operations and it requires splitting up x and y in two parts, which requires allocating memory for them and copying out the parts. This is a lot of overhead, which means that Karatsuba's algorithm only becomes more efficient than the "school book" algorithm for (very) large numbers. The same is true for the FFT-based algorithms mentioned earlier, those improve performance only for even larger numbers, because they split up the numbers into even more pieces. This is a (better) reason why it's probably not worth adding the extra code and complexity of an FFT implementation to CHICKEN core.

In code, a naive implementation of Karatsuba's algorithm is quite simple as well:

;; Where this is called, we ensure that |x| <= |y|
(define (karatsuba-multiply x y)
  (let* ((rs (fx* (signum x) (signum y)))     ; Result's sign (-1 or +1)
         (x (abs x))
         (n (bignum-digit-count y))
         (n/2 (fxshr n 1))                    ; (floor (/ n 2))
         (bits (fx* n/2 *bignum-digit-bits*))
         (x-hi (extract-digits x n/2 #f))     ; x[n..n/2]
         (x-lo (extract-digits x 0 n/2)))     ; x[n/2..0]
    (let* ((y (abs y))
           (y-hi (extract-digits y n/2 #f))   ; y[n..n/2]
           (y-lo (extract-digits y 0 n/2))    ; y[n/2..0]
           (a  (* x-hi y-hi))
           (b  (* x-lo y-lo))
           (c  (* (- x-hi x-lo)
                  (- y-hi y-lo))))
      (* rs (+ (arithmetic-shift a (fx* bits 2))
               (+ (arithmetic-shift (+ b (- a c)) bits)
                   b))))))

Especially helpful here is that the operators prefixed by fx indicate clearly when we're working with fixnums. This code uses absolute values because those are simpler to deal with, especially when using extract-digits to quickly get at a range of bignum digits. It might be possible to make this faster by operating directly on the signed numbers.

In CHICKEN 5, this algorithm has been translated to C for stupid reasons which I will explain in a later post. But this Scheme code is taken directly from the numbers egg and cleaned up only slightly.

According to MCA, an optimal implementation should avoid allocating the intermediate calculations of (- x-hi x-lo) and (- y-hi y-lo). In a low level C-based implementation like CHICKEN 5's, it might be easier to perform in-place modification of these numbers, but so far I haven't been successful in doing this. Nevertheless, our Karatsuba implementation is efficient enough for now.

Sometimes, the Karatsuba algorithm is referred to as the Toom-Cook algorithm. That's because Toom and Cook figured out a way to generalise the algorithm. This way, you can split the numbers into any number of components, instead of two components like Karatsuba did. Apparently there's a sweet spot in number sizes where a 3-way split is faster than a 2-way split, but pretty soon after that, the numbers get big enough and the FFT-based algorithms overtake Toom-Cook in efficiency.

Faster division

Faster multiplication is interesting, but division is the real tortoise in this race, so let's see how we can speed that up. It turns out that the approaches are rather similar to those of multiplication.

Division by a fixnum

Recall that the division algorithm needs to "guess" how many times the denominator fits in the numerator based on the first half-digit (plus some magic surrounding the second half-digit). If the denominator itself is only a half-digit, there's no need to guess.

So, just like when multiplying by a small fixnum, we have a destructive division algorithm that operates on a copy of the numerator. The Scheme48 version I started with:

/* Given (denominator > 1), it is fairly easy to show that
   (quotient_high < BIGNUM_RADIX_ROOT), after which it is easy to see
   that all digits are < BIGNUM_RADIX. */

static bignum_digit_type
bignum_destructive_scale_down(bignum_type bignum, bignum_digit_type denominator)
{
  bignum_digit_type numerator;
  bignum_digit_type remainder = 0;
  bignum_digit_type two_digits;
#define quotient_high remainder
  bignum_digit_type * start = (BIGNUM_START_PTR (bignum));
  bignum_digit_type * scan = (start + (BIGNUM_LENGTH (bignum)));
  BIGNUM_ASSERT ((denominator > 1) && (denominator < BIGNUM_RADIX_ROOT));
  while (start < scan)
    {
      two_digits = (*--scan);
      numerator = (HD_CONS (remainder, (HD_HIGH (two_digits))));
      quotient_high = (numerator / denominator);
      numerator = (HD_CONS ((numerator % denominator), (HD_LOW (two_digits))));
      (*scan) = (HD_CONS (quotient_high, (numerator / denominator)));
      remainder = (numerator % denominator);
    }
  return (remainder);
#undef quotient_high
}

And the smaller version based, again, on the division algorithm from Hacker's Delight:

static C_uword bignum_digits_destructive_scale_down(
  C_uword *start, C_uword *end, C_uword denominator)
{
  C_uword digit, k = 0;
  C_uhword q_j_hi, q_j_lo;

  /* Single digit divisor case from Hacker's Delight, Figure 9-1,
   * adapted to modify u[] in-place instead of writing to q[].
   */
  while (start < end) {
    digit = (*--end);

    k = C_BIGNUM_DIGIT_COMBINE(k, C_BIGNUM_DIGIT_HI_HALF(digit)); /* j */
    q_j_hi = k / denominator;
    k -= q_j_hi * denominator;

    k = C_BIGNUM_DIGIT_COMBINE(k, C_BIGNUM_DIGIT_LO_HALF(digit)); /* j-1 */
    q_j_lo = k / denominator;
    k -= q_j_lo * denominator;
    
    *end = C_BIGNUM_DIGIT_COMBINE(q_j_hi, q_j_lo);
  }
  return k; /* The remainder */
}

And, just like with multiplication, if you're dividing by a power of two, there's an easy optimisation: you can simply shift the numerator right by as many bits as the 2log of the denominator. The remainder is formed by the bits that were shifted out.
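
As a sketch (the procedure name is illustrative, and we assume a non-negative numerator):

(define (quotient&remainder/shift n d)  ; d must be a power of two, 2^k
  (let ((k (fx- (integer-length d) 1))) ; k = 2log of the denominator
    (values (arithmetic-shift n (fxneg k))  ; quotient: shift right by k
            (bitwise-and n (- d 1)))))      ; remainder: shifted-out bits

;; (quotient&remainder/shift 1234 16) => 77 and 2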

Burnikel-Ziegler division

And here too, there's a divide and conquer algorithm named after the mathematician (or, in this case, mathematicians) who discovered it. This algorithm is a lot more complicated than Karatsuba's relatively simple algebraic trick, and a lot harder to implement correctly. The paper is long and still not very helpful when it comes down to the crucial details. I found the super-elegant presentation in MCA to be more helpful in figuring out details. Especially the algorithm's start-up procedure is tricky to get right. I will use the explanation style from MCA because it is simpler than the original paper.

The algorithm is similar to the classical "school book" division algorithm, but "in the large". The basic idea is that we operate on partial bignums at a time instead of on half-digits. The core algorithm handles only 2n/n division. This means that the numerator must be twice the size of the denominator. It splits the numerator in two halves. Each half is then divided by the entire denominator and finally recombined to form the result. These two divisions are themselves also done in two steps, thereby making the numbers smaller in the recursion.

Because the intermediate division only divides by the first half of the denominator, the result may end up negative. So, like the schoolbook method, this algorithm also makes an adjustment when re-joining the two intermediate results. The core algorithm is as follows:

  • If the denominator is smaller than some limit, fall back to "primary school" algorithm, otherwise:
  • Split the denominator B in two: B₁·β + B₂. So, if B is a bignum of n limbs, the base β is half that.
  • Next, split the numerator A into four such parts: A₁·β³ + A₂·β² + A₃·β + A₄.
  • First half:
    • Divide A₁·β + A₂ by B₁, yielding the guessed quotient Q̂₁ and remainder R₁ (the recursive step).
    • Combine the remainder R₁ with A₃ and subtract Q̂₁·B₂, yielding R̂₁ = R₁·β + A₃ - Q̂₁·B₂.
    • While R̂₁ < 0, adjust the guess; Q̂₁:=Q̂₁-1 and R̂₁:=R̂₁+B.
  • Second half:
    • Divide R̂₁ by B₁, yielding the guessed quotient Q̂₂ and remainder R (another recursive step).
    • Combine the remainder R with A₄ and subtract Q̂₂·B₂, yielding R̂ = R·β + A₄ - Q̂₂·B₂.
    • While R̂ < 0, adjust the guess; Q̂₂:=Q̂₂-1 and R̂:=R̂+B.
  • Recombination of quotient after division:
    • The final quotient is Q̂₁·β + Q̂₂, the final remainder is just R̂.

The interesting part is that B₂ is only ever used for checking the guess. It is not involved in any division. Of course, in the recursion B₂ is also split in two parts, so the high half will be used in the next division, and so on.

In the diagram you can see how it works on a (very) small sample:

In the diagram, B₁ = 31 is shown in white on the left in the first and fourth rows. B₂ = 21 is shown in green on the left in the second and final rows. In the first row you also see highlighted in white A₁·β + A₂ = 3456 and R₁ = 15. In the second row, A₃ = 78 is shown in white, as it drops down to form R̂₁ = R₁·β + A₃ = 1578 with Q̂₁ = 111.

Between the second and third rows, Q̂₁ = 111 is adjusted to 110 and R̂₁ = -753 is adjusted to 2368 by adding the denominator.

In the third row we continue with the second half, dividing R̂₁ by B₁ in the same manner and then recombining R = 43 with A₄ = 67 and subtracting Q̂₂·B₂ = 1575. As no more adjustments are needed, we're done, with R̂ = 2792 and Q̂₂ = 75. Combine Q̂₁,Q̂₂ into Q = 11075 and we're done!
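
You can verify the walkthrough with plain division: the full numerator is A = 34567867 and the denominator is B = 3121:

(quotient  34567867 3121) ; => 11075 = Q̂₁·β + Q̂₂ = 110·100 + 75
(remainder 34567867 3121) ; => 2792  = R̂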

Burnikel and Ziegler present the algorithm in their paper in a bit of a roundabout way that didn't make sense to me at first. It requires understanding the big picture, which they don't really explain up front. So it's best to read the paper entirely, and then go back and re-read it to grasp the details. It's a bit bottom-up in a sense, because they refactor it into two algorithms; one for dividing numbers 2n/n, and one for dividing numbers 3n/2n. This confused me no end, as it resulted in a bit of a cyclic definition.

In the explanation I gave above, 2n/n is the overall algorithm as a whole. The first and second "halves" of the algorithm are really identical, and represented by Burnikel and Ziegler as two calls to the 3n/2n algorithm. This can be seen in the Scheme code below:

(define (digit-bits n) (fx* *bignum-digit-bits* n))    ; Small helper

;; Here and in 2n/1n we pass both b and [b1, b2] to avoid splitting
;; up the number more than once.  This is a helper function for 2n/n.
(define (burnikel-ziegler-3n/2n a12 a3 b b1 b2 n)
  (receive (q^ r1)
      (if (< (arithmetic-shift a12 (fxneg (digit-bits n))) b1)
          (let* ((n/2 (fxshr n 1))                     ; (floor (/ n 2))
                 (b11 (extract-digits b1 n/2 #f))      ; b1[n..n/2]
                 (b12 (extract-digits b1 0 n/2)))      ; b1[n/2..0]
            (burnikel-ziegler-2n/1n a12 b1 b11 b12 n))
          ;; Don't bother dividing if a1 is a larger number than b1.
	  ;; We use a maximum guess instead (see paper for proof).
          (let ((base*n (digit-bits n)))
            (values (- (arithmetic-shift 1 base*n) 1)  ; B^n-1
                    (+ (- a12 (arithmetic-shift b1 base*n)) b1))))
    (let ((r1a3 (+ (arithmetic-shift r1 (digit-bits n)) a3)))
      (let lp ((r^ (- r1a3 (* q^ b2)))
               (q^ q^))
        (if (negative? r^)
            (lp (+ r^ b) (- q^ 1))                     ; Adjust!
            (values q^ r^))))))

;; The main 2n/n algorithm which calls 3n/2n twice.  Here, a is the
;; numerator, b the denominator, n is the length of b (in digits) and
;; b1 and b2 are the two halves of b (these never change).
(define (burnikel-ziegler-2n/1n a b b1 b2 n)
  (if (or (fxodd? n) (fx< n DIV-LIMIT))                ; Can't recurse?
      (quotient&remainder a b)                         ; Use school division
      (let* ((n/2 (fxshr n 1))
             ;; Split a and b into n-sized parts [a1, ..., a4] and [b1, b2]
             (a12 (extract-digits a n #f))             ; a[m..n]
             (a3  (extract-digits a n/2 n))            ; a[n..n/2]
             (a4  (extract-digits a 0 n/2)))           ; a[n/2..0]
        ;; Calculate high quotient and intermediate remainder (first half)
        (receive (q1 r1) (burnikel-ziegler-3n/2n a12 a3 b b1 b2 n/2)
          ;; Calculate low quotient and final remainder (second half)
          (receive (q2 r) (burnikel-ziegler-3n/2n r1 a4 b b1 b2 n/2)
            ;; Recombine quotient parts as q = [q1,q2]
            (values (+ (arithmetic-shift q1 (digit-bits n/2)) q2) r))))))

The reason b1, b2 are passed in but not a1, ..., a4 has to do with the "full" algorithm, which deals with unbalanced division where a may be bigger than 2n, given b of size n. There, b is constant, so it pays off to "cache" b1 and b2. Because a keeps changing, we don't cache it.

This full algorithm for dividing two numbers x and y of arbitrary lengths is as follows: If the denominator y is of size n, we can simply chop up the numerator x into n-sized pieces. We then perform a division algorithm on those pieces, using a sort of "sliding window" over x. This passes [x_{i+1},x_i] and y to 2n/n, and recombines the remainder r_i with x_{i-1} to get [r_i,x_{i-1}], which is used for 2n/n in the next iteration, and so on.

Well, in theory it's simple...

(define (quotient&remainder/burnikel-ziegler x y return-quot? return-rem?)
  ;; The caller will already have verified that abs(x) > abs(y), but we
  ;; need to remember the signs of the input and make them absolute.
  (let* ((q-neg? (if (negative? y) (not (negative? x)) (negative? x)))
         (r-neg? (negative? x))
         (x (abs x))
         (y (abs y))
         (s (bignum-digit-count y))
         ;; Define m as min{2^k|(2^k)*DIV_LIMIT > s}.
         ;; This ensures we shift as little as possible (less pressure
         ;; on the GC) while maintaining a power of two until we drop
         ;; below the threshold, so we can always split N in half.
         (m (fxshl 1 (integer-length (fx/ s DIV-LIMIT))))
         (j (fx/ (fx+ s (fx- m 1)) m))  ; j = s/m, rounded up
         (n (fx* j m))
         ;; Normalisation, just like with normal school division
         (norm-shift (fx- (digit-bits n) (integer-length y)))
         (x (arithmetic-shift x norm-shift))
         (y (arithmetic-shift y norm-shift))
         ;; l needs to be the smallest value so that a < base^{l*n}/2
         (l (fx/ (fx+ (bignum-digit-count x) n) n))
         (l (if (fx= (digit-bits l) (integer-length x)) (fx+ l 1) l))
         (t (fxmax l 2))
         (y-hi (extract-digits y (fxshr n 1) #f))   ; y[n..n/2]
         (y-lo (extract-digits y 0 (fxshr n 1))))   ; y[n/2..0]
    (let lp ((zi (arithmetic-shift x (fxneg (digit-bits (fx* (fx- t 2) n)))))
             (i (fx- t 2))
             (quot 0))
      (receive (qi ri) (burnikel-ziegler-2n/1n zi y y-hi y-lo n)
        (let ((quot (+ (arithmetic-shift quot (digit-bits n)) qi)))
          (if (fx> i 0)
              (let ((zi-1 (let* ((base*n*i-1 (fx* n (fx- i 1)))
                                 (base*n*i   (fx* n i))
                                 ;; x_{i-1} = x[n*i..n*(i-1)]
                                 (xi-1 (extract-digits x base*n*i-1 base*n*i)))
                            (+ (arithmetic-shift ri (digit-bits n)) xi-1))))
                (lp zi-1 (fx- i 1) quot))
              ;; De-normalise remainder, but only if necessary
              (let ((rem (if (or (not return-rem?) (eq? 0 norm-shift))
                             ri
                             (arithmetic-shift ri (fxneg norm-shift)))))
                ;; Return values (quot, rem or both) with correct sign:
                (cond ((and return-quot? return-rem?)
                       (values (if q-neg? (- quot) quot)
                               (if r-neg? (- rem) rem)))
                      (return-quot? (if q-neg? (- quot) quot))
                      (else (if r-neg? (- rem) rem))))))))))

As you can see, this procedure is extremely hairy. The trickery is in how the bignums are chopped up into n-sized pieces. The sizes we use should be nice powers of two, which improves the algorithm's effectiveness. Notice the (or (fxodd? n) (fx< n DIV-LIMIT)) check in 2n/1n; whenever that is true, we fall back to the school division. This has to be avoided as much as possible, so that's why we try to scale up the number x to nicely rounded multiples of n. At the same time, you have to make sure that the numbers don't grow too large, because that would create more work for the algorithm!

The particular calculation is tricky, but the idea is simple: you want to scale up both numbers to the closest power of two that's larger than the cutoff size. Then, the numerator is scaled up so that it is a size that's a multiple of n, the final size of the denominator. No doubt my implementation of this part of the algorithm can be simplified.

Reading list

First, start with the reading list of the previous post, because most of those references discuss advanced algorithms as well. The ones below are either more specific or more advanced than the descriptions you'll find in the standard references.

