CHICKEN's numeric tower: part 2

This is the second part documenting my journey to add full numeric tower support to CHICKEN core. In this post I explain some of the basic algorithms. You'll need to understand these before going on to the next part, which deals with fancier versions of these algorithms.

Classical numerical algorithms

Like I mentioned in my previous post, the Scheme48 numerical code used only the so-called "classical" algorithms. Comments in the Scheme48 code refer to Donald Knuth's seminal work, The Art of Computer Programming, Volume 2, chapter 4. Interestingly, after these classical algorithms, Knuth explains a few faster algorithms, but Scheme48 did not implement these.

Addition and subtraction

Addition and subtraction are extremely simple algorithms: you simply loop over the limbs of both numbers simultaneously, and add them together, taking care to propagate the carry or borrow from the previous position. This is the same algorithm you learned in primary school. The difference is that the computer can add a whole machine word, while at school you would handle one decimal position at a time. This is O(n) in complexity:

For subtraction the algorithm is the same, except it uses borrowing instead of carrying. You might wonder what happens if the value being subtracted is bigger than the one being subtracted from. If those numbers are both positive, that results in a negative number, but when subtracting a negative number from a smaller positive number, its result would be positive.

The solution is simple in case you're using unsigned representation with explicit sign: You compare the absolute values first. If the second value is larger than the first, you swap them first. Then you subtract them and toggle the sign of the result: If a - b = x, then multiplying all factors by -1 gives: -a + b = -x, or simply b - a = -x.

As far as I'm aware, the primary school algorithm is it. There are no shortcuts, and no quicker ways around it. However, Scheme48 used a surprising representation for their bignums: the limbs inside the bignum did not make use of the top two bits in the machine word. Presumably they did this for portability and correctness. You see, in C, signed overflow is undefined, just so compilers can eke out a little more performance. I think this is completely ridiculous, and it's another source of security issues, but that's what life with C is like.

However, CHICKEN uses the -fwrapv compiler option to enforce sane overflow behaviour. That means CHICKEN bignums are free to use all available bits in a machine word. This representation will also use slightly less memory for really large bignums, especially on 32-bit systems. But, more importantly, it's faster because there's less masking and checking going on. Here's the heart of Scheme48's bignum addition:

while (scan_y < end_y)
{
  sum = ((*scan_x++) + (*scan_y++) + carry);
  if (sum < BIGNUM_RADIX)     /* No overflow */
    {
      (*scan_r++) = sum;
      carry = 0;
    }
  else                        /* Overflow, adjust and set carry */
    {
      (*scan_r++) = (sum - BIGNUM_RADIX);   /* sum modulo radix */
      carry = 1;
    }
}

And here is CHICKEN's:

while (scan_y < end_y) {
  digit = *scan_r;
  if (carry) {
    sum = digit + *scan_y++ + 1;
    carry = sum <= digit; /* Overflow if wrapped result is smaller or equal */
  } else {
    sum = digit + *scan_y++;
    carry = sum < digit;  /* Overflow if wrapped result is smaller */
  }
  (*scan_r++) = sum;
}

Aside from the difference in coding style, you can see that Scheme48 needs to adjust the result if we got a carry. The BIGNUM_RADIX is the maximum bignum digit value plus one. In terms of instructions, this masking and checking doesn't make that much of a difference, surprisingly enough.

But while tweaking this algorithm, I discovered that a nice performance improvement could be gained: First, copy the larger bignum to the result bignum, and then you loop over the second bignum, adding its limbs to the result's limbs, modifying it in-place. I suppose this is faster because you're only manipulating two pointers at a time rather than three. This is why scan_x is not used in the CHICKEN version. This requires memcpy to be fast, so on some systems, the CHICKEN approach can potentially be slower than the Scheme48 one.

Multiplication

Multiplication is where things start to get more interesting. The classical algorithm is still pretty basic, but much slower because it's O(n²) in complexity. This is because in this algorithm, we multiply each position in the first number by every position in the other number, in a nested loop:

As the graphic attempts to clarify, we take only half-digits when multiplying, because the result must fit a single digit. This slows things down even further, because we can only iterate over the limbs at half speed. In Scheme48's code, it looked like this:

#define x_digit x_digit_high
#define y_digit y_digit_high
#define product_high carry
while (scan_x < end_x)
  {
    x_digit = (*scan_x++);
    x_digit_low = (HD_LOW (x_digit));
    x_digit_high = (HD_HIGH (x_digit));
    carry = 0;
    scan_y = start_y;
    scan_r = (start_r++);
    while (scan_y < end_y)
      {
        y_digit = (*scan_y++);
        y_digit_low = (HD_LOW (y_digit));
        y_digit_high = (HD_HIGH (y_digit));
        product_low =
          ((*scan_r) +
           (x_digit_low * y_digit_low) +
           (HD_LOW (carry)));
        product_high =
          ((x_digit_high * y_digit_low) +
           (x_digit_low * y_digit_high) +
           (HD_HIGH (product_low)) +
           (HD_HIGH (carry)));
        (*scan_r++) =
          (HD_CONS ((HD_LOW (product_high)), (HD_LOW (product_low))));
        carry =
          ((x_digit_high * y_digit_high) +
           (HD_HIGH (product_high)));
      }
    (*scan_r) += carry;
  }

The #define statements at the start are rather interesting, and seem to have been meticulously chosen to maximise re-use of variables. This was probably done to cajole inefficient compilers into re-using registers. Some of the bignum code is originally from 1986, when C compilers weren't very sophisticated! The HD_CONS macro combines two halfwords together, while the HD_LOW and HD_HIGH extract the low and high halfword from a machine word, respectively:

#define HD_LOW(digit) ((digit) & BIGNUM_HALF_DIGIT_MASK)
#define HD_HIGH(digit) ((digit) >> BIGNUM_HALF_DIGIT_LENGTH)
#define HD_CONS(high, low) (((high) << BIGNUM_HALF_DIGIT_LENGTH) | (low))

Remember, Scheme48 bignum digits use only 30 bits on a 32-bit machine and 62 bits on a 64-bit machine, so the masking and shifting is required. Because CHICKEN bignum digits now used the full machine word, I was able to replace it with another, much shorter implementation, which relies on "automatic" truncation of machine words:

/* From Hacker's Delight, Figure 8-1 (top part) */
for (j = 0; j < length_y; ++j) {
  yj = C_uhword_ref(yd, j);
  if (yj == 0) continue;
  carry = 0;
  for (i = 0; i < length_x; ++i) {
    product = (C_uword)C_uhword_ref(xd, i) * yj +
              (C_uword)C_uhword_ref(rd, i + j) + carry;
    C_uhword_set(rd, i + j, product);
    carry = C_BIGNUM_DIGIT_HI_HALF(product);
  }
  C_uhword_set(rd, j + length_x, carry);
}

As the comment says, this code is adapted from the fantastic booklet "Hacker's Delight" by Henry S. Warren, so any elegance you see in this code is not due to me! The original code is even more elegant, but it assumes little-endian order of bignum digits and the halfwords within these digits. On big endian machines the halfwords will be swapped within their machine words, so I introduced C_uhword_ref and C_uhword_set, which are ugly macros to select the higher/lower halfword of the relevant machine word:

/* The bignum digit representation is fullword- little endian, so on
 * LE machines the halfdigits are numbered in the same order.  On BE
 * machines, we must swap the odd and even positions.
 */
#ifdef C_BIG_ENDIAN
#define C_uhword_ref(x, p)           ((C_uhword *)(x))[(p)^1]
#else
#define C_uhword_ref(x, p)           ((C_uhword *)(x))[(p)]
#endif
#define C_uhword_set(x, p, d)        (C_uhword_ref(x,p) = (d))

The (C_uhword *) casts here ensure that only a halfword is extracted. Most machines have an instruction to fetch a halfword into a register, which is much faster than masking it out. So, even if it's ugly and hacky, I vastly prefer this over the Scheme48 code.

Division

Oh boy, where to start? The above algorithms are so simple, but division, now that's quite a can of worms. To make things worse, many textbooks (including Knuth) gloss over important details, assuming that readers can figure it out on their own.

The first problem is that, unlike the above algorithms, the traditional pen and paper-algorithm doesn't translate well to the computer. Let's look at an example division, performed by hand as you would have learned it in primary school. Here, we divide 543456 (the dividend or numerator) by 344 (the divisor or denominator):

The notation might differ slightly from what you're used to (different schools use different notations, apparently), but the algorithm should be familiar: Given a denominator of n digits, you take n+1 digits from the numerator (but n in the first step!), then divide them by the denominator. You write the quotient on the right. Then you subtract the remainder from the digits you took from the numerator, and you continue with the next digit, until you hit the last digit of the numerator. The final subtraction gives you the remainder at the bottom, and the digits you wrote on the right together form the quotient.

There is a problem with this "algorithm", though: it requires you to divide each numerator part by the entire denominator. If the denominator is a bignum, you're still dividing one bignum by another! Using this algorithm recursively won't work either, because it doesn't reduce the denominator's size.

However, it turns out that you can guess the results of these intermediate divisions, based on the first few digits of both numbers. Intuitively, you can get a pretty good guess of how many times a number fits in another by doing a trial division of their leading digits.

For example, a number like 3xx can fit about 2x times in a number like 7xxx. In other words, our guess is 7/3 = 2. For example, the number 300 will fit 23 times in 7000. This guess isn't completely accurate: for example, the number 399 will fit only 17 times in 7000. Note that the leading digit is now a 1 instead of a 2, which means our guess was bad. So in some cases we need to correct the guess. Note that a guess may never exceed 9, because we're calculating one decimal position of the quotient. All this leads to the following relatively simple algorithm:

Make a guess based on trial division of the leading digits as described above;
Multiply the denominator by the guess, to get a result;
Subtract this result from the numerator, but:
If the subtraction goes below zero, add back the denominator and adjust the guess.

This algorithm would work, but it takes many iterations. It can be improved by taking into account two leading digits of the denominator, instead of one. This improves the accuracy of the guess, and it can be done easily if we only use halfdigits in our calculation (which we'll have to do anyway to avoid overflow when multiplying). In the picture below, for simplicity and brevity, each digit represents one halfword.

The picture above is pretty complicated! I hope it clarifies the algorithm a little. The picture clearly shows two places where this algorithm guessed wrong, in which case we need to adjust some values (shown in red).

To understand the algorithm, first note the highlighted quotient digits with a question mark below them. These indicate that the quotient digit is a guess.

We tentatively multiply this guess by the first halfdigit of the denominator, and subtract it from the current remainder, giving a result in green. Then, we append the next digit from the denominator (in blue) to the result we just got. Finally, we multiply the next digit from the numerator (yellow) by its first digit, and see if the number is less than the combined intermediate remainder. This means the guess was correct; otherwise the guess is incorrect, because the remainder would be negative.

If the guess was wrong, we need to adjust the guess by subtracting one and performing the check again until the guess is correct. You can see this happening near the bottom of the first column in the above picture.

Once we have a correct guess based on the first two halfdigits, we go ahead and calculate the remainder. To do this, we multiply the full n digits of the denominator by the guess, and subtract the first n digits of the remaining numerator. All this can be done simultaneously, in O(n), even!

Unfortunately, after having calculated the remainder, it can turn out negative. This means the original guess was bad after all! In this case we must make a last-minute adjustment, by subtracting one from the quotient, and then adding the denominator to the remainder. This is shown in the picture in the first two steps of the second column.

The actual implementation of this horribly complicated algorithm in Scheme48 was also very complex and extremely long (it's all of the stuff between lines 1045 and 1383). So, instead of attempting to understand and rework this to be faster and more consistent with CHICKEN core, once again I opted to steal an implementation from Hacker's Delight. It looks like this:

static C_regparm void
bignum_destructive_divide_normalized(C_word big_u, C_word big_v, C_word big_q)
{
  C_uword *v = C_bignum_digits(big_v),
          *u = C_bignum_digits(big_u),
          *q = big_q == C_SCHEME_UNDEFINED ? NULL : C_bignum_digits(big_q),
           p,               /* product of estimated quotient & "denominator" */
           hat, qhat, rhat, /* estimated quotient and remainder digit */
           vn_1, vn_2;      /* "cached" values v[n-1], v[n-2] */
  C_word t, k;              /* Two helpers: temp/final remainder and "borrow" */
  /* We use plain ints here, which theoretically may not be enough on
   * 64-bit for an insanely huge number, but it is a _lot_ faster.
   */
  int n = C_bignum_size(big_v) * 2,       /* in halfwords */
      m = (C_bignum_size(big_u) * 2) - 2; /* Correct for extra digit */
  int i, j;		                  /* Just two loop variables */

  /* Part 2 of Gauche's aforementioned trick: */
  if (C_uhword_ref(v, n-1) == 0) n--;

  /* These won't change during the loop, but are used in every step. */
  vn_1 = C_uhword_ref(v, n-1);
  vn_2 = C_uhword_ref(v, n-2);

  /* See also Hacker's Delight, Figure 9-1.  This is almost exactly that. */
  for (j = m - n; j >= 0; j--) {
    /* First, determine the initial guess: */
    hat = C_BIGNUM_DIGIT_COMBINE(C_uhword_ref(u, j+n), C_uhword_ref(u, j+n-1));
    if (hat == 0) {
      if (q != NULL) C_uhword_set(q, j, 0);
      continue;
    }
    qhat = hat / vn_1;
    rhat = hat % vn_1;

    /* Next, keep making early adjustments to the guess
     * until it matches the first two digits:
     */

    /* Two whiles is faster than one big check with an OR.  Thanks, Gauche! */
    while(qhat >= (1UL << C_BIGNUM_HALF_DIGIT_LENGTH)) { qhat--; rhat += vn_1; }
    while(qhat * vn_2 > C_BIGNUM_DIGIT_COMBINE(rhat, C_uhword_ref(u, j+n-2))
	  && rhat < (1UL << C_BIGNUM_HALF_DIGIT_LENGTH)) {
      qhat--;
      rhat += vn_1;
    }

    /* Finally, multiply and subtract: */
    k = 0;
    for (i = 0; i < n; i++) {
      p = qhat * C_uhword_ref(v, i);
      t = C_uhword_ref(u, i+j) - k - C_BIGNUM_DIGIT_LO_HALF(p);
      C_uhword_set(u, i+j, t);
      k = C_BIGNUM_DIGIT_HI_HALF(p) - (t >> C_BIGNUM_HALF_DIGIT_LENGTH);
    }
    t = C_uhword_ref(u,j+n) - k;
    C_uhword_set(u, j+n, t);

    /* Subtracted too much?
     * Make a late adjustment by adding back the denominator:
     */
    if (t < 0) {
      qhat--;
      k = 0;
      for (i = 0; i < n; i++) {
        t = (C_uword)C_uhword_ref(u, i+j) + C_uhword_ref(v, i) + k;
        C_uhword_set(u, i+j, t);
	k = t >> C_BIGNUM_HALF_DIGIT_LENGTH;
      }
      C_uhword_set(u, j+n, (C_uhword_ref(u, j+n) + k));
    }
    if (q != NULL) C_uhword_set(q, j, qhat);
  } /* end of "j" loop */
}

There are some shoutouts to Gauche, which is a beautifully-crafted Scheme implementation in C. The particular "trick" referred to here simplifies the calculation of our allocation sizes a little bit by ensuring we never shift more than a halfdigit when normalising (see next section).

As you can see from the implementation, the "multiply and subtract" is actually done in one loop which scans over the remainder u and denominator v at the same time, so this is not "magic"; we can perform the multiply and subtract steps over the entire bignum in one efficient O(n) loop. Perhaps surprisingly, the overall algorithm is O(n²), just like multiplication. Division is still much slower than multiplication because each "step" performs more operations (just look at the algorithms!).

Normalisation

A real-world implementation of the above division algorithm will try to reduce the number of guess adjustments. This is done by first normalising or scaling the numbers. This is done by multiplying both the numerator and denominator with the same power of two before starting to do the division. Afterwards, the remainder must be scaled back by dividing by that power of two. Instead of multiplying and dividing, you can of course just shift the numbers.

The number by which is multiplied depends on the numerator's first digit; it must be scaled up to be at least half of the base. In base 10, you need to scale it up to at least 5, while in a "full machine word" base it's even easier: you simply shift the entire number so that the highest bit of the most significant limb is set. How Scheme48 did this:

bignum_digit_type v1 = (BIGNUM_REF ((denominator), (length_d - 1)));
while (v1 < (BIGNUM_RADIX / 2))  /* Is the high bit set yet? */
  {
    v1 <<= 1;
    shift += 1;
  }

In the CHICKEN version, we take a simpler approach by subtracting the integer length from the digit length, which effectively is the same as counting the number of leading zeroes ("nlz"):

C_uword v1 = *(C_bignum_digits(denominator) + length - 1); 
shift = C_BIGNUM_DIGIT_LENGTH - C_ilen(v1); /* nlz */

Then, both numbers are copied into temporary buffers which are shifted left in-place by the number of bits calculated here.

Normalisation works by preventing the algorithm from overshooting. Think about it: any guess may always be too high, never too low! So if you scale the first digit to be as high as possible, you can't so easily make a guess that is too high. It's weird, but the math seems to work out.

A reading list for beginners

I am writing this blog post series mostly as a quick overview and introduction to the struggles and approaches taken in CHICKEN's bignum implementation. It is not intended as a full-on tutorial. If you are serious about implementing a full numeric tower (good for you!) or diving deeper into the CHICKEN code, you'll need more. Unfortunately, good and easy to understand documentation is surprisingly hard to find, so here's a reading list to save you some effort.

Knuth's The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. The definitive reference. Many will say that every self-respecting hacker should have read these books, but truth be told they're rather tough to get through. But even if you do give up working through the books, they serve as great reference material. Why are these books so tough? They're math-heavy (especially the first book), and Knuth uses his own "hypothetical" MIX architecture for all code examples and exercises. Yes, everything is in assembly language! Nevertheless, the books are very thorough, and they're obviously written out of love for the craft.
Tom St Denis's book Multi-Precision Math is much more gentle than Knuth's books. This is the companion book for LibTomMath, a public domain, well-commented bignum library, explicitly written to be easy to understand. The book and library cover mostly classic algorithms, but there are also a handful of "advanced" algorithms, and several special-purpose optimised versions.
Per Brinch Hansen's Multiple-Length Division Revisited: A Tour of the Minefield. This little gem is helpful if you are having trouble following textbook explanations of the classical division algorithm. It was written out of frustration with the poor quality of existing explanations.
MpNT: A Multi-Precision Number Package by Tiplea et al. is another overview of a library's algorithms. This is a bit terser and more math-heavy than the LibTom book, but also covers several more advanced algorithms. This is a very good and complete reference.
Finally, Modern Computer Arithmetic by Richard Brent and Paul Zimmermann is probably the tersest, but also the most complete guide to efficient algorithms that I've found so far. These guys know what they're talking about: this book truly covers the "state of the art". Only for advanced students of numerics :)
As a bonus, if you're serious about efficiency: The "algorithms" section of the GMP manual. These are terse and incomplete, and you usually won't get a complete understanding just by reading them. However, since GMP is the most popular implementation, it is also the fastest: Researchers usually create a proof of concept implementation for GMP and compare it to the existing algorithms. So, it is important to know which algorithms GMP is currently using, and then try to find better papers that explain them.

More magic

Cautionary tales from a programmer

About this blog

CHICKEN's numeric tower: part 2 Posted on 2016-10-13