Now that you understand the basic bignum algorithms, let's look at various tricks to speed up these operations.

Faster multiplication

As I mentioned in the previous part of this series, the primary school method for addition and subtraction is the fastest known, for bignums of any size, and you can't really do better than O(n). However, primary school multiplication is O(n²), and there are several algorithms that improve on it.

Multiplication by a fixnum

As you now know, multiplication is done by looping over the half-digits of the two bignum arguments in a nested loop. Nested loops are something to avoid as much as possible, because they imply a quadratic-time algorithm: one that performs O(n²) operations.

When multiplying a bignum by a fixnum, the naive implementation is to "promote" the fixnum into a bignum and then perform a standard bignum multiplication. However, if the fixnum fits in a half-digit, you can avoid the nested loop. Complexity-wise this isn't a great improvement, as the outer loop is only run once anyway. But because the small value fits in a machine word, you read from only one bignum address instead of two on every iteration of the inner loop. Now that makes a big difference! In Scheme48, this improved algorithm looks like this:

static void
bignum_destructive_scale_up(bignum_type bignum, bignum_digit_type factor)
{
  bignum_digit_type carry = 0;
  bignum_digit_type * scan = (BIGNUM_START_PTR (bignum));
  bignum_digit_type two_digits;
  bignum_digit_type product_low;
#define product_high carry
  bignum_digit_type * end = (scan + (BIGNUM_LENGTH (bignum)));
  BIGNUM_ASSERT ((factor > 1) && (factor < BIGNUM_RADIX_ROOT));
  while (scan < end)
    {
      two_digits = (*scan);
      product_low = ((factor * (HD_LOW (two_digits))) + (HD_LOW (carry)));
      product_high =
        ((factor * (HD_HIGH (two_digits))) +
         (HD_HIGH (product_low)) +
         (HD_HIGH (carry)));
      (*scan++) = (HD_CONS ((HD_LOW (product_high)), (HD_LOW (product_low))));
      carry = (HD_HIGH (product_high));
    }
  /* A carry here would be an overflow, i.e. it would not fit.
     Hopefully the callers allocate enough space that this will
     never happen.
   */
  BIGNUM_ASSERT (carry == 0);
  return;
#undef product_high
}

The current version of this algorithm in CHICKEN is a bit shorter and slightly more versatile because it returns the carry if the result doesn't fit:

static C_uword bignum_digits_destructive_scale_up_with_carry(
    C_uword *start, C_uword *end, C_uword factor, C_uword carry)
{
  C_uword digit, p;

  assert(C_fitsinbignumhalfdigitp(carry));
  assert(C_fitsinbignumhalfdigitp(factor));

  while (start < end) {
    digit = (*start);

    /* Multiply the low half-digit and add the incoming carry; the low
     * half of p is the low half of the result digit, which we stash in
     * carry until we combine it below. */
    p = factor * C_BIGNUM_DIGIT_LO_HALF(digit) + carry;
    carry = C_BIGNUM_DIGIT_LO_HALF(p);

    /* Multiply the high half-digit, add the overflow from the low
     * half, then combine both halves into the result digit.  What's
     * left over carries into the next digit. */
    p = factor * C_BIGNUM_DIGIT_HI_HALF(digit) + C_BIGNUM_DIGIT_HI_HALF(p);
    (*start++) = C_BIGNUM_DIGIT_COMBINE(C_BIGNUM_DIGIT_LO_HALF(p), carry);
    carry = C_BIGNUM_DIGIT_HI_HALF(p);
  }
  return carry;
}

This is a destructive operation, which means it doesn't operate on an "empty" target bignum. Instead, you copy the original bignum, and the copy is then mutated in-place. That makes it faster because you're reading from and writing to a single digit array, which is much more localised in memory. In the next post, we'll see why returning the carry can be helpful.
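For illustration, a non-destructive entry point might wrap the destructive primitive along these lines. This is purely a sketch: bignum-copy-extended and bignum-digits-destructive-scale-up! are hypothetical Scheme-level names for the copy helper and the C loop above.

;; Sketch: multiply bignum b by a half-digit fixnum f, non-destructively.
(define (bignum-scale b f)
  ;; Copy b into a fresh bignum with one extra digit, so the final
  ;; carry returned by the C loop always has somewhere to go.
  (let ((result (bignum-copy-extended b 1)))
    (bignum-digits-destructive-scale-up! result f)   ; mutates the copy
    result))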

Strictly speaking, you could do the same for addition and subtraction: if one of the arguments is a bignum and the other a fixnum, you could destructively add or subtract it. In fact, Scheme48 does this, but the CHICKEN implementation does not. If you recall, the CHICKEN bignum implementation already copies the larger of the two numbers into a new bignum and mutates it in-place while adding the smaller number. The effect is almost the same as what Scheme48 does, while also improving the common case of adding two bignums.

There's a second optimisation that can be done, which Scheme48 does not do at all. If the factor by which we're multiplying is a power of two, you can simply shift the bignum left by the base-2 logarithm of the factor! The code for detecting this is pretty ugly, so I'm not going to show it. Just remember that this makes a huge difference in practice.
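The idea itself is a one-line bit trick, though. Here's a sketch, reusing the hypothetical bignum-scale from above and assuming f > 1, as the asserts earlier already guarantee:

;; A power of two has exactly one bit set, so (fxand f (fx- f 1)) is 0,
;; and then log2(f) = (integer-length f) - 1.
(define (scale-maybe-by-shift b f)
  (if (fx= 0 (fxand f (fx- f 1)))                     ; single bit set?
      (arithmetic-shift b (fx- (integer-length f) 1)) ; shift by log2(f)
      (bignum-scale b f)))                            ; general scaling loop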

Karatsuba's multiplication

Multiplying two bignums can be done faster, too. There are two important algorithms: Karatsuba multiplication and FFT-based algorithms like Schönhage-Strassen. I must confess that my understanding of the FFT is too weak to implement, let alone explain, the Schönhage-Strassen algorithm, so if you want to read about that, check the reading list at the end of this post.

Luckily, the Karatsuba algorithm is easy to explain. It is named after the Russian mathematician Anatoly Karatsuba, who was the first to show that O(n²) is not the best we can do. His approach is based on divide and conquer, using a simple algebraic trick to reduce the work at each step.

Let's say we want to multiply two bignums of two limbs each, x = x₁x₂ and y = y₁y₂, in base B. The result is of course xy. Rephrased as an equation:

 xy = (x₁·B + x₂) · (y₁·B + y₂)

This can be written out as:

 xy = (x₁·y₁)·B² + (x₁·y₂ + x₂·y₁)·B + x₂·y₂

If we call these three components a, b and c, then xy = a·B² + b·B + c.

Now, you can calculate all three components separately, but this requires exactly as many steps as the "school book" algorithm we already have, namely O(n²). However, the crucial insight that Karatsuba had is that you can derive b from a and c with a simpler multiplication. He actually did this by first making it more complex. This shows the man's genius: complicating things to simplify them isn't exactly intuitive! So we have a, b and c:

a = x₁·y₁
b = x₁·y₂ + x₂·y₁
c = x₂·y₂

We first complicate b by adding (a-a) and (c-c), then expand:

b = x₁·y₂ + x₂·y₁ + (a - a) + (c - c)
b = x₁·y₂ + x₂·y₁ + ((x₁·y₁) - (x₁·y₁)) + (x₂·y₂) - (x₂·y₂)

Next, we remove the parentheses and reorder by moving one a and one c to the left, and finally we can re-group the moved variables:

b = x₁·y₂ + x₂·y₁ + x₁·y₁ - x₁·y₁ + x₂·y₂ - x₂·y₂
b = x₁·y₁ + x₂·y₂ + x₂·y₁ + x₁·y₂ - x₁·y₁ - x₂·y₂
b = (x₁ + x₂)·(y₁ + y₂) - x₁·y₁ - x₂·y₂

And ta-da, as if by magic, we have:

b = (x₁ + x₂)·(y₁ + y₂) - a - c

If you compare this with the original definition of b, you'll notice that the new definition has one multiplication instead of two. The factors are slightly larger than in the original formula, but because there's only one multiplication, we end up doing less work. Of course, we still have to do all those additions and subtractions, but with big enough numbers this is faster than multiplying twice. The complexity of the algorithm drops from O(n²) to O(n^log₂ 3), which is roughly O(n^1.585).
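A quick concrete check in base B = 100, with x = 1234 and y = 5678 (so x₁ = 12, x₂ = 34, y₁ = 56, y₂ = 78):

(define a (* 12 56))              ; a = x₁·y₁ = 672
(define c (* 34 78))              ; c = x₂·y₂ = 2652
(define b (- (* (+ 12 34)         ; one multiplication: (x₁+x₂)·(y₁+y₂)
                (+ 56 78))
             a c))                ; b = 46·134 - 672 - 2652 = 2840
(+ (* a (* 100 100)) (* b 100) c) ; ⇒ 7006652, which is indeed 1234 × 5678

One multiplication (46·134) plus a handful of additions and subtractions has replaced the two multiplications x₁·y₂ and x₂·y₁.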

Keep in mind that O(n^log₂ 3) is only the asymptotic complexity. You can easily tell from the new formula for b that it uses several more operations, and it requires splitting up x and y in two parts, which means allocating memory for them and copying out the parts. This is a lot of overhead, which means that Karatsuba's algorithm only becomes more efficient than the "school book" algorithm for (very) large numbers. The same is true for the FFT-based algorithms mentioned earlier: those improve performance only for even larger numbers, because they split up the numbers into even more pieces. This is a (better) reason why it's probably not worth adding the extra code and complexity of an FFT implementation to CHICKEN core.

In code, a naive implementation of Karatsuba's algorithm is quite simple as well:

;; Where this is called, we ensure that |x| <= |y|
(define (karatsuba-multiply x y)
  (let* ((rs (fx* (signum x) (signum y)))     ; Result's sign (-1 or +1)
         (x (abs x))
         (n (bignum-digit-count y))
         (n/2 (fxshr n 1))                    ; (floor (/ n 2))
         (bits (fx* n/2 *bignum-digit-bits*))
         (x-hi (extract-digits x n/2 #f))     ; x[n..n/2]
         (x-lo (extract-digits x 0 n/2)))     ; x[n/2..0]
    (let* ((y (abs y))
           (y-hi (extract-digits y n/2 #f))   ; y[n..n/2]
           (y-lo (extract-digits y 0 n/2))    ; y[n/2..0]
           (a  (* x-hi y-hi))
           (b  (* x-lo y-lo))
           (c  (* (- x-hi x-lo)
                  (- y-hi y-lo))))
      (* rs (+ (arithmetic-shift a (fx* bits 2))
               (+ (arithmetic-shift (+ b (- a c)) bits)
                  b))))))

Especially helpful here is that the operators prefixed by fx indicate clearly when we're working with fixnums. The code uses absolute values because those are simpler to deal with, especially when using extract-digits to quickly get at a range of bignum digits. It might be possible to make this faster by operating directly on the signed numbers.
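The extract-digits helper itself isn't shown here; conceptually it is just a shift and mask on a non-negative number, roughly like the sketch below. The real version reads the digit array directly, which is what makes it quick.

;; Conceptual model of extract-digits: the digits of x from position
;; start (inclusive) up to end (exclusive), where #f means "to the top".
(define (extract-digits x start end)
  (let ((x (arithmetic-shift x (fxneg (fx* start *bignum-digit-bits*)))))
    (if end
        (bitwise-and x (- (arithmetic-shift 1 (fx* (fx- end start)
                                                   *bignum-digit-bits*))
                          1))
        x)))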

In CHICKEN 5, this algorithm has been translated to C for stupid reasons which I will explain in a later post. But this Scheme code is taken directly from the numbers egg and cleaned up only slightly.

According to MCA, an optimal implementation should avoid allocating the intermediate calculations of (- x-hi x-lo) and (- y-hi y-lo). In a low level C-based implementation like CHICKEN 5's, it might be easier to perform in-place modification of these numbers, but so far I haven't been successful in doing this. Nevertheless, our Karatsuba implementation is efficient enough for now.

Sometimes, the Karatsuba algorithm is referred to as the Toom-Cook algorithm. That's because Toom and Cook figured out a way to generalise it: you can split the numbers into any number of components, instead of the two components Karatsuba used. Apparently there's a sweet spot in number sizes where a 3-way split is faster than a 2-way split, but pretty soon after that, the numbers get big enough for the FFT-based algorithms to overtake Toom-Cook in efficiency.

Faster division

Faster multiplication is interesting, but division is the real tortoise in this race, so let's see how we can speed that up. It turns out that the approaches are rather similar to those of multiplication.

Division by a fixnum

Recall that the division algorithm needs to "guess" how many times the denominator fits in the numerator, based on the first half-digit (plus some magic surrounding the second half-digit). If the denominator is itself only a half-digit, there's no need to guess.

So, just like when multiplying by a small fixnum, we have a destructive division algorithm that operates on a copy of the numerator. The Scheme48 version I started with:

/* Given (denominator > 1), it is fairly easy to show that
   (quotient_high < BIGNUM_RADIX_ROOT), after which it is easy to see
   that all digits are < BIGNUM_RADIX. */

static bignum_digit_type
bignum_destructive_scale_down(bignum_type bignum, bignum_digit_type denominator)
{
  bignum_digit_type numerator;
  bignum_digit_type remainder = 0;
  bignum_digit_type two_digits;
#define quotient_high remainder
  bignum_digit_type * start = (BIGNUM_START_PTR (bignum));
  bignum_digit_type * scan = (start + (BIGNUM_LENGTH (bignum)));
  BIGNUM_ASSERT ((denominator > 1) && (denominator < BIGNUM_RADIX_ROOT));
  while (start < scan)
    {
      two_digits = (*--scan);
      numerator = (HD_CONS (remainder, (HD_HIGH (two_digits))));
      quotient_high = (numerator / denominator);
      numerator = (HD_CONS ((numerator % denominator), (HD_LOW (two_digits))));
      (*scan) = (HD_CONS (quotient_high, (numerator / denominator)));
      remainder = (numerator % denominator);
    }
  return (remainder);
#undef quotient_high
}

And the smaller version based, again, on the division algorithm from Hacker's Delight:

static C_uword bignum_digits_destructive_scale_down(
  C_uword *start, C_uword *end, C_uword denominator)
{
  C_uword digit, k = 0;   /* k: running remainder, combined with the
                           * next half-digit on each step */
  C_uhword q_j_hi, q_j_lo;

  /* Single digit divisor case from Hacker's Delight, Figure 9-1,
   * adapted to modify u[] in-place instead of writing to q[].
   */
  while (start < end) {
    digit = (*--end);

    k = C_BIGNUM_DIGIT_COMBINE(k, C_BIGNUM_DIGIT_HI_HALF(digit)); /* j */
    q_j_hi = k / denominator;
    k -= q_j_hi * denominator;

    k = C_BIGNUM_DIGIT_COMBINE(k, C_BIGNUM_DIGIT_LO_HALF(digit)); /* j-1 */
    q_j_lo = k / denominator;
    k -= q_j_lo * denominator;
    
    *end = C_BIGNUM_DIGIT_COMBINE(q_j_hi, q_j_lo);
  }
  return k; /* The remainder */
}

And, just like with multiplication, if you're dividing by a power of two, there's an easy optimisation: you can simply shift the numerator right by the base-2 logarithm of the denominator. The remainder is formed by the bits that were shifted out.
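A sketch of this, assuming a non-negative numerator n and a denominator d already known to be a power of two:

(define (quotient&remainder-by-shift n d)
  (let ((k (fx- (integer-length d) 1)))     ; d = 2^k
    (values (arithmetic-shift n (fxneg k))  ; quotient: drop the low k bits
            (bitwise-and n (- d 1)))))      ; remainder: keep only those bits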

Burnikel-Ziegler division

And here too, there's a divide and conquer algorithm named after the mathematician (or, in this case, mathematicians) who discovered it. This algorithm is a lot more complicated than Karatsuba's relatively simple algebraic trick, and a lot harder to implement correctly. The paper is long and still not very helpful when it comes down to the crucial details; I found the super-elegant presentation in MCA more helpful in figuring those out. The algorithm's start-up procedure especially is tricky to get right. I will use the explanation style from MCA because it is simpler than the original paper's.

The algorithm is similar to the classical "school book" division algorithm, but "in the large": the basic idea is to operate on partial bignums at a time instead of on half-digits. The core algorithm handles only 2n/n division, meaning the numerator must be twice the size of the denominator. It splits the numerator into two halves. Each half is then divided by the entire denominator and finally the results are recombined. These two divisions are themselves also done in two steps, thereby making the numbers smaller in the recursion.

Because the intermediate division only divides by the first half of the denominator, the result may end up negative. So, like the schoolbook method, this algorithm also makes an adjustment when re-joining the two intermediate results. The core algorithm is as follows:

  • If the denominator is smaller than some limit, fall back to the "primary school" algorithm, otherwise:
  • Split the denominator B in two: B₁·β + B₂. So, if B is a bignum of n limbs, the base β corresponds to half that, i.e. β = base^(n/2).
  • Next, split the numerator A into four such parts: A₁·β³ + A₂·β² + A₃·β + A₄.
  • First half:
    • Divide A₁·β + A₂ by B₁, yielding the guessed quotient Q̂₁ and remainder R₁ (the recursive step).
    • Combine the remainder R₁ with A₃ and subtract Q̂₁·B₂, yielding R̂₁ = R₁·β + A₃ - Q̂₁·B₂.
    • While R̂₁ < 0, adjust the guess: Q̂₁ := Q̂₁-1 and R̂₁ := R̂₁+B.
  • Second half:
    • Divide R̂₁ by B₁, yielding the guessed quotient Q̂₂ and remainder R (another recursive step).
    • Combine the remainder R with A₄ and subtract Q̂₂·B₂, yielding R̂ = R·β + A₄ - Q̂₂·B₂.
    • While R̂ < 0, adjust the guess: Q̂₂ := Q̂₂-1 and R̂ := R̂+B.
  • Recombination of quotient after division:
    • The final quotient is Q̂₁·β + Q̂₂, the final remainder is just R̂.

The interesting part is that B₂ is only ever used for checking the guess. It is not involved in any division. Of course, in the recursion B₂ is also split in two parts, so the high half will be used in the next division, and so on.

In the diagram you can see how it works on a (very) small sample: dividing the numerator A = 34567867 by the denominator B = 3121 in base β = 100, which yields quotient Q = 11075 and remainder 2792:

In the diagram, B₁ = 31 is shown in white on the left in the first and fourth rows. B₂ = 21 is shown in green on the left in the second and final rows. In the first row you also see, highlighted in white, A₁·β + A₂ = 3456 and R₁ = 15. In the second row, A₃ = 78 is shown in white, as it drops down to form R₁·β + A₃ = 1578; the initial guess is Q̂₁ = 111.

Between the second and third rows, Q̂₁ = 111 is adjusted to 110 and R̂₁ = -753 is adjusted to 2368 by adding the denominator B.

In the third row we continue with the second half, dividing R̂₁ by B₁ in the same manner and then recombining R = 43 with A₄ = 67 and subtracting Q̂·B₂ = 1575. As no more adjustments are needed, we're done, with R̂ = 2792 and Q̂₂ = 75. Combine Q̂₁,Q̂₂ into Q = 11075 and we're done!
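In plain arithmetic, the same run looks like this; here the second half's adjustment is written out explicitly, so the intermediate remainder shows up as 12 rather than the diagram's post-adjustment 43 (= 12 + 31):

;; Dividing A = 34567867 by B = 3121, with β = 100, B₁ = 31, B₂ = 21.
;; First half: divide A₁·β + A₂ = 3456 by B₁.
(quotient 3456 31)              ; ⇒ 111, the guess Q̂₁ (remainder R₁ = 15)
(- (+ (* 15 100) 78)            ; R̂₁ = R₁·β + A₃ - Q̂₁·B₂
   (* 111 21))                  ; ⇒ -753, negative!
(+ -753 3121)                   ; ⇒ 2368: adjust Q̂₁ := 110, R̂₁ := R̂₁ + B
;; Second half: divide R̂₁ = 2368 by B₁.
(quotient 2368 31)              ; ⇒ 76, the guess Q̂₂ (remainder R = 12)
(- (+ (* 12 100) 67)            ; R̂ = R·β + A₄ - Q̂₂·B₂
   (* 76 21))                   ; ⇒ -329, negative again!
(+ -329 3121)                   ; ⇒ 2792: adjust Q̂₂ := 75, R̂ := R̂ + B
;; Recombine: Q = Q̂₁·β + Q̂₂ = 110·100 + 75 = 11075, remainder 2792.
(+ (* 11075 3121) 2792)         ; ⇒ 34567867, the original numerator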

Burnikel and Ziegler present the algorithm in their paper in a bit of a roundabout way that didn't make sense to me at first. It requires understanding the big picture, which they don't really explain up front. So it's best to read the paper entirely, and then go back and re-read it to grasp the details. It's a bit bottom-up in a sense, because they refactor it into two algorithms; one for dividing numbers 2n/n, and one for dividing numbers 3n/2n. This confused me no end, as it resulted in a bit of a cyclic definition.

In the explanation I gave above, 2n/n is the overall algorithm as a whole. The first and second "halves" of the algorithm are really identical, and represented by Burnikel and Ziegler as two calls to the 3n/2n algorithm. This can be seen in the Scheme code below:

(define (digit-bits n) (fx* *bignum-digit-bits* n))    ; Small helper

;; Here and in 2n/1n we pass both b and [b1, b2] to avoid splitting
;; up the number more than once.  This is a helper function for 2n/n.
(define (burnikel-ziegler-3n/2n a12 a3 b b1 b2 n)
  (receive (q^ r1)
      (if (< (arithmetic-shift a12 (fxneg (digit-bits n))) b1)
          (let* ((n/2 (fxshr n 1))                     ; (floor (/ n 2))
                 (b11 (extract-digits b1 n/2 #f))      ; b1[n..n/2]
                 (b12 (extract-digits b1 0 n/2)))      ; b1[n/2..0]
            (burnikel-ziegler-2n/1n a12 b1 b11 b12 n))
          ;; Don't bother dividing if a1 is at least as large as b1.
          ;; We use a maximum guess instead (see the paper for proof).
          (let ((base*n (digit-bits n)))
            (values (- (arithmetic-shift 1 base*n) 1)  ; B^n-1
                    (+ (- a12 (arithmetic-shift b1 base*n)) b1))))
    (let ((r1a3 (+ (arithmetic-shift r1 (digit-bits n)) a3)))
      (let lp ((r^ (- r1a3 (* q^ b2)))
               (q^ q^))
        (if (negative? r^)
            (lp (+ r^ b) (- q^ 1))                     ; Adjust!
            (values q^ r^))))))

;; The main 2n/n algorithm which calls 3n/2n twice.  Here, a is the
;; numerator, b the denominator, n is the length of b (in digits) and
;; b1 and b2 are the two halves of b (these never change).
(define (burnikel-ziegler-2n/1n a b b1 b2 n)
  (if (or (fxodd? n) (fx< n DIV-LIMIT))                ; Can't recurse?
      (quotient&remainder a b)                         ; Use school division
      (let* ((n/2 (fxshr n 1))
             ;; Split a and b into n-sized parts [a1, ..., a4] and [b1, b2]
             (a12 (extract-digits a n #f))             ; a[m..n]
             (a3  (extract-digits a n/2 n))            ; a[n..n/2]
             (a4  (extract-digits a 0 n/2)))           ; a[n/2..0]
        ;; Calculate high quotient and intermediate remainder (first half)
        (receive (q1 r1) (burnikel-ziegler-3n/2n a12 a3 b b1 b2 n/2)
          ;; Calculate low quotient and final remainder (second half)
          (receive (q2 r) (burnikel-ziegler-3n/2n r1 a4 b b1 b2 n/2)
            ;; Recombine quotient parts as q = [q1,q2]
            (values (+ (arithmetic-shift q1 (digit-bits n/2)) q2) r))))))

The reason b1, b2 are passed in but not a1, ..., a4 has to do with the "full" algorithm, which deals with unbalanced division where a may be bigger than 2n, given b of size n. There, b is constant, so it pays off to "cache" b1 and b2. Because a keeps changing, we don't cache it.

This full algorithm for dividing two numbers x and y of arbitrary lengths is as follows: If the denominator y is of size n, we can simply chop up the numerator x into n-sized pieces. We then perform a division algorithm on those pieces, using a sort of "sliding window" over x. This passes [x_{i+1},x_i] and y to 2n/n, and recombines the remainder r_i with x_{i-1} to get [r_i,x_{i-1}], which is used for 2n/n in the next iteration, and so on.

Well, in theory it's simple...

(define (quotient&remainder/burnikel-ziegler x y return-quot? return-rem?)
  ;; The caller will already have verified that abs(x) > abs(y), but we
  ;; need to remember the signs of the input and make them absolute.
  (let* ((q-neg? (if (negative? y) (not (negative? x)) (negative? x)))
         (r-neg? (negative? x))
         (x (abs x))
         (y (abs y))
         (s (bignum-digit-count y))
         ;; Define m as min{2^k|(2^k)*DIV_LIMIT > s}.
         ;; This ensures we shift as little as possible (less pressure
         ;; on the GC) while maintaining a power of two until we drop
         ;; below the threshold, so we can always split N in half.
         (m (fxshl 1 (integer-length (fx/ s DIV-LIMIT))))
         (j (fx/ (fx+ s (fx- m 1)) m))  ; j = s/m, rounded up
         (n (fx* j m))
         ;; Normalisation, just like with normal school division
         (norm-shift (fx- (digit-bits n) (integer-length y)))
         (x (arithmetic-shift x norm-shift))
         (y (arithmetic-shift y norm-shift))
         ;; l needs to be the smallest value so that a < base^{l*n}/2
         (l (fx/ (fx+ (bignum-digit-count x) n) n))
         (l (if (fx= (digit-bits l) (integer-length x)) (fx+ l 1) l))
         (t (fxmax l 2))
         (y-hi (extract-digits y (fxshr n 1) #f))   ; y[n..n/2]
         (y-lo (extract-digits y 0 (fxshr n 1))))   ; y[n/2..0]
    (let lp ((zi (arithmetic-shift x (fxneg (digit-bits (fx* (fx- t 2) n)))))
             (i (fx- t 2))
             (quot 0))
      (receive (qi ri) (burnikel-ziegler-2n/1n zi y y-hi y-lo n)
        (let ((quot (+ (arithmetic-shift quot (digit-bits n)) qi)))
          (if (fx> i 0)
              (let ((zi-1 (let* ((base*n*i-1 (fx* n (fx- i 1)))
                                 (base*n*i   (fx* n i))
                                 ;; x_{i-1} = x[n*i..n*(i-1)]
                                 (xi-1 (extract-digits x base*n*i-1 base*n*i)))
                            (+ (arithmetic-shift ri (digit-bits n)) xi-1))))
                (lp zi-1 (fx- i 1) quot))
              ;; De-normalise remainder, but only if necessary
              (let ((rem (if (or (not return-rem?) (eq? 0 norm-shift))
                             ri
                             (arithmetic-shift ri (fxneg norm-shift)))))
                ;; Return values (quot, rem or both) with correct sign:
                (cond ((and return-quot? return-rem?)
                       (values (if q-neg? (- quot) quot)
                               (if r-neg? (- rem) rem)))
                      (return-quot? (if q-neg? (- quot) quot))
                      (else (if r-neg? (- rem) rem))))))))))

As you can see, this procedure is extremely hairy. The trickery is in how the bignums are chopped up into n-sized pieces. The sizes we use should be nice powers of two, which improves the algorithm's effectiveness. Notice the (or (fxodd? n) (fx< n DIV-LIMIT)) check in 2n/1n; whenever that is true, we fall back to the school division. This has to be avoided as much as possible, so that's why we try to scale up the number x to nicely rounded multiples of n. At the same time, you have to make sure that the numbers don't grow too large, because that would create more work for the algorithm!

The particular calculation is tricky, but the idea is simple: you want to round the block size up to the closest power of two that's larger than the cutoff size. Then, the numerator is scaled up so that its size is a multiple of n, the final size of the denominator. No doubt my implementation of this part of the algorithm can be simplified.
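To make the sizing concrete, here's a small worked example. DIV-LIMIT = 3 is an artificially small value chosen for readability; the actual threshold in CHICKEN differs.

;; Suppose DIV-LIMIT were 3 and s = 10 (the denominator has 10 digits):
(fxshl 1 (integer-length (fx/ 10 3)))   ; ⇒ m = 4, the smallest power of two
                                        ;   with m * DIV-LIMIT > s
(fx/ (fx+ 10 (fx- 4 1)) 4)              ; ⇒ j = 3, i.e. s/m rounded up
(fx* 3 4)                               ; ⇒ n = 12: y is padded from 10 to
                                        ;   12 digits
;; In the recursion, n halves as 12 → 6 → 3; at n = 3 (odd) the 2n/1n
;; procedure finally falls back to school division.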

Reading list

First, start with the reading list of the previous post, because most of those references discuss advanced algorithms as well. The ones below are either more specific or more advanced than the descriptions you'll find in the standard references.