About this blog

Hi, I'm Peter Bex, a Scheme and free software enthusiast from the Netherlands. See my user page on the CHICKEN wiki or my git server for some of my projects.


Are you in need of support for CHICKEN Scheme? Or maybe you want clear tech writing like on this very blog? Then you're in luck! I am now available as a freelance consultant!

The 3 most recent posts (archive) Atom feed

As you may know, I co-maintain the uri-generic egg, together with Ivan Raikov. We had just been working on fixing a bug and porting it to CHICKEN 5 when I stumbled across the WHATWG URL specification, an evolution over RFC 3986. I found it hard to believe they dropped the formal grammar from the RFC, so I checked the issue queue and found a closed ticket from 2015.

They replaced the BNF with a series of steps which is several pages long and overly concerned with implementation-specific details.

It really got to me that such an important and basic part of the web stack is so informally specified. So I wrote an appeal to them to restore a formal grammar in this ticket. I think the reasons are worth being spread more widely, so I'm reproducing it here on my blog.

My request

I would like to offer my opinion from an implementor's perspective and hopefully convince the WG to restore a formal grammar. Let me start by providing some background on where I'm coming from. Feel free to skip this next section.

My background

I am the co-maintainer of the uri-generic egg for CHICKEN Scheme. This implementation attempts to follow RFC 3986 to the letter, and this has resulted in what IMO is a very high-quality implementation (at least, as far as parsing is concerned; URL construction still has some known issues). Oftentimes when we ran into issues, we've compared it with other implementations. It turns out that many of these are lacking in some way or another. I think the main reason is that they're not attempting to really implement the formal grammar (even if they claim to be RFC compliant), while we do. We even have a growing repository of alternative implementations using different parser generators which all pass the same test suite! (feel free to now call me a smug Lisp/Scheme weenie :) )

I wasn't aware of the WHATWG spec until I saw it mentioned in a libcurl post. It piqued my interest because I'm always looking for more test cases. The web platform test suite looks like a big, juicy set to start using in our egg's tests. I'd also consider implementing the WHATWG spec if this increases compatibility with other implementations.

What I expect from a spec

As an implementor, I routinely check the RFC's ABNF as a guide to determine what a valid URL should look like. If someone finds a certain URL our implementation doesn't parse, or if it parses an URL that it shouldn't, the first thing I do is go back to the ABNF in the RFC to verify the behaviour. It is compact, to the point and, for a trained eye, it is trivial to quickly determine if a parser should accept a given (sub)string or not.

The collected ABNF of RFC 3986 is a brief three screenful. In contrast, the algorithm in the WHATWG spec is roughly eighteen screenful. It is an overly detailed and nonstandard way of defining a grammar. This makes it harder to determine which language is accepted by this algorithm. It also makes it hard for me to determine what the changes are, compared to the RFC. Implementing the WHATWG spec would (for me) involve a complete rewrite.

The specification is so focused on the mechanics of a specific manual parsing technique that it almost precludes parser generators or other implementations. Parser generators have a long tradition in theory and practice, and can generate **efficient** language recognisers. Even today, it is an active research field; PEG grammars for example have been "discovered" as recently as 2004.

The way I think about it is that the purpose of this spec is to define what a URL "officially" looks like. So, as an implementor, I don't understand the hesitation to supply a formal grammar. Not having one will likely result in different people interpreting the spec differently. This results in _less_ interoperability, which defeats the point of a spec.

Other reasons why I think a formal grammar is important

Finally, I would like to emphasise the importance of parsers based on formal grammars over ad hoc ones for security reasons. Let's say you have a pipeline of multiple processors which use different URL parsers. For example, you might have a HTML parser on a comment form which cleans URLs by dropping JavaScript and data URLs, among other things, or a mail client which blocks intranet or file system-local URLs before invoking an HTML viewer. If these are all ad hoc "informal" parsers that try to "fix" syntactically invalid URLs, it is nigh-impossible to verify that filtering them for "safe" URLs is correct. That's because it's impossible to decide which language is really accepted by an ad hoc implementation. An implementation further down the stack might interpret an URL (radically) different from one up the stack and you have a nice little exploit in the making.

If you're not convinced by my measly attempts at explaining this idea, please watch the talk "The Science of Insecurity". Meredith Patterson states the case much more eloquently than I ever could. This talk was an absolute eye-opener for me.

With this context, it baffled me to read the statement that "there are several large parts of the spec that cannot be captured by any kind of grammar". This is literally equivalent to saying "we can't know if an URL will be valid without evaluating the algorithm". This means you cheerfully drag the halting problem into what should be a simple, straightforward notation (come on, URLs aren't **that** ill-defined!). As far as I can tell, the RFC defines a regular grammar. The decision to go from a regular to an unrestricted grammar should not be taken lightly!

Flattr this!  Bitcoin  (why?)

We're getting close to a CHICKEN 5 release, so let's take a look at the cool new stuff!

Overhaul of built-in modules

The biggest change you'll notice when you fire up CHICKEN and start to use it is that the modules that come shipped with core are completely different from CHICKEN 4. The functionality is mostly the same, but we moved things around (a lot!) to make things more logical.

This is also the main reason we decided to bump the major version number: the modules have different names, procedures have been renamed, merged or dropped.

You can take a look at the complete list in the CHICKEN 5 manual. We've taken the module layout from R7RS small as inspiration, but since CHICKEN is still an R5RS Scheme first (with r7rs being an optional extension) we had to make some changes.

So, we define a scheme module which contains the entire R5RS language. For everything that is a CHICKEN-specific extension to standard R5RS Scheme, we put it under a (chicken ...) name, which tries to follow the R7RS naming conventions.

For example, R7RS defines a (scheme process-context) module with the following procedures:

  • command-line
  • exit
  • emergency-exit
  • get-environment-variable
  • get-environment-variables

Likewise, CHICKEN defines a (chicken process-context) module, which is a superset of the corresponding R7RS module. Take a look at its manual page; you can see that it defines many more procedures, but it includes all the standard ones too.

By using the R7RS names but with scheme replaced by chicken, the new modules should be easy to remember for anyone used to R7RS. Of course, you can still write portable standard R7RS programs via the r7rs egg, which defines a 100% compatible (scheme process-context) module with only the R7RS identifiers.

There is one important caveat: Because our scheme modules exports everything from R5RS Scheme, we don't provide, say, a (chicken cxr) module for all the cadadr, caddar and so on, because those are all in scheme. This also means that the (chicken load) module does not export load; that's already in scheme. Instead, it defines various non-standard CHICKEN extensions like load-relative and such.

Saner module imports

Speaking of modules, we've improved the way modules are linked into user code. In CHICKEN 4, there's a very strict distinction between modules and (compilation) units. This was an endless source of confusion for beginners. For example, why did (import foo) give an error when you tried to actually refer to an identifier from the foo module? That's because import didn't actually load the code, just the import library. To actually load the code and import the library, you needed (use foo). You could also load the code without importing it via (require-library foo). This should help with cross-compilation. The idea was that you would only need to load the import library on the host, and have the library itself compiled for the target, but in practice you needed to compile the library twice anyway (once on the host, once for the target).

We got rid of this mess: now the canonical way to import the foo library is simply (import foo). For more info, see this post by Felix outlining how to improve imports.

Full numeric tower

Of course, support for the full numeric tower is a personal favorite of mine, having spent a lot of time to perfect this stuff!

Most importantly, this means you no longer need to worry about integer computations over- or underflowing into a flonum and all the weird floating-point problems that entails. Bignums are also a necessity when dealing with 64-bit numeric C types in the FFI. For example, we finally support the size_t type correctly. To me, complex numbers and exact fractions (aka rational numbers) are a nice added bonus, as you could already get them before with the numbers egg. However, by having these types built-in, they're more efficient and you don't have to worry about passing these numbers to code that can't handle them because support happened not to be compiled in.

Take some time to read my blog series about the numeric tower if you're interested in the details.

Declarative egg description language

The chicken-install program to install eggs was rewritten along with all the surrounding tools. The main reason to do this was to make the life of package maintainers easier.

The old version of chicken-install would download, build, install and (optionally) run the unit tests as part of one command. If any dependencies were missing, it would also recursively download, build, install and run tests for those as well. The new version cleanly separates these steps, by generating shell scripts (batch files on Windows) that can do the necessary actions to build and install.

To make this easier, we also had to re-think the egg "language". In CHICKEN 4, a .setup-file was simply a Scheme program in which a few helper procedures were available for calling the compiler. This means it's impossible to create a simple shell script that will separate the build and install steps. That's why we now have a separate, declarative file which describes the components of an egg. See the .egg file documentation for a concrete example.

The rewritten chicken-install will now also cache eggs to avoid re-downloading the same eggs again and again. By default the cache is stored in a dot-directory under the user's home directory. This can be overridden with the CHICKEN_EGG_CACHE environment variable, which might also help package maintainers take the distributed files from another location.

See these design notes for more information about the goals and motivations behind the rewrite.

Improved support for static compilation

In principle, CHICKEN 4 has good support for static compilation. In practice, egg authors would not include the necessary commands for building their libraries statically. Most people don't have a real need for static linking, which means they tend not to make an effort to support it just in case someone else might need it.

The upshot of this was that you could only really compile programs statically when they didn't use any eggs, or if you created a custom build script that would compile the eggs manually with the required -static option. With the new chicken-install, you get static compilation support automatically, for free.

Note that in CHICKEN 4, you could also build eggs and programs using the so-called deployment mode. This allowed shipping a program with all its libraries in one directory. This worked quite well if your target platform supported it, but not all platforms did. Static compilation covers all the use cases that deployment supported and works reliably on all platforms, so we decided to drop deployment mode with all the complexity it brings.

Other noteworthy things

But wait, there's more!

  • Code generation is now fully deterministic, making builds reproducible. This allows you to verify that any given file of generated C code corresponds to the Scheme source code by recompiling it with the same CHICKEN version, both for user code and for CHICKEN core itself. As an added bonus, because the generated C output is deterministic, ccache can be used to get much faster builds (before, it would invalidate the cache as each file would be different).
  • We've improved how symbols are garbage collected, which was optional and somewhat broken in CHICKEN 4. This will speed up code that generates many symbols, and stops symbol table stuffing attacks from being a threat.
  • We have removed quite a bit of bloat: The srfi-1, srfi-13, srfi-14, srfi-69 and srfi-18 libraries have been removed from core! Not to worry though; they are now available as eggs. This will both allow faster development and encourage innovation and competition from alternatives to these non-essential libraries (especially R7RS-large seems to be geared towards renewal of some of these). We've also moved several non-SRFI procedures from core: object-evict, compile-file, binary-search, procedures for dealing with queues, scan-input-lines and POSIX group-information have all been moved to eggs. Support for SWIG has been removed, as it was bit-rotting and nobody seemed to be using it anyway.
  • Ports can now be bi-directional, so there's no more unnecessary distinction between input-ports and output ports. This maps more cleanly to file descriptor semantics, which can also be opened for both reading and writing.
  • Random number generation has been completely replaced. Before, we used libc's rand(), which produces very low quality random numbers. CHICKEN 5 uses the WELL512 PRNG to generate random integers, and it provides access to the system entropy pool for generating cryptographically secure streams of random bytes (using /dev/urandom on *nix, and RtlGenRandom on Windows).

Conclusion

There's a lot to like about the new CHICKEN, so go ahead and give it a spin! Release candidate 1 was made available today for you to try. The full list of changes can of course be found in the NEWS file. If you're already a happy CHICKEN 4 user, we've created a porting guide for you, to make it easier to make the transition from 4 to 5. If you need more help, you can of course contact the always friendly CHICKEN community.

Flattr this!  Bitcoin  (why?)

Now that we have covered the most important algorithms, it's time to take a look at the internal data representation of extended numerals. This will be the final part in this series of posts.

Ratnums and cplxnums

Recall from my data representations article that fixnums are represented as immediate values, directly within a machine word. Flonums are represented in boxed form, by putting them in an opaque bytevector-like structure.

The data representations of complex numbers and rational numbers are pretty simple. Each have their own type tag, and they both contain two slots: the numerator and denominator in case of rational numbers, and the real and imaginary parts in case of complex numbers.

As you can see in the above diagram, the representations of ratnums and cplxnums are very similar. In the example, the slots contain just fixnums. Rational numbers are the simplest here: they can only contain integers (bignums or fixnums). Complex numbers can consist of any number type except other complex numbers, but the exactness of the real and imaginary components must match. This means you can't have 1.5+2/3i, for example.

In its most complex (haha!) form, a complex number contains a rational number in both the real and the imaginary parts, and these rational numbers both contain bignums as their numerator and denominator. In this situation, the entire complex number takes up a whopping minimum of 29 machine words: 3 words for the wrapper complex number, 2 times 3 words for each of the ratnums, and 4 times 5 words for the bignums inside the ratnums.

We'll now look into why bignums require at least 5 words.

Bignums

Initially I tried to represent bignums as a kind of opaque bytevector, much like how flonums are stored. Memory-wise this is the best representation as it has no unnecessary overhead: only 2 extra words; one for the header and one for the sign. On a 32-bit machine it would look like this:

This representation is somewhat wasteful, because it uses a full machine word to represent the sign, which is only one bit of information! This is done to keep the bignum digits word-aligned, which is important for performance. The sign could be shoved into the header if we really wanted to be frugal on memory, but doing so would also complicate type detection. Alternatively, we could store the bignum's digits in 2s complement form so the sign is simply the high bit of the top digit, but that complicates several algorithms.

Regarding the "bytevector" part: because the limbs are word-aligned, it makes more sense to represent the size in words rather than bytes. Unfortunately, there's no way to do this with the current data representation of CHICKEN. This was the direct cause of the following bug: Someone tried to represent the largest known prime number in CHICKEN, and it failed to behave correctly because we didn't have enough header bits to represent its size. This was just for fun, so no harm was done, but when someone will actually need such numbers in practice, they're out of luck. One of these days we're going to have to tackle this problem...

Performance takes a hit

When I first integrated the "numbers" egg into CHICKEN 5, I also did some benchmarking. It turned out that my initial version made some benchmarks up to 8 times slower, though on average it would slow things down by a factor of 2. As pointed out by Alex Shinn and Felix Winkelmann, the reason it impacts some benchmarks so badly has to do with allocation.

Let's compile a loop going from zero to n, like so:

;; Very silly code, calculates 100 * 100 in a stupid way
(let lp ((i 0))
  (if (= i 100)
      (* i i)
      (lp (add1 i))))

Originally, in CHICKEN 4 without the full numeric tower, the compiled C code looked like this:

/* lp in k207 in k204 in k201 */
static void C_fcall f_220(C_word t0,C_word t1,C_word t2){
  C_word tmp;
  C_word t3;
  C_word t4;
  C_word t5;
  C_word t6;
  C_word *a;
loop:
  C_check_for_interrupt;
  if(!C_demand(C_calculate_demand(4, 0, 2))) {
    C_save_and_reclaim_args((void *)trf_220, 3, t0, t1, t2);
  }
  a=C_alloc(4); /* Allocate flonum for overflow situation */
  if(C_truep(C_i_nequalp(t2, C_fix(100)))) {
    t3=t1;
    {
      C_word av2[2];
      av2[0] = t3;
      av2[1] = C_a_i_times(&a, 2, t2, t2);
      ((C_proc)(void*)(*((C_word*)t3+1)))(2, av2);
    }
  } else {
    t3 = C_a_i_plus(&a, 2, t2, C_fix(1));
    C_trace("test.scm:4: lp");
    t5=t1;
    t6=t3;
    t1=t5;
    t2=t6;
    goto loop;
  }
}

It's not much to look at, but this is very close to optimal code: It's a C loop, which allocates a fixed size of memory from the stack/nursery into which it can write the result of + or *, in case they would overflow.

The compiler knows how it can compile + and * to "inlineable" C functions. Many of the most performance-critical functions are built into the compiler like that. But because the compiler (currently) doesn't perform range analysis, it's not smart enough to figure out that none of these operators in this example can cause an overflow. This bites us especially hard when introducing bignums: because we need to assume that any operator may overflow, we must be able to allocate a bignum. And assuming the result of these operators may be bignums, the next iteration of the loop is a bignum. Adding two bignums of unknown sizes together results in another bignum of unknown size.

Because of the above, we can't pre-allocate in a tight C loop. Instead, we must split our loop in two. This is needed to allow the garbage collector to kick in: if you'll recall from the garbage collector post, we need a continuation both for liveness analysis and as a place to jump back to after GC.

One part of our lp calls an allocating procedure, wrapping up the other part in a continuation:

/* k251 in lp in k223 in k220 in k217 in k214 (first part of our "lp") */
static void C_ccall f_253(C_word c, C_word *av) {
  C_word tmp;
  C_word t0 = av[0];
  C_word t1 = av[1];
  C_word t2;
  C_word *a;
  C_check_for_interrupt;
  if(!C_demand(C_calculate_demand(0, c, 2))) {
    C_save_and_reclaim((void *)f_253, 2, av);
  }
  C_trace("test.scm:6: lp");
  t2 = ((C_word*)((C_word*)t0)[2])[1];
  f_236(t2, ((C_word*)t0)[3], t1);
}

/* lp in k223 in k220 in k217 in k214 (second part of our "lp") */
static void C_fcall f_236(C_word t0, C_word t1, C_word t2){
  C_word tmp;
  C_word t3;
  C_word *a;
  C_check_for_interrupt;
  if(!C_demand(C_calculate_demand(4, 0, 3))) {
    C_save_and_reclaim_args((void *)trf_236, 3, t0, t1, t2);
  }
  a=C_alloc(4);
  if(C_truep(C_i_nequalp(t2, C_fix(100)))) {
    C_trace("test.scm:5: *");
    {
      C_word av2[4];
      av2[0] = C_SCHEME_UNDEFINED;
      av2[1] = t1;
      av2[2] = t2;
      av2[3] = t2;
      C_2_basic_times(4,av2);
    }
  } else {
    /* Allocate continuation of (add1 i), which is the first part (f_253) */
    t3=(*a = C_CLOSURE_TYPE|3, a[1] = (C_word)f_253,
        a[2] = ((C_word*)t0)[2], a[3] = t1, tmp = (C_word)a, a += 4, tmp);
    C_trace("test.scm:6: add1");
    {
      C_word av2[4];
      av2[0] = C_SCHEME_UNDEFINED;
      av2[1] = t3;
      av2[2] = t2;
      av2[3] = C_fix(1);
      C_2_basic_plus(4,av2);
    }
  }
}

As you can imagine, allocating a continuation on the stack every time is pretty heavy, and function calling isn't as cheap as a goto loop either. The first part of the loop doesn't even do anything. It just acts as a continuation to be received by the plus call. You can probably imagine how terrible the code would look if we compiled something like (/ (* (+ a b) (+ c d)) 2). That's at least 4 continuations, instead of a few simple statements.

For this reason, my patch was rejected (and rightly so!). The message was clear: code that doesn't use bignums should never pay a performance penalty just because bignums exist.

In order to fix this situation, I had to come up with a radical change to how bignums worked, or face the possibility that a full numeric tower would not make it into CHICKEN 5.

Adding a new "scratch space" memory region

If we want to make the extended numeric operators as fast as the originals, we must be able to inline them. This prevents garbage collection, because we don't get the continuation for an inlined call. But what if they allocate some unknown quantity of memory? We can't allocate on the stack or heap, because that could cause either to fill up, requiring a GC.

So, the obvious solution is to allocate these objects elsewhere. A separate memory space in which bignums can be stored. But what if that space fills up? Don't we need to initiate a GC then? But this is where we're in luck: bignums are not compound objects! They are huge slabs of opaque data, much like strings. Because they can't refer to other objects, we are dealing with a simplified garbage collection problem: only the objects pointing to a bignum need to be updated.

Unfortunately, finding all the live objects that point to a bignum would be quite difficult. Luckily, like many problems in computer science, this can be easily solved by adding another level of indirection. While we're calling inline functions, we can allocate small objects on the stack, which will remain there, never moving until the next GC. We can use this to our advantage: whenever a bignum is needed, we allocate a fixed-size wrapper object on the stack. This object points into the scratch space, where the actual bignum data lives. See the following diagram:

In the diagram, we have a bignum representing the number 24386824307922, which we've put in a list and a vector, and we also have the rational number 1/24386824307922, which refers to the same bignum in its denominator. All these objects can be on the stack or on the heap. We have no control over them; the user can set any object slot to hold the bignum. We do have control over the wrapper object, and only the wrapper object directly points into scratch space. Because bignums are opaque objects in Scheme, the wrapper is invisible. Thus, user code is (in principle) unable to access the wrapper's data slot, so there will be no direct references to the bignum data portion. This means we're free to move it around without updating anything but the wrapper object's slot.

Note that in the scratch space, we also store a back-pointer to the wrapper object's slot. This allows us to update the wrapper object after moving its matching bignum data blob around. This way, we can reallocate the scratch space when more space is needed.

Some of the native functions like Karatsuba multiplication or Burnikel-Ziegler division generate many temporary values. All such hand-written code has been tuned to erase a bignum's back-pointer when that bignum is no longer needed. It makes the code quite a bit hairier, but it allows (limited) garbage collection to be done when reallocating the scratch space.

With this setup, all numeric operations only need to allocate memory to hold a bignum wrapper object. This is a fixed size, much like in CHICKEN 4, and it means numeric operations can once again be inlined!

Oh, and why a bignum takes up 5 words? Well, sometimes we know that a procedure receives 2 fixnums. In that case, we can pre-allocate a bignum for the case when it overflows. Because we know in its maximum size in advance, there's no need for the scratch space; we can just allocate it in the nursery. For uniformity reasons, such a bignum still requires a wrapper object (2 words) and a bignum data blob (3 words: its header, the sign and one limb). This sounds complicated, but it shortens the specialised code for two fixnums, and allocating only in the nursery is also faster.

Some parting thoughts

Adding full numeric tower support has been extremely educational for me. I'm not really a math person, but having a clear goal like this motivated me to dive in deep into the literature. Overall, I'm happy with how it turned out, but there are always improvements.

For example, instead of doing everything in C it would (of course!) be preferable to do it all in Scheme. Unfortunately, CHICKEN's design makes it hard to do this in an efficient way: it's currently impossible to export Scheme procedures that can be called inline (i.e., non-CPS calls) without copying their full code into the call site. If we can find a way, it would be possible to do 90% of the algorithms in Scheme. The principle on which this would be based can be found in an old paper about the Lucid Common Lisp implementation. Basically, you implement a handful of primitives natively, and everything else can be done in Lisp. For example, SBCL is implemented this way too.

As far as I can tell, of the more popular Scheme implementations, Gambit is the only one that actually does this. I've been very impressed with Gambit in general. Besides having pretty readable Scheme code for bignum algorithms, Gambit has some superior bignum algorithms, most get close to (and in rare cases even surpass) GMP performance. This is mostly due to the hard work of Bradley Lucier, a numerical analyst who has also provided useful feedback on some of my work on the numbers egg, and this series of blog posts. He really knows this stuff! Most other Scheme implementations are in C and still pretty slow due to the algorithms they use, unless of course they use GMP.

In CHICKEN, there is a lot of room for optimisations. But I also think we shouldn't try to implement every algorithm under the sun. Things should generally be fast enough to serve the most common cases. Typical code doesn't use bignums, and if it does it's only small bignums (for instance, when using 64-bit integers with the FFI), which is why I think we should optimise for these cases. For example, my implementations of Karatsuba and Burnikel-Ziegler aren't great, so if anyone feels like having a stab at improving these things we already have (or simply replacing them with a better algorithm), please do!

References

Flattr this!  Bitcoin  (why?)

Older articles...
Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 3.0 License. All code fragments on this site are hereby put in the public domain.