VCS-independent distribution of language extensions

Today I'd like to talk about how CHICKEN Scheme handles distribution of language extensions (which we call "eggs"). There are some unique features of our setup that might be interesting to users of other languages as well, and I think the way backwards compatibility was kept is rather interesting.

In the beginning

First, a little bit of history, so you know where we're coming from. CHICKEN was initially released in the year 2000, and the core system was available as a tarball on the website. In 2002 it was moved into CVS and in 2004 to Darcs (yes, there were good open source DVCSes before Git).

Throughout this period, eggs were simply stored as tarballs (curiously bearing a ".egg" extension) in some well-known directory on the CHICKEN website. The egg installation tool had this location built in. For example, the egg named foo would be fetched from http://www.call-with-current-continuation.org/eggs/foo.egg.

To contribute (or update!) an extension, you simply sent a tarball to Felix and he would upload it to the site. This was a very centralised way of working, creating a lot of work for Felix. So in 2005, he asked authors to put all eggs in a version control system: Subversion. At the time, every contributor was given write access to the entire repo! These were simpler times, when we had only a handful of contributors.

The switch to Subversion allowed for a neat trick: whenever an egg was modified, it triggered a "post-commit hook" which tarred up the egg and uploaded it to the website. This was a very simple addition which automated the work done by Felix, while ensuring the existing tools did not have to be modified. Egg authors now had the freedom to modify their code as they liked, and new releases would appear for download within seconds.

If an author used the conventional trunk/tags/branches layout, the post-commit hook automatically detected this and would upload the latest tag. In other words, we reached a level of automation where "making a release" was simply tagging your code!
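To make the tag detection a bit more concrete, here is a rough sketch in Python (used purely for illustration; the actual hook is not shown in this post). The names pick_release and version_key are hypothetical, and so is the rule that "latest" means the highest dotted version number.

```python
import re
from typing import List, Optional


def version_key(tag: str):
    """Sort key for tags like "1.10" or "2.0-rc1".

    Numeric parts compare as numbers and sort before textual parts.
    This ordering rule is an assumption, not the hook's documented behaviour.
    """
    parts = re.split(r"[.\-]", tag)
    return [(0, int(p), "") if p.isdigit() else (1, 0, p) for p in parts]


def pick_release(top_level: List[str], tags: List[str]) -> Optional[str]:
    """Return the tag to publish, or None if the egg has no tagged release.

    top_level: names found directly in the egg's directory.
    tags: names found under its tags/ directory (empty if there is none).
    """
    uses_layout = "trunk" in top_level and "tags" in top_level
    if uses_layout and tags:
        return max(tags, key=version_key)
    # No conventional layout: the caller would package the directory as it is.
    return None
```

With this ordering, "1.10" counts as newer than "1.9", which a plain string comparison would get wrong.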

Documentation for eggs originally lived on the same website as the eggs did, but this was eventually moved into svnwiki, one of the first wikis to use Subversion as a backing store. To make things even simpler, the core system was also moved into Subversion. Now everything was in one system, for everyone to hack on, using the same credentials everywhere. Life was good!

Start of the DVCS wars

This worked great for years, and the number of contributors steadily increased. Meanwhile, distributed version control systems were gaining mainstream popularity, and contributors started experimenting with Git, Mercurial, Bazaar and Fossil. People grumbled that CHICKEN was still on Subversion.

The next major release, CHICKEN 4.0, provided a "clean slate", with the opportunity to rewrite the distribution system. This simplified things, replacing the brittle post-commit script with a CGI program called "henrietta", which would serve the eggs via HTTP. The download location for eggs was put into a configuration file, which allowed users to host their own mirror. This is useful if, for example, a company wants to set up a private deployment server containing proprietary eggs. We also gained a mirror for general use, graciously provided by Alaric.

The difference was that there was no longer a static tarball: when you downloaded an egg, its files would be served straight from svn, a local directory tree or a website. If we ever decided to migrate the egg repository to a completely different version control system, we could simply add a new back-end to Henrietta. Nothing would have to be modified on the client.
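The back-end idea can be sketched as follows. This is Python for illustration only, not Henrietta's actual code; the Backend interface, both classes, serve_egg and the root/egg/version directory layout are all assumptions.

```python
from pathlib import Path
from typing import Dict, Protocol, Tuple


class Backend(Protocol):
    """Anything that can produce the files of one egg release."""

    def fetch(self, egg: str, version: str) -> Dict[str, bytes]: ...


class LocalDirBackend:
    """Reads eggs from a directory tree laid out as root/egg/version/..."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def fetch(self, egg: str, version: str) -> Dict[str, bytes]:
        base = self.root / egg / version
        return {
            p.relative_to(base).as_posix(): p.read_bytes()
            for p in sorted(base.rglob("*"))
            if p.is_file()
        }


class InMemoryBackend:
    """Stand-in for a VCS back-end; a real one would query svn, git and so on."""

    def __init__(self, releases: Dict[Tuple[str, str], Dict[str, bytes]]) -> None:
        self.releases = releases

    def fetch(self, egg: str, version: str) -> Dict[str, bytes]:
        return dict(self.releases[(egg, version)])


def serve_egg(backend: Backend, egg: str, version: str) -> Dict[str, bytes]:
    """Entry point seen by the client; it never learns which back-end answered."""
    return backend.fetch(egg, version)
```

Because the client only ever talks to something like serve_egg, swapping one back-end for another changes nothing on its side.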

The new system

In 2009, CHICKEN core was moved into a Git repository, as it looked like Git was winning the DVCS wars. New users were often complaining about having to use crusty old Subversion. By this time, some people even used DVCSes exclusively and only synchronised their work to the svn repo. This meant it was no longer the "canonical" repository for all eggs. It was becoming nothing but a hassle for those who preferred other VCSes.

Another problem was that we still had a maintenance burden: commit access on the svn repo is centrally managed, through one big mod_authz_svn configuration file, listing which users have access to which "sub-repositories". If someone wants to grant commit access to another developer, this has to be requested via the mailing list or the server's maintainer.

Requirements

To solve these problems, we started to consider new ways to allow users to develop their eggs using their favorite VCS. The new system had a few strict requirements:

It had to be completely backwards-compatible. No changes should be made to CHICKEN core. New eggs published through this system should be available to older CHICKENs, too.
It had to be completely VCS-independent. We want to avoid extra work when the next VCS fad comes along. Furthermore, it should work with all popular code hosters, for maximum freedom of choice. Self-hosting should explicitly be an option.
The existing workflow of egg authors should not fundamentally change; especially the release procedure of making a tag should stay.
There should be a way to avoid broken links if someone takes down their repo.
Most of all, the system had to be simple.

A simple solution

The simplest way to make the distribution system VCS-independent is to ignore VCSes altogether! Instead, we download source files over HTTP and mirror them from the CHICKEN server.

This idea was rather natural: our Subversion setup had always allowed direct access to plain files over HTTP through mod_dav_svn. Most popular code hosting sites (Github, Bitbucket, Google Code etc.) also allow this, either directly or via some web repo viewer's "download raw file" link, which can be constructed from a VCS tag and file name. Also, Henrietta already supported serving eggs from a local directory tree, which meant we had to make almost no modifications to our existing tool chain.
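Constructing such a link from a tag and a file name is little more than filling in a template. The sketch below is Python for illustration; the placeholder names {egg}, {tag} and {file}, the function raw_file_url and the example.org host are made up, and real hosting sites each have their own URL scheme.

```python
from urllib.parse import quote


def raw_file_url(template: str, egg: str, tag: str, file_name: str) -> str:
    """Fill a per-host URL template with an egg name, a VCS tag and a file name.

    Each component is percent-encoded; slashes are kept in the file name
    so that files in subdirectories still resolve.
    """
    return template.format(
        egg=quote(egg, safe=""),
        tag=quote(tag, safe=""),
        file=quote(file_name, safe="/"),
    )


# Hypothetical template for a web repo viewer's "raw file" link.
TEMPLATE = "https://example.org/{egg}/raw/{tag}/{file}"
print(raw_file_url(TEMPLATE, "foo", "1.0", "tests/run.scm"))
```

The point is that only the template differs per host, so nothing in this step needs to know which VCS sits behind the URL.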

To make this work, all that's needed is:

Some daemon which periodically fetches new eggs.
A "master list" of where each egg is hosted.
For each egg, a list of released versions for that egg.
A "base URI" where the files for that release can be downloaded.
A list of files for that release (or the name of a tarball, which is equivalent).

We already had a so-called ".meta-file", which contains info about the egg (author, license, name, category etc). In an earlier incarnation of the post-commit hook this file also contained a list of the files that the egg consisted of, so it made sense to re-use this facility.

We only needed to take care of the daemon, the master egg list and a way to communicate the base URI. This was simple, and I wrote the daemon (dubbed "henrietta-cache") over a weekend during a hackathon. It really is simple and consists of just over 300 lines of (rather ugly) Scheme code. At the hackathon, Moritz helped out by moving the existing eggs to this new scheme, and testing with various hosting providers.
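To show how the pieces from the list above fit together, here is a small sketch of the bookkeeping such a daemon has to do. It is not henrietta-cache itself (which is written in Scheme); it is Python for illustration, it does no network access, and the Release type, plan_downloads and the egg/version/file cache layout are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Release:
    version: str      # e.g. the name of a VCS tag
    base_uri: str     # where the files for this release can be downloaded
    files: List[str]  # file names relative to base_uri


def plan_downloads(
    master: Dict[str, List[Release]],
    cached: Set[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return (source URL, cache path) pairs for releases not yet mirrored.

    master: egg name -> its released versions (the "master list" plus
            each egg's own release list, already fetched and parsed).
    cached: (egg, version) pairs that are already in the local mirror.
    """
    plan = []
    for egg, releases in sorted(master.items()):
        for rel in releases:
            if (egg, rel.version) in cached:
                continue  # released files are expected not to change
            for name in rel.files:
                url = rel.base_uri.rstrip("/") + "/" + name
                plan.append((url, f"{egg}/{rel.version}/{name}"))
    return plan
```

A periodic job would run something like this, download each URL to its cache path, and leave everything else to the server that already knows how to serve a local directory tree.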

But not the simplest solution

The clever reader has probably already noted that the setup could be simplified by putting the henrietta-cache logic into the client program. We chose not to do this because it would break two requirements: that of backwards compatibility and that of avoiding broken links.

Strictly speaking, the backwards compatibility problem could be solved by embedding the functionality into chicken-install and eventually removing henrietta-cache from the server.

Broken links are a bigger problem, though. Currently, if a repo becomes unavailable, this is no problem; we still have a cached copy on our servers. Even if the repo goes offline forever and nobody has a copy of it anymore, we can still import the cached files into a fresh repo and take over maintenance from there.

Some incremental improvements

Unfortunately, the new system made it easier for Github and Bitbucket users than for CHICKEN Subversion users to maintain their eggs, because these sites allow tarball downloads, while the Subversion users had to list each file in their egg in the meta file. Under the old system this was not required, because it simply offered the entire svn egg directory for download.

After some people complained about having to do this extra step, I wrote another simple "helper" egg with the tongue-in-cheek name "pseudo-meta-egg-info". This is a small (80 lines) Spiffy web application which can generate "pseudo" meta files containing a full list of all the files in a Subversion subdirectory, and a list of all the tags available. This all happens on-the-fly, which means that egg authors could now revert to their old workflow of simply tagging their egg to make a release!
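The core of such a helper is just a recursive file listing. The sketch below is Python for illustration and walks a local directory, whereas the real helper is a Spiffy application that works against Subversion. The function names are hypothetical, and rendering the result as a (files ...) form is my assumption about what the generated meta file contains.

```python
from pathlib import Path
from typing import List


def list_files(egg_dir: str) -> List[str]:
    """Return every file below egg_dir as a sorted, slash-separated relative path."""
    base = Path(egg_dir)
    return sorted(p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file())


def files_clause(names: List[str]) -> str:
    """Render file names as an s-expression, escaping backslashes and quotes."""
    quoted = ['"' + n.replace("\\", "\\\\").replace('"', '\\"') + '"' for n in names]
    return "(" + " ".join(["files"] + quoted) + ")"
```

Since the listing is computed on request, there is nothing for the egg author to keep up to date by hand.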

Technically, this helper webapp can be extended and deployed for any hosting site, so if you decide to host your own repository it could generate the list of tags and files for that, too. CHICKEN isn't big enough to ask Google, Github or Bitbucket to run this on their servers, of course, so some helper plug-ins and shell scripts for svn, hg and git were made as well. These will generate the list of tags and file names and put them in the meta- and release-info files.

Current status

The new system has been in use for over two years (since March 29th, 2011) and it has been doing a good job, requiring only very little maintenance and few modifications after the initial release. We've already reaped the benefits of our setup: Github and Bitbucket both had several periods of downtime, during which eggs were still available, even if they were hosted there.

The following graph shows the number of available CHICKEN eggs, starting with the "release 4" branch (requires an SVG-capable browser). There's a small skew because the script I used to generate the graph only checked for existence, not whether the egg was released.

As you can see, Mercurial (hg) and Git took off almost simultaneously, but where Git is still steadily increasing in popularity, hg mostly stagnated. Subversion (svn) saw a few drops from eggs that were moved into hg/git. You'd guess that most Git users would use Github, but it turns out that Bitbucket is reasonably popular among CHICKEN users too. We also have three authors who have opted to host their own repositories. You can see this in the breakdown of today's eggs by host:

Hosting site	VCSes	# of eggs
code.call-cc.org	svn	454
github.com	git	85
bitbucket.org	hg, git	41
gitorious.org	git	5
chust.org	fossil	5
kitten-technologies.co.uk	fossil	3
code.stapelberg.de	git	1

Finally, the graph shows that people are still releasing new eggs from svn, but most new development takes place in git. And yes, there are a few eggs in Fossil, too! Bazaar is currently not listed. One possible explanation is that Loggerhead (its web viewer) does not allow easy construction of stable URLs to raw files for a particular tag (or zip file/tarball), so serving up eggs straight from a repo is not possible. Another reason could be that bzr simply isn't that popular among CHICKEN users. If you're a bzr user and would like to use this distribution scheme, please have a look at Loggerhead issues #473691 and #739022. If you know a way around this, please share your knowledge on our release instructions page.

Things to improve

Needless to say, I'm rather happy that the system satisfied all the requirements we set for it, and that it saw such uptake. The majority of newly released eggs are using one of the new systems (too bad it's Git, but I guess that's inevitable).

However, as always, there is room for improvement. The current system has a few flaws, all of which are due to the fact that henrietta-cache simply copies code off an external site:

There's no "stable" tarball per egg release. This is required for OS package managers, which usually verify with a checksum that the source package has not changed. Recently, Mario improved on this situation by providing tarballs, but these are merely tarballs of the henrietta-cache mirror on that particular server. However, these should be expected to be stable...
If an egg author moves tags around, nobody will know. Different henrietta-cache mirrors may then have an inconsistent view of the distributed repository. We have two egg mirrors, and so far this has happened once or twice. This requires some manual intervention: just blow away all the cached files and wait for it to re-synch, or trigger the synch manually.
Egg authors cannot sign their eggs; each egg is downloaded from a source that may not be trustworthy. This is tricky, especially because most people don't want to mess around with PGP keys anyway. CHICKEN core releases aren't signed either, so this isn't very high on our priority list.
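The moved-tag problem in the list above could at least be detected automatically by comparing checksums between mirrors. Nothing like this exists in the current system as far as this post is concerned; the sketch below is Python for illustration, and release_digest and diverging_releases are hypothetical names.

```python
import hashlib
from typing import Dict, List, Tuple


def release_digest(files: Dict[str, bytes]) -> str:
    """Hash one release's files in a way that does not depend on file order.

    Names and lengths are mixed into the hash so that moving bytes from
    one file to another changes the digest as well.
    """
    h = hashlib.sha256()
    for name in sorted(files):
        data = files[name]
        h.update(name.encode("utf-8") + b"\0")
        h.update(str(len(data)).encode("ascii") + b"\0")
        h.update(data)
    return h.hexdigest()


def diverging_releases(
    mirror_a: Dict[Tuple[str, str], str],
    mirror_b: Dict[Tuple[str, str], str],
) -> List[Tuple[str, str]]:
    """Return the (egg, version) pairs both mirrors hold with different digests."""
    shared = mirror_a.keys() & mirror_b.keys()
    return sorted(key for key in shared if mirror_a[key] != mirror_b[key])
```

Any pair this reports is a release whose tag was probably moved after one of the mirrors had already cached it. The same digest could double as the checksum that OS package managers want, provided the tarball is built deterministically.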

I think some of these problems are a result of "going distributed", similar to the problem that you should not rewrite history that has already been pushed.

Posted on 2013-06-04