As you may know, I co-maintain the uri-generic egg, together with Ivan Raikov. We had just been working on fixing a bug and porting it to CHICKEN 5 when I stumbled across the WHATWG URL specification, an evolution of RFC 3986. I found it hard to believe they had dropped the formal grammar from the RFC, so I checked the issue queue and found a closed ticket from 2015.

They replaced the BNF with a series of parsing steps that runs to several pages and is overly concerned with implementation-specific details.

It really got to me that such an important and basic part of the web stack is so informally specified. So I wrote an appeal, in this ticket, asking them to restore a formal grammar. I think the reasons are worth spreading more widely, so I'm reproducing it here on my blog.

My request

I would like to offer my opinion from an implementor's perspective and hopefully convince the WG to restore a formal grammar. Let me start by providing some background on where I'm coming from. Feel free to skip this next section.

My background

I am the co-maintainer of the uri-generic egg for CHICKEN Scheme. This implementation attempts to follow RFC 3986 to the letter, and this has resulted in what IMO is a very high-quality implementation (at least, as far as parsing is concerned; URL construction still has some known issues). Oftentimes when we've run into issues, we've compared it with other implementations. It turns out that many of these are lacking in some way or another. I think the main reason is that they don't really attempt to implement the formal grammar (even if they claim to be RFC compliant), while we do. We even have a growing repository of alternative implementations using different parser generators, which all pass the same test suite! (feel free to now call me a smug Lisp/Scheme weenie :) )

I wasn't aware of the WHATWG spec until I saw it mentioned in a libcurl post. It piqued my interest because I'm always looking for more test cases. The web platform test suite looks like a big, juicy set to start using in our egg's tests. I'd also consider implementing the WHATWG spec if this increases compatibility with other implementations.

What I expect from a spec

As an implementor, I routinely check the RFC's ABNF as a guide to determine what a valid URL should look like. If someone finds a certain URL our implementation doesn't parse, or if it parses a URL that it shouldn't, the first thing I do is go back to the ABNF in the RFC to verify the behaviour. It is compact and to the point, and for a trained eye it is trivial to quickly determine whether a parser should accept a given (sub)string or not.
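To give an idea of what that looks like in practice, here is a tiny sketch (in Python, purely for illustration; it is not how our egg works) of two RFC 3986 rules transcribed into patterns. Checking whether a candidate substring is acceptable is then a single match, with no multi-page algorithm to trace through:

    import re

    # RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
    SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")
    # RFC 3986: port = *DIGIT
    PORT = re.compile(r"[0-9]*")

    print(SCHEME.fullmatch("http") is not None)   # True
    print(SCHEME.fullmatch("3ttp") is not None)   # False: a scheme must start with ALPHA
    print(PORT.fullmatch("8080") is not None)     # True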

The collected ABNF of RFC 3986 is a brief three screenfuls. In contrast, the algorithm in the WHATWG spec is roughly eighteen screenfuls. It is an overly detailed and nonstandard way of defining a grammar. This makes it harder to determine which language the algorithm accepts, and it also makes it hard for me to determine what has changed compared to the RFC. Implementing the WHATWG spec would (for me) involve a complete rewrite.

The specification is so focused on the mechanics of one particular manual parsing technique that it all but precludes the use of parser generators or other implementation approaches. Parser generators have a long tradition in theory and practice, and can generate efficient language recognisers. Even today this is an active research field; PEG grammars, for example, were "discovered" as recently as 2004.
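As a purely hypothetical sketch of what that buys you (using Python's pyparsing combinator library here as a stand-in for any parser generator), a simplified version of the RFC's authority rule transcribes almost one-to-one into a grammar, and the tool derives the recogniser for you:

    from pyparsing import Word, alphanums, nums, Optional, Suppress

    # Simplified from RFC 3986 (pct-encoded, sub-delims and IP literals omitted):
    #   authority = [ userinfo "@" ] host [ ":" port ]
    unreserved = alphanums + "-._~"
    userinfo   = Word(unreserved)
    host       = Word(unreserved)
    port       = Word(nums)
    authority  = (Optional(userinfo + Suppress("@"))
                  + host("host")
                  + Optional(Suppress(":") + port("port")))

    result = authority.parseString("user@example.com:8080", parseAll=True)
    print(result["host"], result["port"])   # example.com 8080

The grammar rules stay recognisable as grammar rules; with the WHATWG algorithm there is nothing comparable to transcribe.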

The way I think about it is that the purpose of this spec is to define what a URL "officially" looks like. So, as an implementor, I don't understand the hesitation to supply a formal grammar. Not having one will likely result in different people interpreting the spec differently. This results in _less_ interoperability, which defeats the point of a spec.

Other reasons why I think a formal grammar is important

Finally, I would like to emphasise the importance of parsers based on formal grammars over ad hoc ones, for security reasons. Let's say you have a pipeline of multiple processors which use different URL parsers. For example, you might have an HTML parser on a comment form which cleans URLs by dropping JavaScript and data URLs, among other things, or a mail client which blocks intranet or file system-local URLs before invoking an HTML viewer. If these are all ad hoc "informal" parsers that try to "fix" syntactically invalid URLs, it is nigh-impossible to verify that the filtering for "safe" URLs is correct. That's because it's impossible to decide which language is really accepted by an ad hoc implementation. An implementation further down the stack might interpret a URL (radically) differently from one further up the stack, and you have a nice little exploit in the making.
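To illustrate the point (with deliberately toy, made-up parsers; no real library is being picked on here), consider two forgiving host extractors that disagree about a URL which no formal grammar would accept, because they handle the backslash differently:

    # Toy "ad hoc" parsers: both try to be forgiving about invalid input,
    # but they disagree about where the authority component ends.

    def host_seen_by_filter(url):
        # Treats "\" like "/" (as some lenient parsers do), so the
        # authority ends at the first "/", "\" or "?".
        authority = url.split("//", 1)[1]
        for sep in ("/", "\\", "?"):
            authority = authority.split(sep, 1)[0]
        return authority.rsplit("@", 1)[-1]

    def host_seen_by_fetcher(url):
        # Treats "\" as an ordinary character: the authority runs to the
        # first "/", and the host is whatever follows the last "@".
        authority = url.split("//", 1)[1].split("/", 1)[0]
        return authority.rsplit("@", 1)[-1]

    url = "http://public.example\\@internal.example/secret"
    print(host_seen_by_filter(url))   # public.example
    print(host_seen_by_fetcher(url))  # internal.example

The filter happily approves the request because it believes the host is public.example, while the component that actually fetches the URL connects to internal.example. With a shared formal grammar, both would simply have rejected the input.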

If you're not convinced by my measly attempts at explaining this idea, please watch the talk "The Science of Insecurity". Meredith Patterson states the case much more eloquently than I ever could. This talk was an absolute eye-opener for me.

With this context, it baffled me to read the statement that "there are several large parts of the spec that cannot be captured by any kind of grammar". This is literally equivalent to saying "we can't know whether a URL is valid without evaluating the algorithm". This means cheerfully dragging the halting problem into what should be a simple, straightforward notation (come on, URLs aren't that ill-defined!). As far as I can tell, the RFC defines a regular grammar, which means a simple finite state machine (or regular expression) is enough to decide whether a string is a valid URL, with no unbounded computation at all. The decision to go from a regular to an unrestricted grammar should not be taken lightly!