Today I'd like to discuss Scheme Regular Expressions (SRE). Originally introduced in a library for the Scheme Shell, this DSL has recently been gaining some popularity due to the release of Irregex, a pure R5RS Scheme regex engine with SRE as its native syntax. Irregex has been integrated as the core regex system in Chicken Scheme and Jazz Scheme, and you can easily use it from any other Scheme due to its portability.

Back in 1998, the author of the first SRE implementation (Olin Shivers, one of the funniest Schemers around) posted an announcement to several newsgroups about this new syntax. It's well worth a read; especially the preamble about 100% and 80% solutions is a very inspiring call to arms which provides a good insight into the Scheme way of thinking. By the way, if you liked this, you'll also want to read the classic essay "Worse Is Better", if you haven't already. The bit about "The Right Way" is especially relevant. Consider yourself warned, Schemer!

Figuring out the rules

The "Discussion and design notes" section from the announcement is particularly interesting as it discusses the DSL from a point of view similar to this series of blog posts. It's also the largest section, so we'll just touch upon the important points here. The first thing that really jumps out is the fact that the author has taken a look at many different regex packages for various languages, and even asked Tom Lord and Henry Spencer (both wrote their own regex engines) about obscure details. Doing this kind of in-depth research is a great way to get started when designing a DSL since it provides you with a nicely broad perspective on the various viewpoints of others who went before you. This will reduce your "blindness to complexity". Initially every target language seems simple but there are always pitfalls which, if overlooked, would result in a DSL that's hard to extend or doesn't provide all the features of the target language which a user would need. By looking at other implementations you see how they deal with the more complex nooks and crannies of the target language.

The other main point is that he drew a very clear line in the sand of what features would go into SRE and which wouldn't. The SRE syntax doesn't support any "extended regex" features which would force a particular implementation strategy. This makes the SRE syntax independent of the underlying regex engine, which allows for greater portability and generality, but more importantly, it leaves open the possibility of efficient implementations. This was misunderstood by many people; he had to educate Richard Stallman about why supporting back references in the general SRE syntax is a bad idea:

 My feeling about back references is as follows. Regexps are based on a deep
 theory -- regular sets and DFA's -- that has tremendous implications about
 the operations you can perform on them and the ways you can implement them.
 Back-references completely shatter this framework. They rule out certain
 extremely efficient implementations. They rule out certain operations. They
 have nothing to do with the idea of a "regular expression." They are not one
 with the deep structure of the system.

Repeat that in your mind: They are not one with the deep structure of the system. DSL design notes don't get any more philosophical than that! This points right to the core design principle of SRE (and regexes in general). If you are designing a DSL, you can consider yourself very lucky when you find a guiding principle which is that strong. You should let it inform all your design choices because it will help you achieve a good, cohesive design. This also makes it easier to defend your choices when users start complaining about missing features...

Representational issues

After my SCSS post, I was asked why you'd want to represent CSS using first-class values. I think the SXML DSL example illustrates how powerful a first-class representation can be, but I must admit, I don't see many valid use-cases for "first-classing" a CSS DSL.

However, one important lesson in programming is that you never know what clever things people are going to do with your code; clever things you only wish you thought about. You should see first-class values as an enabler for other people to take your DSL's usefulness to new heights. The "Prime Clingerism" applies to DSLs as well; without a first-class representation, additional features will appear necessary to perform useful operations.

One interesting aspect of the design of the original SRE library for SCSH is how it deals with first-class regexps. It contains a large set of procedures to manipulate the underlying regular expression ADT (abstract data type). Olin believed a separate ADT was required for easy manipulation of regex objects, and it would also allow extension of the supported operator set. Directly operating on the SRE expressions would be harder for programmatic extensions. This distinction allows for a baroque but user-friendly SRE syntax in which it is possible to write one thing in many different ways, while also offering ease of manipulation from user code.

Olin is quick to point out that this does cause massive complexity in the code (a point also raised by Richard Stallman), but says the work is done now and anyone is free to take his code and re-use it (this fits the "100% solution" ideology mentioned at the beginning). This ADT approach is comparable to how Lisp/Scheme compilers internally rewrite the full language to a simpler to manipulate "core language", so it isn't completely unique to SRE.

At first glance, this seems a little hard to defend, especially the fact that there's also a seemingly unnecessary rx macro in his design, while Irregex gets by fine without these. I've asked Alex Shinn, the author of Irregex, about this, and he mentioned that the macro and the ADT were needed in SCSH because it depended on an underlying POSIX regex engine rather than implementing it natively in Scheme. SCSH first reads SREs at compile-time and the macro tries to compile down to the ADT as much as possible. Then, at run-time, this ADT is converted to a POSIX regex string which is compiled by the underlying C regex engine.

Because Irregex is written natively in Scheme, this extra step is not necessary and Alex decided to get rid of these distractions and implement only the SRE syntax, without the rx macro and ADT. The result is a very compact implementation, having about the same size as the SCSH package, but including a full matching engine! As far as I know there isn't any widely-used extension for SCSH which makes use of the ADT interface, and the SRE syntax hasn't been extended as much as Olin foresaw might happen. For these reasons, the choice to drop all that complexity seems like a wise one. However, only time will tell whether that's really the case.

Wrapping up

Let's see what we have learned from the design of SRE:

  • Do your research! Inspect as many libraries and DSLs as possible to gain a broad perspective and avoid "blindness to complexity". If in doubt, ask a domain guru.
  • Relentlessly strip all features that preclude efficient implementation strategies. When users request them back, resist!
  • Design for extensibility and programmability; strive to support a first-class representation.
  • When things have settled down, re-evaluate the design and drop unnecessary features.
  • You don't always need nitty-gritty details from code examples to analyse a DSL :)

I admit, some of these rules are not for the faint of heart, but they make for a very strong and coherent DSL which might see wider adoption than just your initial implementation.