Designing Lispy DSLs, part 2: SXML

After last time's example of SCSS, I'd like to take a look at SXML, another Lispy DSL I'm using in this blog. It's more successful and more widely-used than SCSS and even has an official specification!

The observation that XML is really an obnoxiously verbose Lisp without parens is common, but the details are (of course) hairier than that. Let's look at an XHTML example:

<div>
  <span>Hello, <strong>dear</strong> friends.</span>
  <span>This is a &lt;simple&gt; example.</span>
</div>

Converting this XHTML fragment to an S-expression is straightforward:

'(div
   (span "Hello, " (strong "dear") " friends.")
   (span "This is a <simple> example."))

It's a bit more cumbersome to type because you have to break up the strings for the "strong" element, but aside from that it's simpler, shorter, and less error-prone; "special" characters can be written as-is since they are automatically escaped when the XML document is written. Especially when dealing with large templates and generated content it can be a big time-saver to represent XML as S-expressions; doubly so if you're using paredit. Plus, Scheme is your templating language, and Lisps are rather good at processing lists :)
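To make the "Scheme is your templating language" point concrete, here is a minimal sketch that generates elements with plain quasiquote and map. The names friends, greeting and page are made up for this illustration, and no SXML library is needed to build the tree:

```scheme
;; Hypothetical input data; any list of strings would do.
(define friends '("Alice" "Bob <bob@example.org>"))

;; Build one (span ...) element per name using quasiquote and unquote.
;; The name is inserted as-is; escaping is left to the serializer.
(define (greeting name)
  `(span "Hello, " (strong ,name) "."))

;; Splice the generated elements into a surrounding div.
(define page
  `(div ,@(map greeting friends)))
```

Because the template is ordinary list data, "loops" and "includes" are just map and function calls; there is no separate template syntax to learn.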

You might be wondering about XML attributes; S-expressions don't have anything that maps naturally to these. Some XML-in-Lisp variants use keywords for attributes; others use alternating symbols and strings. SXML takes the more interesting position that attributes in XML were a mistake and that there should only be elements. To compensate, it uses a tag name that can't exist in XML (the "@" sign), with the convention that this element may appear as the first child of any element. Its child element names represent attribute names, and their text contents represent the values. The "@" sign is particularly well chosen because the W3C also uses it elsewhere to indicate attributes (e.g. in XPath and XSLT).

<div id="welcome" class="section">
  <span>Hello, <strong class="affectionately">dear</strong> friends.</span>
  <span>This is a &lt;simple&gt; example.</span>
</div>

becomes:

'(div (@ (id "welcome") (class "section"))
  (span "Hello, " (strong (@ (class "affectionately")) "dear") " friends.")
  (span "This is a <simple> example."))
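Since attributes are just a nested list under "@", looking one up takes nothing more than assq. The following is a rough sketch in plain Scheme; element-attributes and attribute-ref are hypothetical helper names, not procedures from the SSAX tools, and the sketch assumes the (@ ...) list is the first child when it is present:

```scheme
;; Return the attribute list of an SXML element, or '() if it has none.
;; Assumes the (@ ...) list, when present, is the first child.
(define (element-attributes elem)
  (if (and (pair? (cdr elem))
           (pair? (cadr elem))
           (eq? (car (cadr elem)) '@))
      (cdr (cadr elem))
      '()))

;; Look up one attribute value by name; #f when it is absent.
(define (attribute-ref elem name)
  (let ((entry (assq name (element-attributes elem))))
    (and entry (pair? (cdr entry)) (cadr entry))))

(define welcome
  '(div (@ (id "welcome") (class "section"))
     (span "Hello, " (strong (@ (class "affectionately")) "dear") " friends.")))

(attribute-ref welcome 'id)     ; => "welcome"
(attribute-ref welcome 'title)  ; => #f
```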

Let's look at what makes SXML such a good DSL. First, XML has a hierarchical structure, which maps well to S-expressions. It is built up out of only a handful of atoms: it has start tags with attributes, end tags, entities, and textual content in between. In SXML, tag names are mapped to symbols, which can represent any string, so this naturally extends to all possible XML tag names.

When building websites, the fact that regular HTML is less strict than XML is irrelevant; you don't need features like, say, omitting an end tag. In fact, end tags don't even exist in SXML. It models the underlying concept of elements rather than tags, and treats tags as artifacts of the serialized textual representation of an element. S-expressions can be seen as an alternative serialization of the same document described by the "angle brackets and tags" notation.

This is another important aspect of good DSLs; they tend to ignore surface syntax. Instead, they map the underlying tree-like structure to S-expressions. SXML uses elements instead of letting itself get distracted by tags, and it generalizes attributes to fit the tree structure. By representing the structure in S-expressions, you know what parts need to be "escaped" in order to preserve this structure. When writing SXML to XML, all strings in an SXML document get their angle brackets < and > converted to &lt; and &gt; (and ampersands to &amp;). The only angle brackets ending up in the output are those that result from serialization of elements to start/end tags. When reading XML, all entities are automatically converted to the characters they represent, so in Scheme you get to work directly with the text contents at the conceptual level. CDATA sections are also eliminated; they are simply represented by their string value.
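The escaping rule is easy to see in a toy serializer. This is only a sketch in plain Scheme, not how sxml-serializer is implemented: escape-text and node->xml-string are hypothetical names, and the sketch assumes elements without attributes, namespaces or pseudo-elements.

```scheme
;; Escape the characters that would otherwise be read as markup.
(define (escape-text str)
  (apply string-append
         (map (lambda (c)
                (case c
                  ((#\<) "&lt;")
                  ((#\>) "&gt;")
                  ((#\&) "&amp;")
                  (else (string c))))
              (string->list str))))

;; Serialize an attribute-free SXML node: strings are escaped, and
;; elements become a start tag, their serialized children and an end tag.
(define (node->xml-string node)
  (if (string? node)
      (escape-text node)
      (let ((name (symbol->string (car node))))
        (string-append
         "<" name ">"
         (apply string-append (map node->xml-string (cdr node)))
         "</" name ">"))))

(node->xml-string '(span "This is a <simple> example."))
;; => "<span>This is a &lt;simple&gt; example.</span>"
```

The serializer is the only place that ever emits a literal angle bracket, so content can never be mistaken for markup.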

Some complications

XML isn't as simple as you'd think at first glance. Remember the same observation about CSS? This is a common theme with web technology. Don't even get me started about HTTP! In the words of Oleg Kiselyov, author of the SXML specification and many tools in the SSAX project:

There exists a myth that parsing of XML is easy. An article "Parsing XML"
in the January 2000 issue of Dr.Dobb's Journal states the ease of parsing
as an alleged fact. The author of that article must have overlooked that
there is more to XML than the grammar presented in the XML Recommendation.
There are attribute normalization rules, well-formedness constraints, let
alone validation constraints. XML Namespaces add another layer of complexity.

You can almost hear his frustration... Here's an example to illustrate some things that so far we have glossed over. This isn't a fragment, but a full XML document (with thanks to Jim Ursetto):

<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml">
        <p>I'm invincible!</p>
      </div>
    </content>
  </entry>
</feed>

There are two new things to notice. First, the document starts with a "special" syntax to indicate that we're using XML version 1.0. It has the same syntax as the so-called processing instructions, which provide a generic way of passing (meta-)information to the application, outside the XML document itself. (Strictly speaking, the XML specification calls this particular line the XML declaration rather than a processing instruction, but as we'll see, SXML represents it as one.) Second, the Atom feed in this example holds an XHTML document fragment as a sub-document. It uses a namespace declaration to indicate that the div and p tags belong to a different XML vocabulary than the main document.

Let's see how SXML deals with these new concepts. This document can be represented in several ways, but the following is arguably the simplest:

(*TOP* (@ (*NAMESPACES* (atom "http://www.w3.org/2005/Atom")
                        (xhtml "http://www.w3.org/1999/xhtml")))
  (*PI* xml "version=\"1.0\"")
  (atom:feed
    (atom:entry
      (atom:content (@ (type "xhtml"))
        (xhtml:div (xhtml:p "I'm invincible!"))))))

If you look carefully at the original XML document, you'll see that while feed is the document's root node, it still has a sibling: the processing instruction! There's a "virtual" root element that holds these two, which XML calls the "document entity". SXML generalizes this to an element called *TOP* (for "top-level"). The *NAMESPACES* element (an "attribute" of *TOP*) stores an association list of element name prefixes used to indicate a namespace. The *PI* element is SXML's way of representing the processing instruction (its version "attribute" isn't parsed because it isn't a true attribute in terms of XML; it just looks like one. Don't ask...).

All three pseudo-elements are represented by symbols that are invalid XML tag names due to the asterisk. Just like the @ for attributes, this ensures that they can't possibly clash with any tag name that might occur in a particular document type or future versions of XML.
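One practical consequence of *TOP* is that the root element is not necessarily the first child of the document. A small sketch in plain Scheme of how you could find it: document-root and special-names are hypothetical names for this illustration, and the list of reserved names is taken from the SXML specification, which also defines *COMMENT* and *ENTITY*.

```scheme
;; Names that SXML reserves for non-element nodes directly under *TOP*.
(define special-names '(@ *PI* *COMMENT* *ENTITY*))

;; Return the first child of a *TOP* node that is a real element,
;; or #f when there is none.
(define (document-root doc)
  (let loop ((children (cdr doc)))
    (cond ((null? children) #f)
          ((and (pair? (car children))
                (not (memq (car (car children)) special-names)))
           (car children))
          (else (loop (cdr children))))))

(define feed
  '(*TOP* (@ (*NAMESPACES* (atom "http://www.w3.org/2005/Atom")))
     (*PI* xml "version=\"1.0\"")
     (atom:feed (atom:entry))))

(car (document-root feed))  ; => atom:feed
```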

One disadvantage of encoding namespaces as part of the tag name is that you can't see what namespace a particular element belongs to without first converting the symbol to a string, and then splitting it at the colon. This means namespaces aren't really first class. Most likely this is the case because namespaces were added at a later stage when most of the SXML syntax was already set in stone, and modifying it in some other way to support namespaces would be too invasive.
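For illustration, this is roughly what that string conversion and splitting looks like in plain Scheme. The names last-colon and split-name are hypothetical. The sketch splits at the last colon rather than the first, on the assumption that the namespace part may itself contain colons, as it does when a full namespace URI is used instead of a short prefix.

```scheme
;; Index of the last colon in str, or #f if there is none.
(define (last-colon str)
  (let loop ((i (- (string-length str) 1)))
    (cond ((< i 0) #f)
          ((char=? (string-ref str i) #\:) i)
          (else (loop (- i 1))))))

;; Split a qualified name into (prefix . local-name), both symbols;
;; the prefix is #f for unqualified names.
(define (split-name sym)
  (let* ((str (symbol->string sym))
         (i (last-colon str)))
    (if i
        (cons (string->symbol (substring str 0 i))
              (string->symbol (substring str (+ i 1) (string-length str))))
        (cons #f sym))))

(split-name 'atom:feed)  ; => (atom . feed)
(split-name 'div)        ; => (#f . div)
```

Having to do this for every element you inspect is exactly the cost of not having first-class namespaces.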

Tool set

You only realize the absolute brilliance of SXML when you look at its tool set. The XML ecosystem is an entire zoo of mini-languages. Most of these languages (for example XPath, XLink, XSLT) have some kind of corresponding DSL in the SSAX project. This makes it a complete toolbox for anyone working with XML. Sure, the documentation is incoherent and a little on the "academic" side, and the SSAX SourceForge project is a random collection of loosely-related tools that aren't exactly idiomatic Scheme (if there even is such a thing), but go ahead and compare it with tools in other languages.

Most "stock" XML libraries are awkward contraptions. Usually they expose highly verbose object-oriented APIs based on the W3C DOM specification, where constructing even a small tree takes several lines of code. It's so awkward that many programmers will tend to prefer generating the XML manually, by writing out strings.

Dynamic languages tend to do better. First, in Perl, there's XML::Simple. This is a little awkward due to Perl's hash and array syntax, but other than that it is a lot like SXML. However, this library is deprecated in favor of one of those awkward OO libraries, XML::LibXML.

Ruby and Python have convenient "builder" objects which can really speed up generation of XML, but as the name says, these are for building. The format in which you build isn't directly the first-class representation, which makes the API slightly disparate. For both languages, these are not the default libraries either, which makes them less likely to be used by people who want to minimize dependencies.

Finally, even though it's a very static language, Haskell has a pretty good builder-like library too, which seems quite popular. If you ever need to generate XML (or "just" HTML) with one of these languages, do yourself a favor and use one of these libraries.

But back to SXML; let's see how you'd read in a document, manipulate it, and write it back out again. For a change, the code presented is a complete (Chicken) Scheme program. This will read one of the first XML documents in this post, change it, and send it to standard output.

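;; First, run the following from the shell to ensure this program will work:
;; $ chicken-install ssax sxml-modifications sxml-serializer
;; (This is CHICKEN 4 syntax; CHICKEN 5 replaced `use` with `import`.)
(use ssax sxml-modifications sxml-serializer)

;; Parse the XML on standard input into SXML; '() means no namespace prefixes.
(define doc (ssax:xml->sxml (current-input-port) '()))

;; Build a procedure that applies the four modifications in order.
(define change
  (sxml-modify
    `("div/@id" replace (id "good-bye"))
    `("../span[1]/text()[1]" replace "Goodbye, ")
    `("../following::*" replace (span "This was more " (em "complex")))
    `("self::*" insert-into ", don't you think?")))

;; Apply the changes and write the result to standard output as XML.
(serialize-sxml (change doc) output: (current-output-port))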

The arguments to sxml-modify comprise a mini-DSL, representing actions to take on the XML document. Each action is a list; first an XPath expression which selects node(s) from the document, then the name of the action to perform, followed by the value to use for this action. Each action is executed in sequence, so the XPath expression is relative to the previous action's node set, a little like how chaining works in the excellent jQuery library. Actually, I think there are still some lessons in convenience to learn from jQuery, but that's a different story.

Let's invoke the program and see what happens:

$ cat welcome.xml
<div id="welcome" class="section">
 <span>Hello, <strong class="affectionately">dear</strong>friends.</span>
 <span>This is a &lt;simple&gt; example.</span>
</div>
$ csi -s convert.scm < welcome.xml
<div id="good-bye" class="section">
 <span>Goodbye, <strong class="affectionately">dear</strong>friends.</span>
 <span>This was more <em>complex</em>, don't you think?</span>
</div>

Not too shabby for a 10-line program!

Unfortunately, there's a catch. Originally, I planned to use the Atom feed from the previous section as input, but it turns out that the modifications sub-language doesn't support passing a namespace map to the underlying XPath library. I was also unable to use an sxpath expression instead of a standard string XPath expression. This could be a matter of bad documentation (the docs for SXML modifications are pretty sparse), or perhaps a genuine lack of support for namespaces. A quick look at the source seems to confirm the latter. The lack of support for sxpath expressions is also serious and indicates how "random" the selection of tools in SSAX really is; some of these tools don't even support each other! Luckily, neither limitation looks fundamental, and both could probably be addressed by a (small?) change in the tools.

I mentioned earlier that the SSAX project is a loose collection of tools with incoherent documentation. My failure to figure out how to combine namespaces with SXML modifications, or how to use the sxpath DSL from the "SXML modifications" DSL, helps point out the importance of a good, robust, and well-documented tool set. This may well be more important than a good DSL; if nobody can use your DSL, it might just as well not exist.

Wrapping up

The following rules can be distilled from the SXML design:

Do not slavishly translate surface syntax to S-expressions, but model the structure.
Eliminate or generalize all features that are strictly unnecessary.
When generalizations demand new names, pick ones that are invalid in the source language, but try to borrow familiar conventions from the domain.
When generating output, ensure structural integrity by escaping all content.
People will avoid clumsy DSLs, to the point of falling back on string manipulation.
No matter how well-designed your DSL is, it needs good tools and documentation.
DSLs within the same domain should be mutually supportive.

There are still many aspects of XML we barely touched upon. However, this post is already long enough, and my knowledge of XML (and SXML) only goes so far, so we won't go into more detail. Of course, you can always dig in and find out more yourself; there are plenty of links in this document you can use to study the subjects.

Posted on 2012-08-05 on More magic.