Recently there was a small flame war on the Chicken-hackers mailing list. A user new to Scheme asked an innocuous question that drew some heated responses:
Is there a good reason for this behavior?

# perl -e 'print substr("ciao",0,10);'
ciao
# ruby -e 'puts "ciao"[0..10]'
ciao
# python -c 'print "ciao"[0:10];'
ciao
# csi -e '(print (substring "ciao" 0 10))'
Error: (substring) out of range 0 10
Some popular dynamic languages have a generic "slice" operator that allows the user to supply an end index beyond the end of the object; it simply returns everything from the start position up to the end. Chicken (and most other Schemes) will instead raise an error.
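The contrast can be sketched in Python. Here strict_substring is a hypothetical helper (not part of any library) that mimics the Scheme behaviour by validating bounds instead of clamping them:

```python
def strict_substring(s, start, end):
    # Scheme-like behaviour: out-of-range indices signal an error
    # instead of being silently clamped to the string's length.
    if not (0 <= start <= end <= len(s)):
        raise IndexError("(substring) out of range %d %d" % (start, end))
    return s[start:end]

print("ciao"[0:10])    # Python's built-in slice clamps silently: ciao

try:
    strict_substring("ciao", 0, 10)
except IndexError as e:
    print("Error:", e)  # Error: (substring) out of range 0 10
```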
On the list, I argued that taking characters 0 through 10 from a 3-character string makes no bloody sense, which is why it's signalling an error. For the record: this can be caught by an exception handler, which makes it a controlled error situation, not a "crash".
Our new user retorted that it's perfectly sane to define the substring procedure as:
Return a string consisting of the characters between the start position and the end position, or the end of the string, whichever comes first.
I think this is a needlessly complex definition. It breaks the rule "do one thing and do it well", from which UNIX derives its power: Conceptually crisp components ease composition.
One of the most valuable things a programming language can offer is the ability to reason about code with a minimum of extra information. This is also why most Schemers prefer a functional programming style; it's easier to reason about referentially transparent code. Let's see what useful facts we can infer from a single (substring s 0 10) call:
- The variable s is a string.
- The string s is at least 10 characters long.
- The returned value is a string.
- The returned string is exactly 10 characters long.
If either of the preconditions doesn't hold, it's an error situation and the code following the substring call will not be executed. The above guarantees also mean, for example, that if later you see (string-ref s 8) this will always return a character. In "sloppier" languages, you lose several of these important footholds. This means you can't reason so well about your code's correctness anymore, except by reading back and dragging in more context.
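In Python terms (a toy illustration, not anyone's API), the clamping slice leaves you with strictly weaker facts to reason from:

```python
s = "ciao"
t = s[0:10]  # with clamping semantics this can never fail...

# ...so its success tells us nothing about len(s), and the length of
# the result cannot be known without dragging in the value of s:
assert len(t) == min(10, len(s))

# Under strict semantics, s[0:10] succeeding would have guaranteed
# both len(s) >= 10 and len(t) == 10 -- facts later code could rely on.
```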
Finally, it is also harder to build the "simple" version of substring on top of the complex one than it is to build the complex one as a "convenience" layer on top of the simpler one. On our list it was quickly shown that it's trivial to do so:
(define (substring/n s start n)
  (let* ((start (min start (string-length s)))
         (end (min (string-length s) (+ start n))))
    (substring s start end)))

;; Easy to use and re-use:
(substring/n "ciao" 1 10) => "iao"
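For readers following along in Python rather than Scheme, the same convenience layer can be built identically on top of the strict primitive (substring_n is my own name for this sketch):

```python
def substring_n(s, start, n):
    # Clamp the start and the length first, then take an exact slice;
    # the underlying "strict" extraction never sees bad indices.
    start = min(start, len(s))
    end = min(len(s), start + n)
    return s[start:end]

print(substring_n("ciao", 1, 10))  # -> iao
```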
There's even an egg for Chicken called slice which provides a generic procedure that behaves like the ranged index operator in Python and Ruby.
A tangential rant on the hidden costs of sloppiness
The difference in behaviour between these languages is not a coincidence: it's a result of deep cultural differences. The Scheme culture (and in some respects the broader Lisp culture) is one that tends to prefer correctness and precision. This appears in many forms, from Shivers' "100% correct solution" manifesto to Gabriel's Worse Is Better essay and all the verbiage dedicated to correct "hygienic" treatment of macros.
In contrast, some cultures prefer lax and "do what I mean" over rigid and predictable behaviour. This may be more convenient for writing quick one-off scripts, but in my opinion this is just asking for trouble when writing serious programs.
Let's investigate some examples of the consequences of this "lax" attitude. You're probably aware of the recent discovery of several vulnerabilities in Ruby on Rails. Two of these allowed remote code execution simply by submitting a POST request to any Rails application. As this post explains, the parser for XML requests was "enhanced" to automatically parse embedded YAML documents (which can contain arbitrary code). My position is that YAML has absolutely nothing to do with XML (or JSON), which means that if a program wants to parse YAML embedded in XML it must do that itself, or at least explicitly specify it wants automatic type conversions in XML/JSON documents. The Rails developers allowed misplaced convenience and sloppiness to trump precision and correctness, to the point that nobody knew what their code really did.
Some more fun can be had by looking at the MySQL database and how it mangles data. The PostgreSQL culture also strongly prefers correctness and precision, whereas MySQL's culture is more lax. The clash between these two cultures can be seen in a thread on the PostgreSQL mailing list where someone posted a video comparing PostgreSQL's and MySQL's behaviour. These cultural differences run deep, as you can tell by the responses of shock. And again, the lax behaviour of MySQL has security implications. The Rails folks have discovered that common practices might allow attackers to abuse MySQL's type coercion. Because Rails supports passing typed data in queries, it's possible to force an integer into a condition that expects a string. MySQL will silently coerce non-numerical strings to zero:
SELECT * FROM `users` WHERE `login_token` = 0 LIMIT 1;
This will match the first record (which usually just happens to be the administrative account). Just as with the innocent little substring behaviour we started our journey with, it is possible to work around this, but things would be a lot easier if the software behaved more rigidly and strictly, so that this kind of conversion would only be done at the explicit request of the programmer.
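A toy Python model (deliberately simplified; MySQL's real coercion rules are more involved) shows why the comparison above matches rows it shouldn't:

```python
import re

def lax_numeric_coercion(value):
    # Roughly mimic MySQL's lax comparison rules: take any leading
    # numeric prefix of the string, and fall back to 0 otherwise.
    m = re.match(r"\s*-?\d+(\.\d+)?", str(value))
    return float(m.group(0)) if m else 0.0

users = [{"id": 1, "login_token": "admin-token-abc"},
         {"id": 2, "login_token": "user-token-xyz"}]

# WHERE `login_token` = 0 under lax coercion: every non-numeric token
# coerces to 0, so the integer comparison matches *all* of these rows.
matches = [u for u in users if lax_numeric_coercion(u["login_token"]) == 0]
print([u["id"] for u in matches])  # -> [1, 2]
```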
Incidentally, it is possible to put MySQL into a stricter "SQL mode" with a statement like SET sql_mode = 'TRADITIONAL';
This is rarely done, probably because most software somehow implicitly relies on this broken behaviour. By the way, does anyone else think it's funny that this mode is called "traditional"? As if it were somehow old-fashioned to expect precise and correct behaviour!
Take back control
It is high time people realised that implicit behaviour and unclear specifications are a recipe for disaster. Computers are by nature rigid and exact. This is a feature we should embrace. Processes in the "real world" are often fuzzy and poorly defined, usually because they are poorly understood. As programmers, it's our job to keep digging until we have enough information to describe the task to a computer. Making APIs fuzzier is the wrong response to this problem, and a sign of weakness. Do you prefer to know exactly what your program will do, or would you rather admit defeat and allow fuzziness to creep into your programs?
In case you're wondering, this rant didn't come out of the blue. One of three reasons this blog is called more magic is as a wry reference to the trend of putting more "magic" into APIs, which makes them hard to control. This is a recurring frustration of mine and I would like to see a move towards less magic of this kind. Yeah, I'm a cynical bastard ;)