The 5 most recent posts (archive) 
Random thoughts on the substring procedure Posted on 2013-02-17
Recently there was a small flame war on the Chicken-hackers mailing list. A user new to Scheme asked an innocuous question that drew some heated responses:
Is there a good reason for this behavior?
# perl -e 'print substr("ciao",0,10);'
ciao
# ruby -e 'puts "ciao"[0..10]'
ciao
# python -c 'print "ciao"[0:10];'
ciao
# csi -e '(print (substring "ciao" 0 10))'
Error: (substring) out of range 0 10
Some popular dynamic languages have a generic "slice" operator which allows the user to supply an end index that's beyond the end of the object, and it'll return from the start position up until the end. Instead, Chicken (and most other Schemes) will raise an error.
On the list, I argued that taking characters 0 through 10 from a 3-character string makes no bloody sense, which is why it's signalling an error. For the record: this can be caught by an exception handler, which makes it a controlled error situation, not a "crash".
Our new user retorted that it's perfectly sane to define the substring procedure as:
Return a string consisting of the characters between the start position and the end position, or the end of the string, whichever comes first.
I think this is a needlessly complex definition. It breaks the rule "do one thing and do it well", from which UNIX derives its power: Conceptually crisp components ease composition.
One of the most valuable things a programming language can offer is the ability to reason about code with a minimum of extra information. This is also why most Schemers prefer a functional programming style; it's easier to reason about referentially transparent code. Let's see what useful facts we can infer from a single (substring s 0 10) call:
- The variable s is a string.
- The string s is at least 10 characters long.
- The returned value is a string.
- The returned string is exactly 10 characters long.
If either of the preconditions doesn't hold, it's an error situation and the code following the substring call will not be executed. The above guarantees also mean, for example, that if later you see (string-ref s 8) this will always return a character. In "sloppier" languages, you lose several of these important footholds. This means you can't reason so well about your code's correctness anymore, except by reading back and dragging in more context.
Finally, it is also harder to build the "simple" version of substring on top of the complex one than it is to build the complex one as a "convenience" layer on top of the simpler one. On our list it was quickly shown that it's trivial to do so:
(define (substring/n s start n) (let* ((start (min start (string-length s))) (end (min (string-length s) (+ start n)))) (substring s start end))) ;; Easy to use and re-use: (substring/n "ciao" 1 10) => "iao"
There's even an egg for Chicken called slice which provides a generic procedure which behaves like the ranged index operator in Python/Ruby.
A tangential rant on the hidden costs of sloppiness
The difference in behaviour between these languages is not a coincidence: it's a result of deep cultural differences. The Scheme culture (and in some respects the broader Lisp culture) is one that tends to prefer correctness and precision. This appears in many forms, from Shivers' "100% correct solution" manifesto to Gabriel's Worse Is Better essay and all the verbiage dedicated to correct "hygienic" treatment of macros.
In contrast, some cultures prefer lax and "do what I mean" over rigid and predictable behaviour. This may be more convenient for writing quick one-off scripts, but in my opinion this is just asking for trouble when writing serious programs.
Let's investigate some examples of the consequences of this "lax" attitude. You're probably aware of the recent discovery of several vulnerabilities in Ruby on Rails. Two of these allowed remote code execution simply by submitting a POST request to any Rails application. As this post explains, the parser for XML requests was "enhanced" to automatically parse embedded YAML documents (which can contain arbitrary code). My position is that YAML has absolutely nothing to do with XML (or JSON), which means that if a program wants to parse YAML embedded in XML it must do that itself, or at least explicitly specify it wants automatic type conversions in XML/JSON documents. The Rails developers allowed misplaced convenience and sloppiness to trump precision and correctness, to the point that nobody knew what their code really did.
Another example would be the way PHP, Javascript, and several other languages implicitly coerce types. You can see the hilarious results of the confusion this can cause in the brilliant talk titled "Wat". There are also people filing bug reports for PHP's == operator. Its implicit type conversion is documented, intended, behaviour but it results in a lot of confusion and, again, security issues, as pointed out by PHP Sadness #47. If you allow the programmer to be sloppy and leave important details unspecified, an attacker will gladly fill in those details for you.
Some more fun can be had by looking at the MySQL database and how it mangles data. The PostgreSQL culture also strongly prefers correctness and precision, whereas MySQL's culture is more lax. The clash between these two cultures can be seen in a thread on the PostgreSQL mailinglist where someone posted a video of a comparison between PostgreSQL and MySQL's behaviour. These cultural differences run deep, as you can tell by the responses of shock. And again, the lax behaviour of MySQL has security implications. The Rails folks have discovered that common practices might allow attackers to abuse MySQL's type coercion. Because Rails supports passing typed data in queries, it's possible to force an integer in a condition that expects a string. MySQL will silently coerce non-numerical strings to zero:
SELECT * FROM `users` WHERE `login_token` = 0 LIMIT 1;
This will match the first record (which usually just happens to be the administrative account). Just as with the innocent little substring behaviour we started our journey with, it is possible to work around this, but things would be a lot easier if the software behaved more rigidly and strict, so that this kind of conversion would only be done upon explicit request of the programmer.
Incidentally, it is possible to put MySQL into a stricter "SQL mode":
SET sql_mode='TRADITIONAL';
This is rarely done, probably because most software somehow implicitly relies on this broken behaviour. By the way, does anyone else think it's funny that this mode is called "traditional"? As if it were somehow old-fashioned to expect precise and correct behaviour!
Take back control
It is high time people realised that implicit behaviour and unclear specifications are a recipe for disaster. Computers are by nature rigid and exact. This is a feature we should embrace. Processes in the "real world" are often fuzzy and poorly defined, usually because they are poorly understood. As programmers, it's our job to keep digging until we have enough information to describe the task to a computer. Making APIs fuzzier is the wrong response to this problem, and a sign of weakness. Do you prefer to know exactly what your program will do, or would you rather admit defeat and allow fuzziness to creep into your programs?
In case you're wondering, this rant didn't come out of the blue. One of three reasons this blog is called more magic is as a wry reference to the trend of putting more "magic" into APIs, which makes them hard to control. This is a recurring frustration of mine and I would like to see a move towards less magic of this kind. Yeah, I'm a cynical bastard ;)
A new domain Posted on 2013-02-11
I've finally decided to get a proper domain name: http://www.more-magic.net. Please update your bookmarks and feed readers!
I used to run this blog on a hostname from the good folks at DynDNS, which I registered in my college days. DynDNS had the benefit of being 100% free (great for poor college students!), but the disadvantage of having to run a tool called ddclient. This tool is intended to update DNS entries for hosts with dynamically assigned IP address, and if you don't run it, your hostname will expire.
Occasionally ddclient gets "stuck", not performing updates anymore. This happens unnoticably, until you get an e-mail from DynDNS stating that your domain will expire in 5 days unless you click the reactivation link and restart ddclient. The hassle of this and the risk of ddclient getting stuck at a bad time, together with the unprofessional quality of running under a domain that's obviously not your own (and harder to remember) finally got me to consider paying for a proper domain. So there you have it: more-magic.net :)
Lessons learned from NUL byte bugs Posted on 2012-12-10
Last time I explained how sloppy representations can cause various vulnerabilities. While doing some research for that post I stumbled across NUL byte injection bugs in two projects. Because both have been fixed now, I feel like I can freely talk about them with a clear conscience.
These projects are Chicken Scheme and the C implementation of Ruby. The difference in the way these systems deal with NUL bytes clearly shows the importance of handling security issues in a structural way. We'll also see the importance of truly grokking the problem when implementing a fix.
A quick recap
Remember that C uses NUL bytes to delimit strings. Many other languages store the length of the string instead. In these languages, NUL bytes can occur inside strings. This can cause unintended reinterpretation when strings cross the language border into C.
In my previous post I already pointed out how Chicken automatically prevents this reinterpretation in its foreign function interface (FFI). You just describe to Scheme that your C function accepts a string, and it will take care of the rest:
(define my-length (foreign-lambda int "strlen" c-string)) ;; Prints 12: (print (my-length "hello, there")) ;; Raises an exception, showing the following message: ;; Error: (##sys#make-c-string) cannot represent string with NUL ;; bytes as C string: "hello\x00there" (print (my-length "hello\x00there"))
The FFI's feature of automatically checking for NUL bytes in strings before passing them on to C was only added in late 2010 (Chicken 4.6.0). However, because everything uses this interface, this mismatch could easily be fixed, in a central location, securing all existing programs in one fell swoop.
Now, you may be thinking "well, that's nothing special; it's good engineering practice that there must be a single point of truth, and that you Don't Repeat Yourself". And you'd be right! In fact, this is a key insight: solid engineering is a prerequisite to secure engineering. It can prevent security bugs from happening, and help to fix them quickly once they are discovered. A core tenet of "structural security" is that without structure, there can be no security.
When smugness backfires
To drive home the point, let's take a look at what I discovered while writing my previous blog post. After describing Chicken's Right Way solution and feeling all smug about it, I noticed an embarrassing problem: for various reasons (some good, others less so), there are places in Chicken where C functions are called without going through the FFI. Some of these contained hand-rolled string conversions!
It turns out that we overlooked these places when first introducing the NUL byte checks, and as a consequence several critical procedures (standard R5RS ones like with-input-from-file) were left vulnerable to exactly this bug:
;; This program outputs "yes" twice in Chickens < 4.8.0 (with-output-to-file "foo\x00bar" (lambda () (print "hai"))) (print (if (file-exists? "foo") "yes" "no")) (print (if (file-exists? "foo\x00bar") "yes" "no"))
To me, this just validates the importance of approaching security measures in a structural rather than an ad-hoc way; the bug was only in those parts of the code that didn't use the FFI. Deviation from a rule is where bugs are often found!
You can also see that we fixed it as thoroughly as possible, especially given the at times awkward structure of the Chicken code. We commented every special situation extensively, assigned a new error type C_ASCIIZ_REPRESENTATION_ERROR for this particular error, and added regression tests for at least each class of functionality (string to number conversion, file port creation, process creation, environment access, and low-level messaging functionality). There's definitely room for improvement here, and I hope to one day reduce the special cases to the bare minimum. By documenting special cases it's easy to avoid introducing new problems. It also makes them easier to find when refactoring. The tests help there too, of course.
When you run the above program in a Chicken version with the fix, it behaves like expected:
Error: cannot represent string with NUL bytes as C string: "foo\x00bar"
Another approach
The Ruby situation is a little more complicated. It has no FFI but a C API, so it works the other way around: you write C to interface "up" into Ruby. It has a StringValueCStr() macro, which is documented as follows (sic):
You can also use the macro named StringValueCStr(). This is just like StringValuePtr(), but always add nul character at the end of the result. If the result contains nul character, this macro causes the ArgumentError exception.
However, this isn't consistently used in Ruby's own standard library:
File.open("foo\0bar", "w") { |f| f.puts "hai" } puts File.exists?("foo") puts File.exists?("foo\0bar")
In Ruby 1.9.3p194 and earlier, this shows the following output, indicating it's vulnerable:
true
test.rb:4:in `exists?': string contains null byte (ArgumentError)
from test.rb:4:in `<main>'
It turns out that internally, Ruby strings are stored with a length, but also get a NUL byte tacked onto the end, to prevent copying when calling C functions. This performance hack undermines the safety of Ruby to C string conversions, and is the direct cause of these inconsistencies. True, there is a safe function that extracts the string while checking for NUL bytes, but there are also various ways to bypass this, and if you accidentally use the wrong macro to extract the (raw) string, your code won't break. Of course, this is only true for benign inputs...
The complexity of Ruby's implementation makes it hard to ensure that it's safe everywhere. Indeed, the various places where strings are passed to C all do it differently. For example, the ENV hash for manipulating the POSIX environment has its own hand-rolled test for NUL, which you can easily verify; it produces a different error message than the one exists? gave us earlier:
irb(main):001:0> ENV["foo\0bar"] = "test" ArgumentError: bad environment variable name
There is no reason this couldn't just use StringValueCStr(). So, even though Ruby has this safe macro, which provides a mechanism to check for poisoned NUL bytes in strings, it's rarely used by Ruby's own internals. This could be fixed just like Chicken; here too, the best way to do that would be to generalize and eliminate all special cases. Simpler code is easier to secure.
A fundamental misunderstanding
When I reported the bug in the File class to the Ruby project, they quickly had a fix, but unfortunately they seemed uninterested in going through Ruby's entire code to fix all string conversions (quoting from private e-mail conversation):
> I agree that this looks like a good place to fix the File/IO > class, but there are many other places where strings are passed to C. > Are all of those secured? All path names should be converted with "to_path" method if possible. If any methods don't obey the rule, it is another bug. Please let us know if you find such case.
In retrospect, there is the possibility that I didn't quite make myself clear enough. Perhaps this person thought I was referring to other path strings in the code. However, to me it sounds a lot like they made the same conceptual mistake that the PHP team made when they "fixed" NUL injections.
The PHP solution was to add a special "p" flag for converting path strings. This happens for all PHP functions declared in C (via zend_parse_parameters()). By the way, notice how this is a new flag. There probably are tons of PHP extensions out there which aren't using this flag yet. Also, who can verify that they managed to find all the strings in PHP which represent paths?
The PHP team was completely missing the point here. This fix means that path arguments aren't allowed to have embedded NUL bytes. Other string type arguments are not checked. They are missing the fact that this isn't just a path issue. Rather, as I described before, it's a fundamental mismatch at the language boundary where strings are translated from the host language to C. However, there seems to be a widespread belief that this can only be exploited in path strings.
I'm not entirely sure why this is, but I can guess. First off, "poisoned NUL byte" attacks have been popularized by a 1999 Phrack article. This article shows a few attacks, but only the path examples are really convincing. Of course, another reason is that injecting NUL bytes in path strings really is the most obvious and practical way to exploit web scripts.
Recently, however, different NUL byte attacks have been documented. For example, they can be used to truncate LDAP and SQL queries and to bypass regular expression filters on SQL input, but you could argue these are all examples of failure to escape correctly. I found a more convincing example in the (excellent!) book The Tangled Web: it contains a one-sentence warning about using HTML sanitation C libraries from other languages. Also, NUL bytes can sometimes be used to hide attacks from log files.
However, the most impressive recent exploit is without a doubt this common vulnerability in SSL certificate verification systems. In an attack, an embedded NUL byte causes a certificate to be accepted for "www.paypal.com", when the CN (Common Name) section (that is, the server's hostname) actually contains the value "www.paypal.com\0.thoughtcrime.org". Certificate authorities generally just accepted this as a valid subdomain of "thoughtcrime.org", ignoring the NUL byte. Client programs (like web browsers) tended to use C string comparison functions, which stop at the NUL byte. Luckily, this was widely reported, and has been fixed in most programs.
I believe that NUL byte mishandling represents a big and mostly untapped source of vulnerabilities. High-level languages are gaining popularity over C for client-side programs, but many crucial libraries are still written in C. This combination means that the problem will grow unless this is structurally fixed in language implementations.
Structurally fixing injection bugs Posted on 2012-09-23
The two biggest threats to the web are caused by the same underlying mistake. It is time this problem was fixed at its root. This article will attempt to provide the tools do so.
Input sanitation, input filtering or output escaping?
The Open Web Application Security Project (OWASP) does a great job at educating people and suggesting practical solutions to avoid common weaknesses. Unfortunately, most security bloggers focus on vulnerabilities rather than the prevention of attacks, and those that do often give bad advice. For example, common advice is to avoid XSS (cross-site scripting) and SQL injection bugs by "sanitizing" or "validating" input. Now, by itself this is good advice.
Unfortunately, the phrase "sanitize your inputs" is often misunderstood and the advice itself can be misguided. For example, Chris Shiflett says:
If you reject [anything but alphanumerics], Berners-Lee and O'Reilly will be rejected, despite being valid last names. However, this problem is easily resolved. A quick change to also allow single quotes and hyphens is all you need to do. Over time, your input filtering techniques will be perfected.
I think this advice is a little unhealthy. Those are valid names, and rejecting them will only scare away customers and reinforce the idea that the "security Nazis" are out to inconvenience people. I wish people would place less emphasis on filtering and sanitizing. Citing this XKCD comic has become a cliché, which (while funny) makes it worse:

Validating and sanitizing input is good when it refers to parsing input into meaningful values immediately when receiving it, so that you don't, say, get a URL when you are expecting an integer. The horror story of ROBCASHFLOW shows how important input restrictions can be (but see also this cautionary list. Tl;dr: you're doomed either way).
However, input sanitation will (in general) not prevent XSS or SQL injection. The OWASP XSS prevention "cheat-sheet" recognizes input validation and sanitation for what it is; a good secondary security measure in a broader "defense in depth" strategy.
Instead, there are only two correct ways to prevent "injection" bugs. The best is often even omitted from advice, which is to avoid the problem entirely (see below). The other is to escape output. Unfortunately, advice to escape often seems to imply that you should manually call escaping procedures; "just use mysql_real_escape_string()". This is a very bad idea; it's tedious, it's easy to forget, it makes code less readable and it requires everyone working on the code to be equally informed about security issues (a great idea, but not very realistic).
Let's investigate how we can prevent these vulnerabilities easily and automatically. This will help us secure applications in a structural rather than an ad-hoc way.
The trouble with strings
The underlying problem of all these vulnerabilities is that a tree structure (e.g., the SQL script's AST or the HTML DOM tree) is represented as a string, and user input which should be a node in the tree is inserted into this string. If this includes characters from the meta-language which describes the tree's structure, it can influence that structure. Here's an example:
<p>{username} said the following: {message}</p>
When message is "So you see, if a<b and c<a, then b>c.", you get output like this (depending on the browser, HTML version and phase of the moon):
Math teacher said the following: So you see, if ac.
This code is simply incorrect, and this bug will frustrate users like the math teacher. But this can turn into a security nightmare; any punk can make you look like a fool by making your images dance around, taking over your users' sessions by stealing cookies, or do much worse. The underlying reason this nonsense is possible at all is the fact that you are mixing user input strings with HTML.
In other words, you're performing string surgery on the serialized representation of a tree structure. Just stop and think how insane that really sounds! Why don't we use real data types? While researching this topic, I found an insightful article called "Safe String Theory for the Web". The author has a good grasp on the problem and comes close to the solution, but he never transcends the idea of representing everything as a string.
Many people don't, so despite the flawed concept, there are several solutions that take string splicing as a given. For instance, some frameworks have a sort of "safe HTML buffers", which automatically HTML-escape strings. These solutions don't deal with the context problem from "Safe String Theory for the Web". There's only one built-in string type, and making it context-aware is extremely hard, maybe even impossible. Strongly typed languages have an advantage here, though!
Representing HTML as a tree helps preventing injection bugs, and has other advantages over automatic escaping. For example, we need to worry less about generating invalid HTML; our output is always guaranteed to be well-formed. The essence of an XSS attack is that it breaks up your document structure and re-shapes it. These are just two sides to the same coin: By taking control of the HTML's shape, XSS is also avoided.
There's another, more insidious problem with splicing HTML strings, which I haven't seen discussed much either. It's another context problem; if your complex web application contains many "helper" functions, it becomes very hard to keep track of which helper functions accept HTML and which accept text. For example, is the following PHP function safe?
function render_latest_topicslist() {
$out = '';
foreach(Forum::latestPosts(10) as $topic) {
$link = render_link('forum/show/'.(int)$topic['id'], $topic['title']);
$out .= "<li>{$link}</li>";
}
return "<ul id=\"latest-topics\">{$out}</ul>";
}
This is (of course) a trick question. Consider:
$dest = htmlspecialchars($dest, ENT_QUOTES, 'UTF-8');
echo render_link($dest_url, "<span>Go to <em>{$dest}</em> directly.</span>");
Either this second example is wrong and the tags will come out literally (i.e., as <span>...</span> in the HTML source), or the first example was wrong and you have an injection bug. You can't tell without consulting render_link's API documentation or implementation. With many helper procedures, how can you keep track of which accept fully formed HTML and which escape their input? What happens when a function which auto-encodes suddenly needs to be changed to accept HTML?
This style of programming results in ad-hoc security. You add escaping in just the right places, decided on a case-by-case basis. This is unsafe by default; you must remember to escape, which makes it error-prone. It's also hard to spot mistakes in this style. The alternative to ad-hoc security is structural security: a style which makes it virtually impossible to write insecure code by accident, thus eliminating entire classes of vulnerabilities.
For example, in PHP we could use the DOM library to represent an HTML tree:
function get_latest_topicslist($document) {
$ul = $document->createElement('ul');
$ul->setAttribute('id', 'latest-topics');
foreach(Forum::latestPosts(10) as $topic) {
$title = $document->createTextNode($topic['title']);
$link = get_link($document, 'forum/show/'.(int)$topic['id'], $title);
$li = $document->createElement('li');
$li->appendChild($link);
$ul->appendChild($li);
}
return $ul;
}
And the second example:
$contents = $document->createElement('span');
$contents->appendChild($document->createTextNode('Go to '));
$em = $document->createElement('em');
$em->appendChild($document->createTextNode($dest));
$contents->appendChild($em);
$contents->appendChild($document->createTextNode(' directly.'));
$link = get_link($document, $dest_url, $contents);
Unfortunately, this code is very verbose. The stuff that really matters gets lost in the noise of DOM manipulation. The advantage is that this is safe; text content cannot influence the tree structure, since the type of every function argument is enforced to be a DOM object and string contents are automatically XML-encoded on output.
Language design to the rescue!
Language design can help a great deal to improve security. For example, domain-specific languages like SXML and SSQL can save the programmer from having to remember to escape while writing most "normal", day-to-day code. This frees precious brain cycles to think about more essential things, like the program's purpose. Here's the example again, using SXML:
(define (latest-topics-list)
`(ul (@ (id "latest-topics"))
,(map (lambda (topic)
`(li ,(make-link `("forum" "show" (alist-ref 'id topic))
(alist-ref 'title topic)))))
(forum-latest-posts 10)))
And the second example:
(make-link destination-url `(span "Go to " (em ,destination) " directly."))
This code is safe from XSS, like the PHP DOM example. However, this code is (to a Schemer) just as readable as the naive PHP version. And, most importantly, the safety is achieved without any effort from the programmer.
This shows the immense safety and security advantages that can be gained from language design. Of course, this isn't completely foolproof: We still need to ensure URIs used in href attributes have an allowed scheme like http: or ftp: and not, say, javascript:. Note that input filtering and sanitation can help in situations like these! Also, just like with automatic escaping, strings in sub-languages (like JS or CSS) aren't automatically escaped. However, there is less "magic" involved; this is a representation for HTML, so it's obvious that only HTML meta-characters will be encoded. If we're also using DSLs for sub-languages, this auto-escaping effect can be nested, solving the "context problem" in a way automatic escaping cannot.
SXML rewards programmers for writing safe code by making it look clean, concise, and easy to write. String splicing looks ugly and verbose in Scheme. In plain PHP this looks clean and simple, while DOM manipulation looks ugly. This subtly guides programmers into writing unsafe code. However, there are some PHP libraries that make safe code look clean. For example, Drupal has a "Forms API". It's a little ugly, but it's idiomatic in Drupal, which means code that uses it is considered cleaner than code that doesn't. Facebook is another attractive target for attackers, so they had to come up with a structural solution. Their solution is a language extension called XHP which adds native support for HTML literals.
These solutions are all specific to some codebase, not part of basic PHP. A framework or an existing codebase has "default" libraries, but when writing from scratch most programmers prefer to use what's available in the base language. This means a language should only include libraries that are safe by default. Otherwise, alternative safe libraries have to compete with the standard ones, which is an unfair disadvantage!
Sidestepping the SQL injection problem entirely
Even though it's possible to write safe code in almost any language if you try hard enough, the basic design of a language itself subtly influences how people will program in it by default. Consider the following example, using the Ruby PG gem:
# This code is vulnerable to SQL injection if the variables store user input res = db.query("SELECT first, last FROM users " "WHERE login = '#{login}' " "AND customer = '#{customer}' " "AND department = '#{department}'")
Here we're using string interpolation, which is the expansion of variable names within a string. We saw this before, in PHP, but in Ruby you can drop back to the full language, which makes the safe solution a little easier to write:
# This code is safe res = db.query("SELECT first, last FROM users " "WHERE login = '#{db.escape_string(login)}' " "AND customer = '#{db.escape_string(customer)}' " "AND department = '#{db.escape_string(department)}'")
Still, it looks uglier than the first example.
The documentation says the escape_string method is considered deprecated. That's because sidestepping the problem entirely is much smarter than escaping. This is done by passing the user-supplied values completely separate ("out of band") from the SQL command string. This way, the data can't possibly influence the structure of the command. They are kept separate even in the network protocol, so it is enforced all the way up into the server. As an added bonus, this is only slightly more verbose than the naive version:
# This code is even safer res = db.query("SELECT first, last FROM users " "WHERE login = $1 AND customer = $2 AND department = $3", [login, customer, department])
This scales only to about a dozen parameters. With more, it becomes hard to mentally map the correct parameter to the correct position. A DSL can do this automatically for you. For example, Microsoft's LinqToSQL language extension seems to do this. SSQL currently auto-escapes, but it could transparently be changed to use positional parameters.
Pervasive (in)security through (bad) design
I'm not a native English speaker, so I looked up the word "interpolation" on Merriam-Webster:
interpolate, transitive verb: To alter or corrupt (as a text) by inserting new or foreign matter
To corrupt, indeed!
Interpolation of user-supplied strings is rarely correct, and it puts almost any conceivable safe API at a disadvantage by making the wrong thing easier and shorter to write than the right thing. Beginners, unaware of the security risks, will try to use this "neat feature". It's put in there for a reason, right? Some people are trying to fix string interpolation, which is a noble goal but I wouldn't expect this to be adopted as the "native" string interpolation mechanism in a language any time soon.
The Ruby examples show the importance of good documentation and library design. The docs pointed us in the right direction by marking the escape_string method as deprecated. Its good design is more apparent when contrasting it with the MySQL gem. This has no support for positional arguments in query, having only escape_string and prepare. The latter allows you to pass parameters separately, but it conflates value separation with statement caching and has an unwieldy API. Finally, the docs are quite sparse. Taken together, this all gently nudges developers into the direction of string interpolation by making that the easiest way to do it. Much of this is due to the design of MySQL's wire protocol, which dictates the API of the C library, which in turn guides the design of "high-level" libraries built on top of it.
I think high-level libraries should strive to abstract away unsafe or painful aspects of the lower levels. For example, the Chicken MySQL-client egg emulates value separation:
(conn (string-append "SELECT first, last FROM users "
"WHERE login = '?login' "
"AND customer = '?cust' "
"AND department = '?dept'")
`((?login . ,login) (?cust . ,customer) (?dept . ,department)))
Ruby's MySQL gem could easily have done this, but they chose to restrict themselves to making a thin layer which maps closely to the C library.
Not all is lost with crappy libraries: Abstractions can solve such problems at an even higher level. Rails can safely pass query parameters via Arel, in a database-independent way, even though MySQL is one of the back-ends. This is true for SQLAlchemy, PDO and many others.
Other examples
This section will show more examples of the same bug. They can all be structurally solved in two simple ways: Automatic escaping (by using proper data structures) or passing data separately from the control language. But let's start with one where this won't work :)
Poisoned NUL bytes
As you may know, strings in the C language are pointers to a character array terminated by a NUL (ASCII 0) byte. Many other languages represent strings as a pointer plus a length, allowing NUL "characters" to simply occur in strings, with no special meaning.
This representational mismatch can be a problem when calling C functions from these languages. In many cases, a C character array of the length of the string plus 1 is allocated, the string contents are copied from the "native" string to the array and a NUL byte is inserted at the end. This causes a reinterpretation of the string's value if it contains a NUL byte, which opens up a potential vulnerability to a "poisoned" NUL byte attack.
Let's look at a toy example in Chicken Scheme:
(define greeting "hello\x00, world!") (define calculate-length-in-c (foreign-lambda int "strlen" c-string)) (print "Scheme says: " (string-length greeting)) (print "C says: " (calculate-length-in-c greeting))
As far as Scheme is concerned, the NUL byte is perfectly legal and the string's length is 14, but for C, the string ends after hello, which makes for a length of 5. There is no way in C to "escape" NUL bytes, and we can't sidestep it here, either. Our only option is to raise an error:
Scheme says: 14
Error: (##sys#make-c-string) cannot represent string with
NUL bytes as C string: "hello\x00, world!"
This is a good example of structural security; it doesn't matter whether the programmer is caffeine-deprived, on a tight deadline or simply unaware of this particular vulnerability. He or she is protected from accidentally making this mistake because it's handled at the boundary between C and Scheme, which is exactly where it should be handled.
HTTP response splitting/Header injection
HTTP response splitting and HTTP header injection are two closely related attacks, based on a single underlying weakness.
The idea is simple: HTTP (response) headers are separated by a CRLF combination. If user input ends up in a header (like in a Location header for a redirect), this can allow an attacker to split a header in two by putting a separator in it. Let's say that http://example.com/foo gets redirected to http://example.com/blabla?q=foo.
An attacker can trick someone (or their browser) into following this link (%0d%0a is an URI-encoded CRLF pair):
http://www.example.com/abc%0d%0aSet-Cookie%3a%20SESSION%3dwhatever-i-want
This could cause the victim's session cookie for example.com to be overwritten:
Location: http://www.example.com/blabla?q=abc Set-Cookie: SESSION=whatever-i-want
This is a session fixation attack. For this particular bug, the real solution is of course to properly percent-encode the destination URI, but the general solution can be as simple as disallowing newlines in the header-setting mechanism (e.g., PHP does this since 5.1.2). Doing it in the only procedure which is capable of emitting headers is a structurally secure approach, but it won't protect against all attacks.
For example, even if we disallow newlines it is still possible to set a parameter (attribute) or a second value for a header, splitting it with a semicolon or a comma, respectively:
Accept: text/html;q=0.5, text/{user-type}
If this is done unsafely, extra content-types can be added. They can even be given preference:
Accept: text/html;q=0.5, text/plain;q=0.1, application/octet-stream;q=1.0
Protecting against these sorts of attacks can only be done with libraries which know each header's syntax and use rich objects to represent them. This approach is taken by intarweb and Guile's HTTP library, and is similar to representing HTML as a (DOM) tree. I'm not aware of any other libraries which use fully parsed "objects" to represent HTTP header values.
Running subprocesses
For some reason, often people use a procedure like system() to invoke subprocesses. It's the most convenient way to quickly run a program just like you would from the command line. It will pass this string to the Unix shell, which expands globs ("wildcards") and runs the program:
(system (sprintf "echo \"~A\"" input)) ;; UNSAFE: byebye files"; rm -rf / "
Several languages have specialized syntax for invoking the shell and putting the output in a string using backticks, e.g., `echo hi`. The really bad part is that string interpolation is supported within the backtick operator, e.g., `echo Hi, "{$name}"`. This is dangerous because the shell is yet another interpreter with its own language, and we've learned by now that we shouldn't embed user input directly into a sublanguage. Here too, string interpolation makes the wrong thing very convenient, which increases the risk of abuse and bugs. After all, spaces and quotes are perfectly legal inside filenames, but when used with unsafely interpolated parameters, they will cause errors.
It is possible to escape shell arguments, but it's very tricky: no two shells provide exactly the same command language with the same meta-characters. Is your /bin/sh really bash, dash, ash, ksh or something else? It is even unspecified whether the sh used is /bin/sh.
However, a better approach is often available. Many programming languages offer an interface to one or more members of the POSIX exec() function family. These allow passing the arguments to the program in a separate array, and they don't go through the shell to invoke the program at all. This is faster and a lot more secure.
(use posix) ;; Accepts a list of arguments: (process "echo" (list "Hello, " first-name " " last-name))
By sidestepping the problem we've made it simpler, shorter than the system call above and safer, which is our goal. In languages with string interpolation this will probably be slightly more verbose than the system() version.
There is one small problem: by eliminating a call to the shell, we've also lost the ability to easily construct pipelines. This can be done by calling several procedures, but this is way more complicated than it is in the shell language. The obvious solution to that is to design a safe DSL. This is what the Scheme Shell does with its notation for UNIX pipelines:
;; This will echo back the input, regardless of "special" characters (define output (run/string (| (echo input) (caesar 5) (caesar 21)))) (display output)
Almost as convenient as the backtick operator, but without its dangers.
Summary
Language design can help us write applications which are structurally secure. We should strive to make writing the right thing easier than the wrong thing so that even naively written, quick and dirty code has some chance of being safe. To reach this goal, we can use the following approaches, in roughly decreasing order of safety:
- "Sidestep" the issue by keeping data separated from commands.
- Represent data in proper data structures, not strings. On output, escape where needed.
- Use "safe buffers" which auto-escape concatenated strings.
- If escaping or separation is impossible, raise an error on bad data.
- If all else fails you can escape manually, but use coding conventions that make unsafe code stand out.
These approaches are your first line of defense. Besides using these, you should also filter and sanitize your input. Just don't mistake that as the fix for injection vulnerabilities!
This is the positive advice I can give you. The negative advice is simply to avoid building language or library features which make unsafe code easier to write than safe code. An example of such a feature is string interpolation, which causes more harm than good.
Designing Lispy DSLs, part 4: SSQL Posted on 2012-08-20
Today we'll look at an old, experimental DSL of my own design. I've always referred to it as a failed experiment, but perhaps it's really a successful experiment, because it helped me figure out why this type of DSL doesn't work too well. Whatever the status, I'll use it as an example of what makes a bad DSL.
The DSL in question is SSQL, a way of embedding SQL as S-expressions into Scheme code. Interestingly, it seems I had a bad feeling about it from the start; the initial commit had the following message:
Add another doomed project - ssql
It turned out not to be completely doomed, because Moritz Heidkamp has kindly taken over maintenance and has been improving and polishing the library. I might even use it again for my own projects if I ever get tired of working directly with SQL.
Scoped access
For my day job I used to write a lot of Rails code, and I got tired of the restrictions in ActiveRecord. I have to mention that this was in the days before Arel, which is a great improvement in the way you can use custom queries in Rails.
With ActiveRecord, you could write code that would automatically prevent users from accessing things they shouldn't be able to access with the scoped_access plugin. This allowed you to write things in your controller like the following:
scoped_access Customer def method_scoping ScopedAccess::ClassScoping.new(Customer, :user => {:id => current_user.id}) end
I don't recall exactly how it worked, but when you had a complex query, this could cause clashes when the same table was joined in twice, especially if the condition was complex. In different situations, different queries could be generated. Back then, you also needed to know internally-generated join aliases in order to scope related tables. Remember, this was quite a while ago, and I was a bit of a newbie and had been programming Ruby and Rails for only a year or two. There may have been better ways to do this even then.
In any case, this scoping problem annoyed me no end and I knew there had to be a better way. It was obvious that if you represent the query in a more complex data structure than a simple string, you can easily fetch all the references to a particular table (even if it is aliased), and add some scoping to it. This could be done even if it required the addition of joins, and even if those tables were already joined under arbitrary names, as long as you would alpha-rename all aliases to avoid clashes with user-created aliases.
Here's an example of the SSQL syntax. This example is based on a toy data model for an IMDB-clone with films, actors and their roles in them:
'(select (columns (col actors id firstname lastname)
(col roles character movie_id))
(from actors roles)
(where (and (= (col actors firstname) "Bruce")
(= (col actors lastname) "Campbell")
(= (col actors id) (col roles actor_id)))))
The regular SQL equivalent of this:
SELECT actors.id, actors.firstname, actors.lastname,
roles.character, roles.movie_id
FROM actors, roles
WHERE actors.firstname = 'Bruce'
AND actors.lastname = 'Campbell'
AND actors.id = roles.actor_id;
The SSQL for column selection can be a little ugly or verbose, so it's also allowed to specify columns with a dot instead of the col form (probably a mistake, complicating the DSL design):
'(select (columns actors.id actors.firstname actors.lastname
roles.character roles.movie_id)
(from actors roles)
(where (and (= actors.firstname "Bruce")
(= actors.lastname "Campbell")
(= actors.id roles.actor_id))))
The columns "noise word" is still required, because that makes it easier to walk the expression and programmatically manipulate it. In any case, scoping a table is easy, even for arbitrarily complex cases:
(let ((query '(select (columns actors.firstname actors.lastname roles.character movies.title) (from (join left (join left actors (join inner roles (as movies m2) (on (and (= m2.id roles.movie_id) (> m2.year 2000)))) (on (= roles.actor_id actors.id))) movies (on (= movies.id roles.movie_id))))))) (scope-table 'movies '(< (col movies year) 2005) query)) ;; Results in the following: (select (columns actors.firstname actors.lastname roles.character movies.title) (from (join left (join left actors (join inner roles (as movies m2) (on (and (= m2.id roles.movie_id) (> m2.year 2000)))) (on (= roles.actor_id actors.id))) movies (on (= movies.id roles.movie_id)))) (where (and (< (col m2 year) 2005) (< (col movies year) 2005))))
The initial query selects all the films in the database, including all actors with the roles they played in that film. However, the actors are only included for films that were released after the year 2000. Earlier films are returned without the actors.
Now, the magic happens in the call to scope-table, which returns the same query, but with all occurrences of the movies table scoped to include only films released before the year 2005. Note that this scopes both the main query and the joined table m2 even though it's aliased.
It's all about the syntax
Okay, so it turns out that this idea works beautifully. Let's look at why I think this DSL was a failure. One reason is the fact that SQL is a huge language, especially when you consider all the extensions provided by various implementations.
You could say "but you don't have to support the full language". That's true, but the problem with a language that maps directly to SQL is that users will expect being able to do all the things they can do in regular SQL. For example, when Common Table Expressions were first introduced into PostgreSQL, I started seeing many places in my code bases at work where those would be useful. The same was true for Window Functions. These are both extremely useful extensions, and I'm now making regular use of them. I wouldn't want to miss them, so any SQL DSL really needs to support them for me to take it seriously.
The thing both extensions have in common is that they introduce completely new syntax. That's because there are absolutely no common building blocks for language constructs; every feature is a set of arbitrarily-placed keywords to help a parser make sense of it (with many optional "noise" keywords to help a human make sense of it). This means each feature has to be taught separately to SSQL, resulting in a large set of rules on how to convert them to SQL.
The SQL grammar is so complicated that its sheer size has serious performance implications on a parser, as pointed out by this blog post. Because EXPLAIN is a PostgreSQL extension, they simply decided to change this command's syntax to make it faster to parse. The old syntax is still supported for backwards-compatibility, but this change is a great illustration of how much of a moving target the SQL syntax really is. Other SQL implementations don't generally move as fast as PostgreSQL in adding features, but as I indicated earlier, I really like these features and use them on a regular basis.
Database independence with SQL-based syntax?
Another complication is supporting multiple databases. SSQL supports ANSI SQL as a baseline, with optional extensions that are available if the back-end supports it. The nice thing is that this provides a degree of database independence. All back-ends can automatically quote strings and table names correctly depending on the database, making SQL injection bugs effectively impossible. For example,
'(select (columns (col actors firstname lastname birth-date))
(from actors)
(where (= actors.lastname "O'Neill")))
gets output as the following in PostgreSQL and SQLite:
SELECT actors.firstname, actors.lastname, actors."birth-date" FROM actors WHERE actors.lastname = 'O''Neill';
The MySQL back-end outputs the following:
SELECT actors.firstname, actors.lastname, actors.`birth-date` FROM actors WHERE actors.lastname = 'O\'Neill';
These differences are relatively small and don't affect the syntax of the S-expression version. However, there are other examples that do. For example, MySQL's INSERT statement allows syntax which mirrors the UPDATE statement, using SET:
INSERT INTO movies SET title = 'Alien', year = 1979;
whereas PostgreSQL only allows the standard syntax (which MySQL also supports):
INSERT INTO movies (title, year) VALUES ('Donnie Darko', 2001);
The question then becomes whether the (unnecessary) syntax with SET should be allowed, and, if so, whether this should be emulated in PostgreSQL by rewriting it to the standard syntax. There are tens of such silly examples (CONCAT versus || versus logical OR, case insensitive LIKE versus ILIKE, etc), but there are a lot of more fundamental differences, too. Finally, using ANSI as a baseline is nice, but many of ANSI's features aren't widely implemented. Common Table Expressions are a good example; they're standardized, but neither MySQL nor SQLite support them, and Postgres only started supporting them very recently. Oh, and fuck proprietary RDBMSes; Oracle long ignored ANSI and invented more nonstandard extensions than MySQL ever did, and as a result, their users are as clueless about ANSI SQL as the average MySQL user. Finally, there are many ANSI features that none of these databases support. This means you have to implement a feature in ANSI, then override it to produce an error message saying it's unsupported in this dialect for all implementations that don't support it. An alternative approach is to implement no base but make everything completely implementation-specific. However, this results in a bigger risk of producing DSL inconsistencies between dialects.
A better approach
Recently, the relational algebra has been gaining some more interest. For example, there's Alf for Ruby and the UNIX shell, and of course Arel, which I mentioned earlier.
I think this is a better approach; relational algebra has just a handful of concepts and there's no syntax associated with it, so you can invent your own syntax to best fit your DSL. It also prevents you from getting distracted by the differences in various SQL implementations. You can see this with Alf already; it has total abstraction over the DBMS. It can use flat files or SQL, or any other back-end you'd like, as long as it fits the relational model (to be fair, so can PostgreSQL with SQL/MED foreign data wrappers). The flip side of such a high level of abstraction is that it will be harder to make use of any killer features offered by your RDBMS; you get the lowest common denominator in features.
Optimizing queries also becomes hard. You can no longer hand-optimize them when writing them, and you'd probably end up with an optimizer in your library. This is pretty insane, since there's also an SQL optimizer and query planner inside your RDBMS, so you're doing twice the work, and there's twice the opportunity for getting it wrong.
Despite these disadvantages, the "relational algebra DSL" approach is more viable than the "SQL DSL" approach. ClojureQL also initially took the approach of providing a DSL closely modeled on SQL, but later completely revised the DSL to be more abstract and closer to relational algebra than to SQL.
I think it's interesting to see what other SQL-like DSL projects will do. For example, Clojure also has Korma, which is rather close to SQL and looks like it can currently only perform a limited subset of all possible queries. I wonder what they'll do when users start clamoring for richer back-end support? Racket used to have SchemeQL, but that project seems to have vanished from the web. The website of its parent project, Schematics, doesn't mention it at all anymore. The same seems to have happened to a Common Lisp interface called CL-RDBMS (at least the "homepage" link currently points to a broken web site).
There's a popular library for Common Lisp called CLSQL. It looks like an enormous amount of engineering went into it. If that's required to get a useful SQL DSL, it might not be worth it unless the advantages outweigh the effort required. Note that even after 10 years of development, CLSQL still has no outer join support. I think that's indicative of how hard it is to properly support SQL from a DSL.
Wrap-up
The lessons I learned from the SSQL experiment are in retrospect rather simple, and seem to echo earlier blog posts:
- The language you're targeting should be small and have few core concepts.
- The relevant standards should be fully implemented in all back-ends you want to support.
- Back-ends shouldn't have any arbitrary extensions that you're expected to support.
- Look for an underlying theory; this may be a better abstraction than the target language.
- Try to find examples of similar libraries. Did others try, and fail or give up? If so, why? How complex are existing implementations? Are they complete?
This post will be the last post in this series, at least for a while. There aren't that many other interesting DSLs with which I'm familiar, and I've exhausted the list of novel design concepts that I'm able to distill from existing DSLs.