A Tour of Xr0

Table of Contents

Introduction

Xr0 is a new verifier for the C programming language that eliminates many classes of undefined behaviour, such as use-after-frees (of which double frees are a special case), dangling and null pointer dereferences, memory leaks and the use of uninitialised memory. Most importantly, we believe Xr0 is on a clear trajectory towards enabling programming in C with no undefined behaviour whatever.

Xr0 works for a sizeable subset of ANSI C (C89) currently. As a work in progress, its most significant limitation is that we have not yet implemented verification for loops and recursion, so these are bridged by axiomatic statements supplied by the user. As an example for the kinds of programs it can handle, see the tests in the repository.

This is the programmer’s introduction to Xr0. We will assume that you are familiar with C and focus on explaining how you can use Xr0 to remove safety vulnerabilities from your code.

Like Xr0, this tutorial is a work in progress, so if you don’t understand something, assume that the error is in the tutorial (or Xr0). We’ll be on the Xr0 Zulip regularly to answer any questions.

Installation

Getting Xr0

The first thing to do is clone the repository, hosted here and here. To clone from GitHub:

$ git clone https://github.com/xr0-org/xr0 && cd xr0

Next we want to build Xr0:

$ make

If all goes well, the binary should be present at ./bin/0v. To confirm that everything is OK, run the tests:

$ make test

If all the tests pass, you should be good to use 0v.

The rest of the tutorial assumes 0v is available in the PATH environment variable, which you may do by symlinking or moving it, as you deem appropriate. For example:

$ ln -s /path/to/xr0/bin/0v /usr/local/bin/0v

Configuring libx

Xr0 relies on libx, our annotated version of the C Standard library.

Create a file named hello.x with the following contents:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define GREETING "hello, world!"

char *
greeting() ~ [ return malloc(strlen(GREETING) + 1); ]
{	
	char *s;

	s = malloc(strlen(GREETING) + 1);
	strcpy(s, GREETING);
	return s;
}

main()
{
	char *s;

	s = greeting();
	puts(s);
	free(s);
}

The rest of the tutorial will explain what the annotation to greeting above means. Run the following command to verify the file (replacing /path/to/xr0 with the path to the Xr0 project folder):

$ 0v -I /path/to/xr0/libx hello.x

You should get no output, indicating successful verification of the whole file. However, if you try to verify it without specifying the libx folder,

$ 0v hello.x

you will get an error like the following:

hello.x:1:10: fatal error: 'stdio.h' file not found
#include <stdio.h>
         ^~~~~~~~~
1 error generated.
function: `malloc' not found

To use 0v in this way we have to set the environment variable XR0_INCLUDES, so you could include a line like

export XR0_INCLUDES=/path/to/xr0/libx

inside your bashrc or equivalent, and then you’ll be able to run Xr0 like this:

$ 0v hello.x

Compiling

Apart from the annotations enclosed with ~[ ], Xr0 files contain pure C. Thus to compile files we first strip these sections and then use regular C compilers.

The strip -s option produces a file named 0.c accordingly:

$ 0v -s hello.x

We can compile and execute the file:

$ cc 0.c && ./a.out
hello, world!

To specify the output file, use -o, as in

$ 0v -so hello.c hello.x
$ cc hello.c -o hello
$ ./hello
hello, world!

Verbose mode

To get verbose output use the -v flag:

$ 0v -v hello.x
qed greeting
qed main

One-by-one: first examples of verification

A C program consists of one or more functions
– Brian Kernighan, Programming in C – A Tutorial.

Xr0 verifies a C program by verifying its functions one-by-one. This simple fact has many implications.

A trivially safe function

Within a function, the errors are often easy to detect. This is particularly the case with functions that only rely on the primitive features of C and its standard library, functions that do not call other user-defined functions. badfoo below is one such function, and if we give it to Xr0 it immediately complains about the leak:

#include <stdlib.h>

badfoo()
{
        malloc(1);
}
badfoo: garbage on heap

We get the same error if we write

#include <stdlib.h>

badfoo()
{
        void *p;

        p = malloc(1);
        p = NULL;
}

or even

#include <stdlib.h>

badfoo()
{
        void *p;

        p = malloc(1);
        p = malloc(1);
        free(p);
}

The error means that at the end of badfoo’s body, there is some memory on the heap that is unreachable, garbage as it is called. We can fix this by freeing the memory:

#include <stdlib.h>

foo()
{
        free(malloc(1));
}

Now Xr0 happily verifies the function, saying nothing by default because “silence is golden”. However, if we are in verbose mode Xr0 will output for us

qed foo

to indicate that foo is verified. We will often show verbose output in this introduction for clarity.

Xr0 has no problem with us freeing the memory in other valid ways:

#include <stdlib.h>

foo()
{
        void *p;

        p = malloc(1);
        free(p);
}
qed foo

We can even double (or triple) up, provided we clean up after ourselves:

#include <stdlib.h>

foo()
{
        void *p;

        p = malloc(1);
        free(p);
        p = malloc(1);
        free(p);
}
qed foo

We say that foo is trivially safe because there is no way to call it that is unsafe.1 Thus Xr0 is able to verify it independently; it doesn’t have to look at other functions.

A less-trivially safe function; abstracts; safety semantics

The other way to fix the error is to return a pointer to the memory:

#include <stdlib.h>

void *
foo()
{
        return malloc(1);
}

But Xr0 has a problem with this:

foo: actual and abstract states differ

What’s going on? Remember that Xr0 verifies functions one-by-one. Although foo is a perfectly reasonable function that could be used safely, e.g.

bar()
{
        void *p = foo();
        free(p);
}

it is also possible to use it unsafely:

bar()
{
        foo(); /* leak */
}

Thus foo is not trivially safe, and Xr0 cannot, by looking at foo as it is, verify whether or not it will be used safely.

Whenever a function is not trivially safe, Xr0 requires that we annotate it, specifying how it may be used safely. Here foo has a side effect (allocation of memory on the heap) that its callers need to handle, each either by freeing the returned memory, or by again passing the buck to the next function in the call stack. This is annotated as follows:

#include <stdlib.h>

void *
foo() ~ [ return malloc(1); ]
{
        return malloc(1);
}
qed foo

The annotation inside the square brackets [ ] is called the abstract of the function. It is similar to what is called a contract in some other systems. (Obviously, this annotation syntax is not supported by most C compilers, so Xr0 is bundled with facilities for stripping the ~ [ ] regions. See the section on compiling above for the details.) An abstract expresses everything that callers need to know to safely use the function. These concerns that callers must handle are called the safety semantics of the function.

foo’s abstract here is simple, it says (to functions that will use foo) “foo returns a reference to a heap-allocated region of memory, so you better deal with it”. Thus Xr0 has no objection if we change the body while maintaining consistency with its abstract:

#include <stdlib.h>

void *
foo() ~ [ return malloc(1); ]
{
        void *p;

        p = malloc(1);
        return p;
}
qed foo

We can even write

#include <stdlib.h>

void *
foo() ~ [ return malloc(1); ]
{
        void *p; void *q;

        p = malloc(1);
        q = malloc(2);
        free(q);
        return p;
}
qed foo

Above the region whose address is stored in q and then freed has no net effect on the behaviour of foo, so it doesn’t influence its safety semantics and doesn’t need to be in the abstract.

However, if we change the body such that the function no longer returns a reference to a region on the heap, Xr0 will not let us maintain the same abstract:

#include <stdlib.h>

void *
badfoo() ~ [ return malloc(1); ]
{
        void *p;

        p = malloc(1);
        free(p);
        return p;
}
badfoo: actual and abstract states differ

All that matters to Xr0 is that the abstract capture the safety semantics of the function’s body. We call this correspondence internal consistency, because it means that the function can be treated as a self-contained unit of verification. Thus Xr0 can verify an internally consistent function independently of the rest of the program.

Propagation

Let’s go back to the foo that returns a pointer to a region on the heap. Observe that if we call it we have to be just as careful as we were when calling malloc:

#include <stdlib.h>

void *
foo() ~ [ return malloc(1); ]
{
        return malloc(1);
}

bar()
{
        foo();
}
qed foo
bar: garbage on heap

This is because Xr0 is propagating foo’s safety semantics to bar, using foo’s abstract.

Again, we can resolve this by freeing the memory:

#include <stdlib.h>

void *
foo() ~ [ return malloc(1); ]
{
        return malloc(1);
}

bar()
{
        free(foo());
}
qed foo
qed bar

If instead we want bar to return the memory, we must again say so in its abstract:

#include <stdlib.h>

void *
foo() ~ [ return malloc(1); ]
{
        return malloc(1);
}

void *
bar() ~ [ return foo(); ]
{
        return foo();
}
qed foo
qed bar

This abstract stresses bar’s dependence on foo. However, Xr0 also allows us to annotate bar in this way:

#include <stdlib.h>

void *
foo() ~ [ return malloc(1); ]
{
        return malloc(1);
}

void *
bar() ~ [ return malloc(1); ]
{
        return foo();
}
qed foo
qed bar

This is simply transitivity: because bar’s body is equivalent to foo’s, its abstract can be also. In either case, the concerns in foo’s body are propagated through its abstract into bar, and bar in turn must either handle these concerns or propagate them through its own abstract.

On the same principles of equivalence, we can use foo in an abstract alone, as in baz below:

#include <stdlib.h>

void *
foo() ~ [ return malloc(1); ]
{
        return malloc(1);
}

void *
bar() ~ [ return malloc(1); ]
{
        return foo();
}

void *
baz() ~ [ return foo(); ]
{
        return malloc(1);
}
qed foo
qed bar
qed baz

baz invokes foo abstractly (in its abstract) rather than actually (in its body), whereas bar invokes foo actually but not abstractly.

Although these examples may seem simple, they already show one of the most important principles behind Xr0. It empowers and forces the programmer to segment his program into functions that have well-defined safety semantics. This segmentation has two benefits:

  1. It enables independent verification of the functions

  2. It produces a program whose safety semantics are propagated through the levels of its call stack.

The first criterion makes Xr0 feasible, the second makes it useful. The second means that one can change parts of a large program without fearing it will undermine the safety of the whole, and should enable a kind of fearless collaboration that is not currently possible in C.

Avoiding double frees with setups

Abstracts can do much more than describe the return value of a function.

For example, when working with composite objects, it becomes convenient to write constructor/destructor pairs:

#include <stdlib.h>

struct point { int x; int y; };

struct point * 
mkpoint(int x, int y)
{
        struct point *p;

        p = malloc(sizeof(struct point));
        p->x = x;
        p->y = y;
        return p;
}

void
delpoint(struct point *p)
{
        free(p);
}

main()
{
        struct point *p;

        p = mkpoint(0, 0);
        /* do other things with p */
        delpoint(p);
}

Predictably, Xr0 complains about mkpoint:

mkpoint: actual and abstract states differ

and we fix this by giving it an abstract:

struct point * 
mkpoint(int x, int y) ~ [ return malloc(sizeof(struct point)); ]
{
        struct point *p;

        p = malloc(sizeof(struct point));
        p->x = x;
        p->y = y;
        return p;
}

This abstract is correct because Xr0 enforces semantic equivalence only insofar as it relates to safety. The assignments to p->x and p->y do not affect the safety semantics at callers of mkpoint. Thus mkpoint will verify, but attention now shifts to the destructor:

#include <stdlib.h>

struct point { int x; int y; };

struct point * 
mkpoint(int x, int y) ~ [ return malloc(sizeof(struct point)); ]
{
        struct point *p;

        p = malloc(sizeof(struct point));
        p->x = x;
        p->y = y;
        return p;
}

void
delpoint(struct point *p)
{
        free(p);
}

main()
{
        struct point *p;

        p = mkpoint(0, 0);
        /* do other things with p */
        delpoint(p);
}
qed mkpoint
delpoint: cannot exec statement: undefined free of value not pointing at heap

On the surface, it may seem like Xr0 is confused, because this is perfectly safe code. delpoint is only called once, in main, and at that moment its argument is indeed pointing at the heap. What’s going on?

Once again we must bear in mind that Xr0 insists on verifying each function separately. Thus when it looks at delpoint, it cannot see that it is called only once and safely so. Instead, treating delpoint as a self-contained unit of verification, it recognises correctly that its parameter p may not be pointing at the heap. We have to annotate delpoint to indicate this restriction on p. As we said in the previous section, we have to express in a function’s abstract everything its callers need to know to use it safely, we have to express the function’s safety semantics. We can express the need for p to point at the heap as follows:

void
delpoint(struct point *p) ~ [
        setup: p = malloc(sizeof(struct point));
]{
        free(p);
}

The setup label demarcates its adjoined statement as the setup of the function. Xr0 will use this statement as a requirement or precondition wherever delpoint is called. If the requirement imposed by the setup is unsatisfied, it will throw an error.

But how do we understand the requirement itself, what does

p = malloc(sizeof(struct point));

mean here? It means that delpoint can be safely used only upon the assumption that its parameter was initialised thusly, or with something semantically equivalent. For example, because as we noted the values of the point are not part of the safety semantics of mkpoint, we could just as easily use a setup like this:

p = mkpoint(0, 0);

or, for that matter,

p = mkpoint(-5, 12);

because these are all semantically equivalent by virtue of mkpoint’s abstract.

Now we’ve accounted for delpoint’s requirement that its parameter be the address of a heap-allocated region, but are we done? Let’s give the function to the verifier:

#include <stdlib.h>

void
delpoint(struct point *p) ~ [
        setup: p = malloc(sizeof(struct point));
]{
        free(p);
}
delpoint: actual and abstract states differ

Because our setup is correct, Xr0 no longer raises an objection from the call to free; the problem now is delpoint’s side effect of deallocating the memory pointed to by p has not been stated. We must include this side effect to satisfy the verifier:

#include <stdlib.h>

void
delpoint(struct point *p) ~ [
        setup: p = malloc(sizeof(struct point));
        free(p);
]{
        free(p);
}
qed delpoint

Splendid. Now look at what happens now if we call delpoint invalidly:

#include <stdlib.h>

struct point { int x; int y; };

struct point * 
mkpoint(int x, int y) ~ [ return malloc(sizeof(struct point)); ]
{
        struct point *p;

        p = malloc(sizeof(struct point));
        p->x = x;
        p->y = y;
        return p;
}

void
delpoint(struct point *p) ~ [
        setup: p = malloc(sizeof(struct point));
        free(p);
]{
        free(p);
}

badfoo()
{
        struct point p;

        p.x = 3;
        p.y = 4;
        delpoint(&p);
}
qed mkpoint
qed delpoint
badfoo: cannot exec statement: parameter p of delpoint must be heap allocated

Or consider this double-free:

#include <stdlib.h>

struct point { int x; int y; };

struct point * 
mkpoint(int x, int y) ~ [ return malloc(sizeof(struct point)); ]
{
        struct point *p;

        p = malloc(sizeof(struct point));
        p->x = x;
        p->y = y;
        return p;
}

void
delpoint(struct point *p) ~ [
        setup: p = malloc(sizeof(struct point));
        free(p);
]{
        free(p);
}

badbar()
{
        struct point *p;

        p = mkpoint(0, 0);
        free(p);
        delpoint(p);
}
qed mkpoint
qed delpoint
badbar: cannot exec statement: parameter p of delpoint must be lvalue

Thus we’re protected against all possible classes of invalid frees, because Xr0 requires that we know in context the pointer supplied to free has a valid heap address.

Returning to the example and putting things together:

#include <stdlib.h>

struct point { int x; int y; };

struct point * 
mkpoint(int x, int y) ~ [ return malloc(sizeof(struct point)); ]
{
        struct point *p;

        p = malloc(sizeof(struct point));
        p->x = x;
        p->y = y;
        return p;
}

void
delpoint(struct point *p) ~ [
        setup: p = malloc(sizeof(struct point));
        free(p);
]{
        free(p);
}

main()
{
        struct point *p;

        p = mkpoint(0, 0);
        /* do other things with p */
        delpoint(p);
}
qed mkpoint
qed delpoint
qed main

Initialisation of memory and the clump

One of C’s strengths is it never does anything without being told to do so, just like a computer. But because we are not computers, their unhesitating obedience often surprises us, because they do what we say instead of what we mean. Thus C becomes, to borrow the common metaphor, like a gun without a safety, good for killing friends and foes alike. A good instance of this is how C handles initialisation of memory.

In the absence of explicit initialisation, external and static variables are initialised to zero, but automatic variables are not initialised and neither is the memory returned from calls to malloc. If we run

#include <stdio.h>

int
main()
{
        int i;
        return i;
}

the behaviour is undefined because i’s value is indeterminate.

This rule allows the programmer to decide whether he wants to pay the performance penalty of initialisation, and to make this decision on a case-by-case basis. Even more importantly, it gives the programmer complete control over the operation of the machine. But it also allows him to forget to initialise memory, which he inevitably does, to his own hurt or the hurt of users of his program. The gun shoots faster, whether intentionally or unintentionally, killing friend and foe and even oneself.

Xr0 offers a way for us to maintain both power and safety. For example, if we give it code containing this error it notices immediately:

#include <stdio.h>

int
main()
{
        int i;
        return i;
}
main: cannot exec statement: undefined memory access: i has no value

This works even if we give Xr0 something more devious:

#include <stdio.h>

int
main()
{
        int i; int *p;

        p = &i;
        return *p;
}
main: cannot exec statement: undefined indirection: *(p+0) has no value

Thus the programmer is protected against using an uninitialised value, but not against having one if he deems it convenient. Xr0 doesn’t stop him from initialising:

#include <stdio.h>

int
main()
{
        int i;

        i = 0;
        return i;
}
qed main

This holds true even with more roundabout initialisations:

#include <stdio.h>

int
main()
{
        int i; int *p;

        p = &i;
        *p = 0;
        return i + *p;
}
qed main

So far, so good, but this remains unimpressive because it all takes place within a function, and as we said at the beginning, errors within a function are easy to spot. What if we want to initialise a value using a function? The following is perfectly valid C:

#include <stdio.h>

init(int *p, int val)
{
        *p = val;
}

int
main()
{
        int i;
        init(&i, 0);
        return i;
}

But as before, Xr0 takes issue with it:

init: cannot exec statement: undefined indirection: *(p) is not an lvalue

Again, Xr0 is treating the function as self-contained. It cannot see that init is only called by main, and main gives it a pointer that is pointing at a valid location. So it throws the error because it cannot know that every possible invocation of init will supply a value of p that can be dereferenced. So we have to indicate this requirement of init’s to Xr0.

The way we indicate this is somewhat surprising, but becomes intuitive with use.

init(int *p, int val) ~ [
        setup: p = .clump(sizeof(int));
        *p = val;
]{
        *p = val;
}

This is quite the mouthful. The second line in the abstract is the least shocking part here: it simply indicates that init’s side effect is to assign val to *p. Now for the strange part, the setup statement:

p = .clump(sizeof(int));

This line should look somewhat familiar to C programmers, as a close cousin of

p = malloc(sizeof(int));

Indeed, the meaning of .clump is similar to malloc, but also subtly different.

The first thing to clarify is that the dot ‘.’ at the beginning of .clump is not an operator. It is simply there to disambiguate between .clump and the potentially user-defined identifier clump, which could be, for example, the name of one of the parameters. Xr0 uses this disambiguating dot in front of all its special keywords, as we shall see. (These keywords do not augment C and are only permitted in annotations. Xr0 is not the new C++.)

The .clump built-in function tells Xr0 to allocate memory, like malloc, of the size of its argument. However, .clump does not allocate memory at runtime, because Xr0 is a compile-time tool. Rather, .clump allocates memory hypothetically and abstractly, as Xr0 analyses the function in which it’s found. Finally, this memory is not thought of as being necessarily on the heap, or in any other given location, but rather, as somewhere. In other words, the memory is simply assumed to exist. The command

setup: p = .clump(sizeof(int));

thus expresses the assumption that the first parameter of init is a pointer to a valid region of memory, whether static or automatic or heap-allocated. When Xr0 is verifying init’s internal consistency it will hypothetically allocate the memory; when init is called from another function Xr0 will insist that the first argument be a valid pointer.

Let’s look at some more code in order to understand these distinctions.

#include <stdio.h>

init(int *p, int val) ~ [
        setup: p = .clump(sizeof(int));
        *p = val;
]{
        *p = val;
}

int
main()
{
        int i;
        init(&i, 0);
        return i;
}
qed init
qed main

Xr0 is now able to appreciate that init has initialised i, as well as that init has been used safely, because the &i passed to init is a pointer to valid memory somewhere, to valid memory “on the clump”.

Similarly, if instead we pass in a pointer to a heap-allocated region of memory, Xr0 raises no objection:

#include <stdlib.h>

init(int *p, int val) ~ [
        setup: p = .clump(sizeof(int));
        *p = val;
]{
        *p = val;
}

int
main()
{
        int i; int *p;

        p = malloc(sizeof(int));
        init(p, 0);
        i = *p;
        free(p);
        return i;
}
qed init
qed main

Xr0 can see that the p supplied to init is pointing to valid memory on the clump, because it is pointing at valid memory somewhere.

On the other hand, if we give init something that is not on the clump, Xr0 will detect the error:

#include <stdlib.h>

init(int *p, int val) ~ [
        setup: p = .clump(sizeof(int));
        *p = val;
]{
        *p = val;
}

int
main()
{
        int *p;

        p = NULL;
        init(p, 0);
        return *p;
}
qed init
main: cannot exec statement: parameter p of init must be lvalue

Thus Xr0 is able to give the programmer assurances of complete safety, without asking him to sacrifice an iota of his power. The control over initialisation remains as it always has in C, but the possibility of an accidental use of uninitialised memory has been completely eliminated.

Limitations

Due to its early stage of development, Xr0 currently has quite a few limitations. Most of the limitations are a product of our current focus on getting the project scaffolding right, but a few reflect things we’re still working through conceptually, like exactly what kind of invariant or similar mechanism will be right for loops and recursion.

These lists are obviously not comprehensive.

Syntax

Several productions in the C89 grammar are commented out of our yacc file. This means that we currently throw syntax errors for some valid C. When you run into a syntax error, it could very well be that we’re not yet supporting the feature in question.

We currently do not support the following:

Type checking

We are not performing any of the type checking that the Standard requires. This is a solved problem and will be introduced later; for now we defer this responsibility to the user’s chosen C compiler.

Loops and recursion

We do not verify for-loops, but rather, take their annotated effects axiomatically.

We also do not support recursion, or recursion on functions that have side effects requiring annotations.

Buffers

We currently model all buffers as infinitely-sized. We will be adding out-of-bounds checking soon.

printf

The printf family of functions are currently not enabled because we haven’t implemented variadic arguments.

Bugs


  1. This holds true even if the allocation fails, because according to the Specification a null pointer must be returned, and freeing a null pointer is a well-defined operation for which “no action occurs”. That being said, one of Xr0’s current limitations is it doesn’t account for failures to allocate, but we’ve chosen against diving into that at this stage of the introduction. ↩︎