Xr0 is a new verifier for the C programming language that eliminates many classes of undefined behaviour, such as use-after-frees (of which double frees are a special case), dangling and null pointer dereferences, memory leaks and the use of uninitialised memory. Most importantly, we believe Xr0 is on a clear trajectory towards enabling programming in C with no undefined behaviour whatever.
Xr0 works for a sizeable subset of ANSI C (C89) currently. As a work in progress, its most significant limitation is that we have not yet implemented verification for loops and recursion, so these are bridged by axiomatic statements supplied by the user. As an example for the kinds of programs it can handle, see the tests in the repository.
This is the programmer’s introduction to Xr0. We will assume that you are familiar with C and focus on explaining how you can use Xr0 to remove safety vulnerabilities from your code.
Like Xr0, this tutorial is a work in progress, so if you don’t understand something, assume that the error is in the tutorial (or Xr0). We’ll be on the Xr0 Zulip regularly to answer any questions.
The first thing to do is clone the repository, hosted here and here. To clone from GitHub:
$ git clone https://github.com/xr0-org/xr0 && cd xr0
Next we want to build Xr0:
$ make
If all goes well, the binary should be present at ./bin/0v
.
To confirm that everything is OK, run the tests:
$ make test
If all the tests pass, you should be good to use 0v
.
The rest of the tutorial assumes 0v
is available in the PATH
environment
variable, which you may do by symlinking or moving it, as you deem appropriate.
For example:
$ ln -s /path/to/xr0/bin/0v /usr/local/bin/0v
Xr0 relies on libx, our annotated version of the C Standard library.
Create a file named hello.x
with the following contents:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define GREETING "hello, world!"
char *
greeting() ~ [ return malloc(strlen(GREETING) + 1); ]
{
char *s;
s = malloc(strlen(GREETING) + 1);
strcpy(s, GREETING);
return s;
}
main()
{
char *s;
s = greeting();
puts(s);
free(s);
}
The rest of the tutorial will explain what the annotation to greeting
above
means.
Run the following command to verify the file (replacing /path/to/xr0
with the
path to the Xr0 project folder):
$ 0v -I /path/to/xr0/libx hello.x
You should get no output, indicating successful verification of the whole file. However, if you try to verify it without specifying the libx folder,
$ 0v hello.x
you will get an error like the following:
hello.x:1:10: fatal error: 'stdio.h' file not found
#include <stdio.h>
^~~~~~~~~
1 error generated.
function: `malloc' not found
To use 0v
in this way we have to set the environment variable XR0_INCLUDES
, so
you could include a line like
export XR0_INCLUDES=/path/to/xr0/libx
inside your bashrc or equivalent, and then you’ll be able to run Xr0 like this:
$ 0v hello.x
Apart from the annotations enclosed with ~[ ]
, Xr0 files contain pure C.
Thus to compile files we first strip these sections and then use regular C
compilers.
The strip -s
option produces a file named 0.c
accordingly:
$ 0v -s hello.x
We can compile and execute the file:
$ cc 0.c && ./a.out
hello, world!
To specify the output file, use -o
, as in
$ 0v -so hello.c hello.x
$ cc hello.c -o hello
$ ./hello
hello, world!
To get verbose output use the -v
flag:
$ 0v -v hello.x
qed greeting
qed main
A C program consists of one or more functions…
– Brian Kernighan, Programming in C – A Tutorial.
Xr0 verifies a C program by verifying its functions one-by-one. This simple fact has many implications.
Within a function, the errors are often easy to detect.
This is particularly the case with functions that only rely on the primitive
features of C and its standard library, functions that do not call other
user-defined functions.
badfoo
below is one such function, and if we give it to Xr0 it immediately
complains about the leak:
#include <stdlib.h>
badfoo()
{
malloc(1);
}
badfoo: garbage on heap
We get the same error if we write
#include <stdlib.h>
badfoo()
{
void *p;
p = malloc(1);
p = NULL;
}
or even
#include <stdlib.h>
badfoo()
{
void *p;
p = malloc(1);
p = malloc(1);
free(p);
}
The error means that at the end of badfoo
’s body, there is some
memory on the heap that is unreachable, garbage as it is called.
We can fix this by freeing the memory:
#include <stdlib.h>
foo()
{
free(malloc(1));
}
Now Xr0 happily verifies the function, saying nothing by default because “silence is golden”. However, if we are in verbose mode Xr0 will output for us
qed foo
to indicate that foo
is verified. We will often show verbose output in this
introduction for clarity.
Xr0 has no problem with us freeing the memory in other valid ways:
#include <stdlib.h>
foo()
{
void *p;
p = malloc(1);
free(p);
}
qed foo
We can even double (or triple) up, provided we clean up after ourselves:
#include <stdlib.h>
foo()
{
void *p;
p = malloc(1);
free(p);
p = malloc(1);
free(p);
}
qed foo
We say that foo
is trivially safe because there is no way to call it that is
unsafe.1
Thus Xr0 is able to verify it independently; it doesn’t have to look at other
functions.
The other way to fix the error is to return a pointer to the memory:
#include <stdlib.h>
void *
foo()
{
return malloc(1);
}
But Xr0 has a problem with this:
foo: actual and abstract states differ
What’s going on?
Remember that Xr0 verifies functions one-by-one.
Although foo
is a perfectly reasonable function that could be used safely,
e.g.
bar()
{
void *p = foo();
free(p);
}
it is also possible to use it unsafely:
bar()
{
foo(); /* leak */
}
Thus foo
is not trivially safe, and
Xr0 cannot, by looking at foo
as it is, verify whether or not it will be
used safely.
Whenever a function is not trivially safe, Xr0 requires that we annotate it,
specifying how it may be used safely.
Here foo
has a side effect (allocation of memory on the heap) that its callers
need to handle, each either by freeing the returned memory, or by again passing
the buck to the next function in the call stack.
This is annotated as follows:
#include <stdlib.h>
void *
foo() ~ [ return malloc(1); ]
{
return malloc(1);
}
qed foo
The annotation inside the square brackets [ ]
is called the abstract
of the function.
It is similar to what is called a contract in some other systems.
(Obviously, this annotation syntax is not supported by most C compilers, so Xr0
is bundled with facilities for stripping the ~
[
]
regions.
See the section on compiling above for the details.)
An abstract expresses everything that callers need to know to safely use the
function.
These concerns that callers must handle are called the safety semantics of the
function.
foo
’s abstract here is simple, it says (to functions that will use foo
)
“foo
returns a reference to a heap-allocated region of memory, so you better
deal with it”.
Thus Xr0 has no objection if we change the body while maintaining consistency
with its abstract:
#include <stdlib.h>
void *
foo() ~ [ return malloc(1); ]
{
void *p;
p = malloc(1);
return p;
}
qed foo
We can even write
#include <stdlib.h>
void *
foo() ~ [ return malloc(1); ]
{
void *p; void *q;
p = malloc(1);
q = malloc(2);
free(q);
return p;
}
qed foo
Above the region whose address is stored in q
and then freed has no net effect
on the behaviour of foo
, so it doesn’t influence its safety
semantics and doesn’t need to be in the abstract.
However, if we change the body such that the function no longer returns a reference to a region on the heap, Xr0 will not let us maintain the same abstract:
#include <stdlib.h>
void *
badfoo() ~ [ return malloc(1); ]
{
void *p;
p = malloc(1);
free(p);
return p;
}
badfoo: actual and abstract states differ
All that matters to Xr0 is that the abstract capture the safety semantics of the function’s body. We call this correspondence internal consistency, because it means that the function can be treated as a self-contained unit of verification. Thus Xr0 can verify an internally consistent function independently of the rest of the program.
Let’s go back to the foo
that returns a pointer to a region on the heap.
Observe that if we call it we have to be just as careful as we were when calling
malloc
:
#include <stdlib.h>
void *
foo() ~ [ return malloc(1); ]
{
return malloc(1);
}
bar()
{
foo();
}
qed foo
bar: garbage on heap
This is because Xr0 is propagating foo
’s safety semantics to bar
, using
foo
’s abstract.
Again, we can resolve this by freeing the memory:
#include <stdlib.h>
void *
foo() ~ [ return malloc(1); ]
{
return malloc(1);
}
bar()
{
free(foo());
}
qed foo
qed bar
If instead we want bar
to return the memory, we must again say so in its
abstract:
#include <stdlib.h>
void *
foo() ~ [ return malloc(1); ]
{
return malloc(1);
}
void *
bar() ~ [ return foo(); ]
{
return foo();
}
qed foo
qed bar
This abstract stresses bar
’s dependence on foo
.
However, Xr0 also allows us to annotate bar
in this way:
#include <stdlib.h>
void *
foo() ~ [ return malloc(1); ]
{
return malloc(1);
}
void *
bar() ~ [ return malloc(1); ]
{
return foo();
}
qed foo
qed bar
This is simply transitivity: because bar
’s body is equivalent to foo
’s, its
abstract can be also.
In either case, the concerns in foo
’s body are propagated through its abstract
into bar
, and bar
in turn must either handle these concerns or propagate
them through its own abstract.
On the same principles of equivalence, we can use foo
in an abstract alone, as
in baz
below:
#include <stdlib.h>
void *
foo() ~ [ return malloc(1); ]
{
return malloc(1);
}
void *
bar() ~ [ return malloc(1); ]
{
return foo();
}
void *
baz() ~ [ return foo(); ]
{
return malloc(1);
}
qed foo
qed bar
qed baz
baz
invokes foo
abstractly (in its abstract) rather than actually (in
its body), whereas bar
invokes foo
actually but not abstractly.
Although these examples may seem simple, they already show one of the most important principles behind Xr0. It empowers and forces the programmer to segment his program into functions that have well-defined safety semantics. This segmentation has two benefits:
It enables independent verification of the functions
It produces a program whose safety semantics are propagated through the levels of its call stack.
The first criterion makes Xr0 feasible, the second makes it useful. The second means that one can change parts of a large program without fearing it will undermine the safety of the whole, and should enable a kind of fearless collaboration that is not currently possible in C.
Abstracts can do much more than describe the return value of a function.
For example, when working with composite objects, it becomes convenient to write constructor/destructor pairs:
#include <stdlib.h>
struct point { int x; int y; };
struct point *
mkpoint(int x, int y)
{
struct point *p;
p = malloc(sizeof(struct point));
p->x = x;
p->y = y;
return p;
}
void
delpoint(struct point *p)
{
free(p);
}
main()
{
struct point *p;
p = mkpoint(0, 0);
/* do other things with p */
delpoint(p);
}
Predictably, Xr0 complains about mkpoint
:
mkpoint: actual and abstract states differ
and we fix this by giving it an abstract:
struct point *
mkpoint(int x, int y) ~ [ return malloc(sizeof(struct point)); ]
{
struct point *p;
p = malloc(sizeof(struct point));
p->x = x;
p->y = y;
return p;
}
This abstract is correct because Xr0 enforces semantic equivalence
only insofar as it relates to safety.
The assignments to p->x
and p->y
do not affect the safety semantics at
callers of mkpoint
. Thus mkpoint
will verify, but attention now shifts to
the destructor:
#include <stdlib.h>
struct point { int x; int y; };
struct point *
mkpoint(int x, int y) ~ [ return malloc(sizeof(struct point)); ]
{
struct point *p;
p = malloc(sizeof(struct point));
p->x = x;
p->y = y;
return p;
}
void
delpoint(struct point *p)
{
free(p);
}
main()
{
struct point *p;
p = mkpoint(0, 0);
/* do other things with p */
delpoint(p);
}
qed mkpoint
delpoint: cannot exec statement: undefined free of value not pointing at heap
On the surface, it may seem like Xr0 is confused, because this is perfectly safe
code.
delpoint
is only called once, in main
, and at that moment its argument is
indeed pointing at the heap.
What’s going on?
Once again we must bear in mind that Xr0 insists on verifying each function
separately.
Thus when it looks at delpoint
, it cannot see that it is called only once and
safely so.
Instead, treating delpoint
as a self-contained unit of verification, it
recognises correctly that its parameter p
may not be pointing at the heap.
We have to annotate delpoint
to indicate this restriction on p
.
As we said in the previous section, we have to express in a function’s abstract
everything its callers need to know to use it safely, we have to express the
function’s safety semantics.
We can express the need for p
to point at the heap as follows:
void
delpoint(struct point *p) ~ [
setup: p = malloc(sizeof(struct point));
]{
free(p);
}
The setup
label demarcates its adjoined statement as the setup of the
function.
Xr0 will use this statement as a requirement or precondition wherever delpoint
is called.
If the requirement imposed by the setup is unsatisfied, it will throw an error.
But how do we understand the requirement itself, what does
p = malloc(sizeof(struct point));
mean here?
It means that delpoint
can be safely used only upon the assumption that its
parameter was initialised thusly, or with something semantically equivalent.
For example, because as we noted the values of the point are not part of the
safety semantics of mkpoint
, we could just as easily use a setup like
this:
p = mkpoint(0, 0);
or, for that matter,
p = mkpoint(-5, 12);
because these are all semantically equivalent by virtue of mkpoint
’s abstract.
Now we’ve accounted for delpoint
’s requirement that its parameter be the
address of a heap-allocated region, but are we done?
Let’s give the function to the verifier:
#include <stdlib.h>
void
delpoint(struct point *p) ~ [
setup: p = malloc(sizeof(struct point));
]{
free(p);
}
delpoint: actual and abstract states differ
Because our setup is correct, Xr0 no longer raises an objection from the call to
free
; the problem now is delpoint
’s side effect of deallocating the memory
pointed to by p
has not been stated.
We must include this side effect to satisfy the verifier:
#include <stdlib.h>
void
delpoint(struct point *p) ~ [
setup: p = malloc(sizeof(struct point));
free(p);
]{
free(p);
}
qed delpoint
Splendid.
Now look at what happens now if we call delpoint
invalidly:
#include <stdlib.h>
struct point { int x; int y; };
struct point *
mkpoint(int x, int y) ~ [ return malloc(sizeof(struct point)); ]
{
struct point *p;
p = malloc(sizeof(struct point));
p->x = x;
p->y = y;
return p;
}
void
delpoint(struct point *p) ~ [
setup: p = malloc(sizeof(struct point));
free(p);
]{
free(p);
}
badfoo()
{
struct point p;
p.x = 3;
p.y = 4;
delpoint(&p);
}
qed mkpoint
qed delpoint
badfoo: cannot exec statement: parameter p of delpoint must be heap allocated
Or consider this double-free:
#include <stdlib.h>
struct point { int x; int y; };
struct point *
mkpoint(int x, int y) ~ [ return malloc(sizeof(struct point)); ]
{
struct point *p;
p = malloc(sizeof(struct point));
p->x = x;
p->y = y;
return p;
}
void
delpoint(struct point *p) ~ [
setup: p = malloc(sizeof(struct point));
free(p);
]{
free(p);
}
badbar()
{
struct point *p;
p = mkpoint(0, 0);
free(p);
delpoint(p);
}
qed mkpoint
qed delpoint
badbar: cannot exec statement: parameter p of delpoint must be lvalue
Thus we’re protected against all possible classes of invalid frees, because
Xr0 requires that we know in context the pointer supplied to free
has a valid
heap address.
Returning to the example and putting things together:
#include <stdlib.h>
struct point { int x; int y; };
struct point *
mkpoint(int x, int y) ~ [ return malloc(sizeof(struct point)); ]
{
struct point *p;
p = malloc(sizeof(struct point));
p->x = x;
p->y = y;
return p;
}
void
delpoint(struct point *p) ~ [
setup: p = malloc(sizeof(struct point));
free(p);
]{
free(p);
}
main()
{
struct point *p;
p = mkpoint(0, 0);
/* do other things with p */
delpoint(p);
}
qed mkpoint
qed delpoint
qed main
One of C’s strengths is it never does anything without being told to do so, just like a computer. But because we are not computers, their unhesitating obedience often surprises us, because they do what we say instead of what we mean. Thus C becomes, to borrow the common metaphor, like a gun without a safety, good for killing friends and foes alike. A good instance of this is how C handles initialisation of memory.
In the absence of explicit initialisation, external and static variables are
initialised to zero, but automatic variables are not initialised and neither is
the memory returned from calls to malloc
. If we run
#include <stdio.h>
int
main()
{
int i;
return i;
}
the behaviour is undefined because i
’s value is indeterminate.
This rule allows the programmer to decide whether he wants to pay the performance penalty of initialisation, and to make this decision on a case-by-case basis. Even more importantly, it gives the programmer complete control over the operation of the machine. But it also allows him to forget to initialise memory, which he inevitably does, to his own hurt or the hurt of users of his program. The gun shoots faster, whether intentionally or unintentionally, killing friend and foe and even oneself.
Xr0 offers a way for us to maintain both power and safety. For example, if we give it code containing this error it notices immediately:
#include <stdio.h>
int
main()
{
int i;
return i;
}
main: cannot exec statement: undefined memory access: i has no value
This works even if we give Xr0 something more devious:
#include <stdio.h>
int
main()
{
int i; int *p;
p = &i;
return *p;
}
main: cannot exec statement: undefined indirection: *(p+0) has no value
Thus the programmer is protected against using an uninitialised value, but not against having one if he deems it convenient. Xr0 doesn’t stop him from initialising:
#include <stdio.h>
int
main()
{
int i;
i = 0;
return i;
}
qed main
This holds true even with more roundabout initialisations:
#include <stdio.h>
int
main()
{
int i; int *p;
p = &i;
*p = 0;
return i + *p;
}
qed main
So far, so good, but this remains unimpressive because it all takes place within a function, and as we said at the beginning, errors within a function are easy to spot. What if we want to initialise a value using a function? The following is perfectly valid C:
#include <stdio.h>
init(int *p, int val)
{
*p = val;
}
int
main()
{
int i;
init(&i, 0);
return i;
}
But as before, Xr0 takes issue with it:
init: cannot exec statement: undefined indirection: *(p) is not an lvalue
Again, Xr0 is treating the function as self-contained.
It cannot see that init
is only called by main
, and main
gives it a
pointer that is pointing at a valid location.
So it throws the error because it cannot know that every possible invocation of
init
will supply a value of p
that can be dereferenced.
So we have to indicate this requirement of init
’s to Xr0.
The way we indicate this is somewhat surprising, but becomes intuitive with use.
init(int *p, int val) ~ [
setup: p = .clump(sizeof(int));
*p = val;
]{
*p = val;
}
This is quite the mouthful.
The second line in the abstract is the least shocking part here: it simply
indicates that init
’s side effect is to assign val
to *p
.
Now for the strange part, the setup statement:
p = .clump(sizeof(int));
This line should look somewhat familiar to C programmers, as a close cousin of
p = malloc(sizeof(int));
Indeed, the meaning of .clump
is similar to malloc
, but also subtly
different.
The first thing to clarify is that the dot ‘.
’ at the beginning of .clump
is
not an operator.
It is simply there to disambiguate between .clump
and the potentially
user-defined identifier clump
, which could be, for example, the name of one of
the parameters.
Xr0 uses this disambiguating dot in front of all its special keywords, as we
shall see. (These keywords do not augment C and are only permitted in
annotations. Xr0 is not the new C++.)
The .clump
built-in function tells Xr0 to allocate memory, like malloc
, of
the size of its argument.
However, .clump
does not allocate memory at runtime, because Xr0 is a
compile-time tool.
Rather, .clump
allocates memory hypothetically and abstractly, as Xr0
analyses the function in which it’s found.
Finally, this memory is not thought of as being necessarily on the heap, or
in any other given location, but rather, as somewhere.
In other words, the memory is simply assumed to exist.
The command
setup: p = .clump(sizeof(int));
thus expresses the assumption that the first parameter of init
is a pointer to
a valid region of memory, whether static or automatic or heap-allocated.
When Xr0 is verifying init
’s internal consistency it will hypothetically
allocate the memory; when init
is called from another function Xr0 will
insist that the first argument be a valid pointer.
Let’s look at some more code in order to understand these distinctions.
#include <stdio.h>
init(int *p, int val) ~ [
setup: p = .clump(sizeof(int));
*p = val;
]{
*p = val;
}
int
main()
{
int i;
init(&i, 0);
return i;
}
qed init
qed main
Xr0 is now able to appreciate that init
has initialised i
, as well as that
init
has been used safely, because the &i
passed to init
is a pointer to
valid memory somewhere, to valid memory “on the clump”.
Similarly, if instead we pass in a pointer to a heap-allocated region of memory, Xr0 raises no objection:
#include <stdlib.h>
init(int *p, int val) ~ [
setup: p = .clump(sizeof(int));
*p = val;
]{
*p = val;
}
int
main()
{
int i; int *p;
p = malloc(sizeof(int));
init(p, 0);
i = *p;
free(p);
return i;
}
qed init
qed main
Xr0 can see that the p
supplied to init
is pointing to valid memory on the
clump, because it is pointing at valid memory somewhere.
On the other hand, if we give init
something that is not on the clump, Xr0
will detect the error:
#include <stdlib.h>
init(int *p, int val) ~ [
setup: p = .clump(sizeof(int));
*p = val;
]{
*p = val;
}
int
main()
{
int *p;
p = NULL;
init(p, 0);
return *p;
}
qed init
main: cannot exec statement: parameter p of init must be lvalue
Thus Xr0 is able to give the programmer assurances of complete safety, without asking him to sacrifice an iota of his power. The control over initialisation remains as it always has in C, but the possibility of an accidental use of uninitialised memory has been completely eliminated.
Due to its early stage of development, Xr0 currently has quite a few limitations. Most of the limitations are a product of our current focus on getting the project scaffolding right, but a few reflect things we’re still working through conceptually, like exactly what kind of invariant or similar mechanism will be right for loops and recursion.
These lists are obviously not comprehensive.
Several productions in the C89 grammar are commented out of our yacc file. This means that we currently throw syntax errors for some valid C. When you run into a syntax error, it could very well be that we’re not yet supporting the feature in question.
We currently do not support the following:
switch
while
and do-while
loops
goto
Declarations with initialisers, so you have to write
func()
{
int i;
i = 0;
}
instead of
func()
{
int i = 0;
}
Global variables
We are not performing any of the type checking that the Standard requires. This is a solved problem and will be introduced later; for now we defer this responsibility to the user’s chosen C compiler.
We do not verify for
-loops, but rather, take their annotated effects
axiomatically.
We also do not support recursion, or recursion on functions that have side effects requiring annotations.
We currently model all buffers as infinitely-sized. We will be adding out-of-bounds checking soon.
printf
The printf
family of functions are currently not enabled because we haven’t
implemented variadic arguments.
foo
’s abstract references bar
, but bar
’s abstract
references baz
also, there may be some strange behaviour.This holds true even if the allocation fails, because according to the Specification a null pointer must be returned, and freeing a null pointer is a well-defined operation for which “no action occurs”. That being said, one of Xr0’s current limitations is it doesn’t account for failures to allocate, but we’ve chosen against diving into that at this stage of the introduction. ↩︎