Splint the static C code checker

3 05 2009

Splint is this great tool to statically analyze C code. Splint is an evolution of Lint. Lint makes analysis of C code and verifies unused declarations, type inconsistencies, use before definitions, unreachable code, ignored return values, execution path with no return and infinite loops. However, Lint was not a sufficiently powerful tool. Splint was created as a descendant of Lint allowing annotations in C code, so, it do more checks than its predecessor. In this post I will present some verifications that Splint does, and show the philosophy and how to use it. Splint allows annotations on functions, variables, parameters and types.

Undefined Variables

Splint detects instances where the value of a variable is used before it is defined. Anyway, many static analyzers can also do this, it is a very basic check. However, thanks to the splint annotations supports, they can be used to describe what storage must be defined and what storage must be undefined at interface points, this means that all storage reachable from a global variable, parameter to a function, or function return value is defined before and after the function call.

A special case of undefined variables is undefined function parameters. Sometimes we do a certain C function that return values, or change parameters values. The out annotation denotes a pointer to storage that may be undefined. Splint have a great storage model, so does not report an error when a pointer to allocated but undefined storage is passed as an out parameter. If in the body of a function an outparameter is allocated but not bounded to a value Splint reports an error. You can see out as an parameter that will be defined inside a function. The opposite happens to the in annotation, this can be used to denote a parameter that must be completely defined.
Here is a small example of it:

extern void setVal (/*@out@*/ int *x);
extern int getVal (/*@in@*/ int *x);
extern int mysteryVal(int *x);

int dumbfunc(/*@out@*/ int *x, int i) {
    if (i > 3)
11         return *x;
    else if (i > 1)
13         return getVal (x);
    else if (i == 0)
15         return mysteryVal (x);
    else {
18         setVal (x);
19         return *x;
    }
}

And here it is the output of Splint:

> splint usedef.c
usedef.c:11: Value *x used before definition
usedef.c:13: Passed storage x not completely defined
    (*x is undefined): getVal (x)
usedef.c:15: Passed storage x not completely defined
    (*x is undefined): mysteryVal (x)
Finished checking --- 3 code warnings

As you can see, no error is reported for line 18, because the incomplete defined storage x is passed to a function that will define it, so we can return *x with no problems.

Types

Strong type checking often reveals programming errors. Splint can check primitive C types more strictly and flexibly than typical compilers.

C, does not make types verification. Because for C all basic non numerical types are just int’s (Enum, Pointer, Char). Splint supports stricter checking of built-in C types. The char and enum types can be checked as distinct types, and the different numeric types can be type-checked strictly. Also the primitive char type can be type-checked as a distinct type. If char is used as a distinct type, common errors involving assigning ints to chars are detected.

In Splint, you can always turn off some default verifications, and Splint will assume the C compiler strategy to verify the code. If you run Splint with +charint is on, char types are indistinguishable from ints. So you can make cast’s from int to char or from cahr to int. But never forgot that this can lead to errors, as you can imagine if you do (char) 1000 you will not get an expected result because the characters only go up to 256.

Splint reports errors in the use of Enums, if a value that is not an enumerator member is assigned to the enum type or if an enum type is used as an operand to an arithmetic operator. Anyway you can turn this off, activating the enumint flag and then you can use enums and ints types interchangeably.

Memory Management

About half the bugs in typical C programs can be attributed to memory management problems.

Any C programmer who has already made a reasonable size program, has been confronted with unexpected errors related with memory management problems. This happens basically because C is a low level language so, does not have any system like Java’s garbage collector. All the memory management have to be done by the programmer, so errors may append. Some only appear sporadically, and some may only be apparent when compiled on a different platform. Splint is able to detect many memory management errors at compile time, like: using storage that may have been deallocated, memory leaks or returning a pointer to stack-allocated storage. This is possible because Splint makes a memory model when is working on the code, anyway I will not talk about this subject.

Splint can see and report deallocating storage when there are other live references to the same storage and failing to deallocate storage before the last reference to it is lost. The solution here is simple, Splint makes an obligation to release storage and attach this obligation to the reference to which the storage is assigned.

For references not shared Splint uses the only annotation to indicate that a reference is the only pointer to the object it points to, here we also use the null annotation to say that the output of malloc could be NULL:

/* @only@ */ /* @null@ */ void * malloc ( size_t size );

Here is a simple example of C annotated code with memory leaks and use after free variables:

1 extern /*@only@*/ int *glob;
/*@only@*/ int *f (/*@only@*/ int *x, int *y, int *z) /*@globals glob;@*/ {
8    int *m = (int *)
9    malloc (sizeof (int));
11   glob = y;   //Memory leak
12   free (x);
13   *m = *x;   //Use after free
14    return z;   //Memory leak detected
}

Here is the Splint output for the file above:

> splint only.c
only.c:11: Only storage glob (type int *) not released before assignment: glob = y
only.c:1: Storage glob becomes only
only.c:11: Implicitly temp storage y assigned to only: glob = y
only.c:13: Dereference of possibly null pointer m: *m
only.c:8: Storage m may become null
only.c:13: Variable x used after being released
only.c:12: Storage x released
only.c:14: Implicitly temp storage z returned as only: z
only.c:14: Fresh storage m not released before return
only.c:9: Fresh storage m allocated

Splint errors and warnings are very human readable, so you just have to read them to understand.
The first warning says that variable glob was not released (we say, in line 1, that it is only) before the assignment. In line 13 we have a dereference of possibly null pointer, because in line 12 we have done free to x variable, an now we want to use it’s value. As you can see, is very easy to understand the output of Splint.

Splint can also make verifications in stack based references. A memory error occurs if a pointer into stack is live after the function returns. Splint detects errors involving stack references exported from a function through return values or assignments to references reachable from global variables or actual parameters.
No annotations are needed to detect stack reference errors. It is clear from declarations if storage is allocated on the function stack.

Here is an example of stack-allocated storage:

int *glob;

int *f (int **x) {
    int sa[2] = { 0, 1 };
    int loc = 3;
9   glob = &loc;
10  *x = &sa[0];
12  return &loc;
}

And here is the Splint output:

> splint stack.c
stack.c:12: Stack-allocated storage &loc reachable from return value: &loc
stack.c:12: Stack-allocated storage *x reachable from parameter x
stack.c:10: Storage *x becomes stack
stack.c:12: Stack-allocated storage glob reachable from global glob
stack.c:9: Storage glob becomes stack

Control Flow

In order to avoid certain errors, Splint have to understand the control flow of the program, so Splint do some checks related to control flow. Many of these checks are possible because of the extra information that is known in annotations. Without this additional information Splint assumes that all functions return and execution continues normally.

noreturn annotation is used to denote a function that never returns.

extern /* @noreturn@ */ void fatalerror ( char *s);

We also have maynoreturn and alwaysreturns annotations, but Splint must assume that a function returns normally when checking the code and doesn’t verify if a function really returns.

To describe non-returning functions the noreturnwhentrue and noreturnwhenfalse mean that a function never returns if the first argument is true or false.

/* @noreturnwhenfalse@ */ void assert (/* @sef@ */ bool /* @alt int@ */ pred );

The sef annotation denotes a parameter as side effect free, and the alt int indicate that it may be either a Boolean or an integer.

Undefined Behavior

The order which side effect take place in C is not entirely defined by the code.

A sequence point is some point in the C code where it is guaranteed that all side effects of previous evaluations have been performed.
An example of sequence points is:

  • a function call (after the arguments have been evaluated)
  • at the end of a if, while, for or do statement
  • a &&, || and ?

Here is a simple C program that have undefined behavior.

extern int glob;
extern int mystery (void);
extern int modglob (void) /*@globals glob@*/ /*@modifies glob@*/;

int f (int x, int y[]) {
11    int i = x++ * x;
13    y[i] = i++;
14    i += modglob() * glob;
15    i += mystery() * glob;
16    return i;
}

And here is the Splint output:

> splint order.c
order.c:11: Expression has undefined behavior (value of right operand modified by left operand): x++ * x
order.c:13: Expression has undefined behavior (left operand uses i, modified by right operand): y[i] = i++
order.c:14: Expression has undefined behavior (value of right operand modified by left operand): modglob() * glob
order.c:15: Expression has undefined behavior (unconstrained function mystery used in left operand may set global variable glob used in right operand): mystery() * glob

Regard control flow, Splint has more options to check C code. I will not talk about all of them here.

Buffer Sizes

Buffer overflow errors are a particularly dangerous type of bug in C, they are responsible for half of all security attacks. This is happens because C does not perform runtime bound checking (for performance reasons), and so attackers can exploit program bugs to, for example, gain full access to a machine.

Splint models blocks of memory using two properties:
maxSet, maxSet(b) denotes the highest address beyond b that can be safely used as lvalue, for instance:

char buffer[MAXSIZE] we have maxSet(buffer) = MAXSIZE - 1

and maxRead, maxRead(b) denotes the highest index of a buffer that can be safely used as rvalue.

When a buffer is accessed as an lvalue, Splint generates a precondition constraint involving the maxSet property and when a buffer is accessed as an rvalue, Splint generates a precondition constraint involving the maxRead property.

We can use this two properties in buffer sizes annotations, this annotations uses the clause requires and ensures just like JML, VDM and Frama-C. When a function with requires clause is called, the call site must be checked to satisfy the constraints implied by requires.

Here is an example of its use:

void /* @alt char * @*/ strcpy
(/* @out@ */ /* @returned@ */ char *s1 , char *s2)
/* @modifies * s1@ */
/* @requires maxSet (s1) >= maxRead (s2) @*/
/* @ensures maxRead (s1) == maxRead (s2) @*/;

This is a possible annotation for strcpy standard library C function. We say that s1 is the return value of the function (returned), and that the pointer to s1 is the only one (only). We also say that this function modifies s1 and we specify the pre and post conditions.

Splint can also do bound checking. This is more complex than other checks done by Splint, so, memory bound warnings contain extensive information about the unresolved constraint as you can see in this example:

int buf [10];
buf [10] = 3;
setChar .c :5:4: Likely out -of - bounds store : buf [10]
Unable to resolve constraint : requires 9 >= 10
needed to satisfy precondition : requires maxSet ( buf @ setChar .c :5:4) >= 10

The ultimate test: wu-ftpd

wu-ftpd version 2.5.0 was about 20.000 lines of code and took less than four seconds to check all of wu-ftpd on a 1.2-GHz Athlon machine. Splint detected the known flaws as well as finding some previously unknown flaws (!)

One of the problems of these automatic static analysis tools is that it can produce false problems.
Running Splint on wu-ftpd without adding annotations produced 166 warnings for potential out-of-bounds writes. After adding 66 annotations, it produced 101 warnings: 25 of these indicated real problems and 76 were false.

Pros and Cons

Pros:

  • Lightweight static analysis detects software vulnerabilities
  • Splint definitely improves code quality
  • Suitable for real programs…

Cons:

  • …although it produces more warning messages that lead to confusion
  • It won’t eliminate all security risks
  • Hasn’t been developed since 2007, they need new volunteers

Conclusions

No tool will eliminate all security risks, lightweight static analysis tools (Splint) play an important role in identifying security vulnerabilities.

Here is the presentation about that subject:

Outroduction

All the examples provide here is from the Splint manual.





Deborah Gordon: How do ants know what to do?

2 05 2009

Deborah Gordon is a biologist. She digs up ant colonies in the Arizona desert in search of keys to understanding complex systems.





Carl Orff – O Fortuna Imperatrix Mundi

12 04 2009

Great cantata from Carl Orff, I think everybody knows that one – Carmina Burana.





Easter

11 04 2009

via





Garrett Lisi: A beautiful new theory of everything

10 04 2009

Antony Garrett Lisi is a theoretical physicist best known for “An Exceptionally Simple Theory of Everything”. This TED talk is about that theory. Garrett explains how he find this theory. Garrett Lisi has proposed this new “theory of everything” a grand unified theory that explains all the elementary particles, as well as gravity.





More Hylomorphisms in Haskell

9 04 2009

If you lost yourself in this post, I advise you to start in catamorphisms, then anamorphisms and then hylomorphisms.

Like I said before (in those posts) when you write an hylomorphism over a particular data type, that means just that the intermediate structure is that data type.

In fact that data will never be stored into that intermediate type C or D. Because we glue the ana and cata together into a single recursive pattern. A and E could be some data type your function need. With this post I will try to show you more hylomorphisms over some different data types to show you the power of this field.

Leaf Tree’s

The data type that we going to discuss here is the LTree. In Haskell we can represent LTree as:

data LTree a = Leaf a | Fork (LTree a, LTree a)

Is just like a binary tree, but the information is just in the leaf’s. Even more: a leaf tree is a tree that only have leaf’s, no information on the nodes. This is an example of a leaf tree:

To represent all the hylomorphisms over Ltree we draw the following diagram:

The example I’m going to give is making the fibonacci function using a hylomorphism over this data type. If you remember the method I used before, I’m going to start by the anamorphism [(h)]. Before that I’m going to specify the strategy to define factorial. I’m going to use the diagram’s again, remember that type 1 is equivalent to Haskell ( ):

As you can see I’m going to use Ltree~1 as my intermediate structure, and I’ve already define the names of my gen functions add to the catamorphism and fibd to the anamorphism. The strategy I prefer, is do all the hard work in the anamorphism, so here the gen fibd for the anamorphism is:

fibd n | n < 2     = i1   ()
       | otherwise = i2   (n-1,n-2)

This function combined with the anamorphism, going to generate leaf tree’s with n leaf’s, being n the result of that fib.

Then we just have to write the gen add for the catamorphism. This function (combined with the catamorphism) counts the number of leafs that a leaf tree have.

add = either (const 1) plus
    where plus = uncurry (+)

The final function, the fibonacci function is the hylomorphism of those two defined before:

fib =  hyloLTree add fibd

Here is all the auxiliary functions you need to run this example:

inLTree = either Leaf Fork

outLTree :: LTree a -> Either a (LTree a,LTree a)
outLTree (Leaf a)     = i1   a
outLTree (Fork (t1,t2)) = i2    (t1,t2)

cataLTree a = a . (recLTree (cataLTree a)) . outLTree

anaLTree f = inLTree . (recLTree (anaLTree f) ) . f

hyloLTree a c = cataLTree a . anaLTree c

baseLTree g f = g -|- (f >< f)

recLTree f = baseLTree id f

Lists

The lists that I’m going to talk here, are the Haskell lists, wired into the compiler, but is a definition exist, it will be:

data [a] = [ ] | a : [a]

So, our diagram to represent the hylomorphism over this data type is:

The function I’m going to define as a hylomorphism is the factorial function. So, we know that our domain and co-domain is Integers, so now we can make a more specific diagram to represent our solution:

As you can see I’m going to use [Integer] to represent my intermediate data, and I’ve already define the names of my gen functions mul to the catamorphism and nats to the anamorphism. Another time, that I do all the work with the anamorphism, letting the catamorphism with little things to do (just multiply). I’m start to show you the catamorphism first:

mul = either (const 1) mul'
    where mul' = uncurry (*)

As you can see the only thing it does is multiply all the elements of a list, and multiply by 1 when reach the [] empty list.

In the other side, the anamorphism is generating a list of all the elements, starting in n (the element we want to calculate the factorial) until 1.

nats = (id -|- (split succ id)) . outNat

And finally we combine this together with our hylo, that defines the factorial function:

fac = hylo mul nats

Here is all the code you need to run this example:

inl = either (const []) (uncurry (:))

out []    = i1 ()
out (a:x) = i2(a,x)

cata g   = g . rec (cata g) . out   

ana h    = inl . (rec (ana h) ) . h

hylo g h = cata g . ana h

rec f    = id -|- id >< f

Binary Tree’s

Here, I’m going to show you the hanoi problem solved with one hylomorphism, first let’s take a look at the Btree structure:

data BTree a = Empty | Node(a, (BTree a, BTree a))

So, our generic diagram representing one hylomorphism over BTree is:

There is a well-known inductive solution to the problem given by the pseudocode below. In this solution we make use of the fact that the given problem is symmetrical with respect to all three poles. Thus it is undesirable to name the individual poles. Instead we visualize the poles as being arranged in a circle; the problem is to move the tower of disks from one pole to the next pole in a specified direction around the circle. The code defines H_n.d to be a sequence of pairs (k, d) where n is the number of disks, k is a disk number and d are directions. Disks are numbered from 0 onwards, disk 0 being the smallest. Directions are boolean values, true representing a clockwise movement and false an anti-clockwise movement. The pair (k, d) means move the disk numbered k from its current position in the direction d.

excerpt from R. Backhouse, M. Fokkinga / Information Processing Letters 77 (2001) 71–76

So, here, I will have a diagram like that, b type stands for Bool and i type for Integer:

I’m going to show all the solution here, because the description of the problem is in this quote, and in the paper:

hanoi = hyloBTree f h

f = either (const []) join
    where join(x,(l,r))=l++[x]++r

h(d,0) = Left ()
h(d,n+1) = Right ((n,d),((not d,n),(not d,n)))

And here it is, all the code you need to run this example:

inBTree :: Either () (b,(BTree b,BTree b)) -> BTree b
inBTree = either (const Empty) Node

outBTree :: BTree a -> Either () (a,(BTree a,BTree a))
outBTree Empty              = Left ()
outBTree (Node (a,(t1,t2))) = Right(a,(t1,t2))

baseBTree f g = id -|- (f >< g))

cataBTree g = g . (recBTree (cataBTree g)) . outBTree

anaBTree g = inBTree . (recBTree (anaBTree g) ) . g

hyloBTree h g = cataBTree h . anaBTree g

recBTree f = baseBTree id f

Outroduction

Maybe in the future I will talk more about that subject.