From LALR(1) to LL(k)

2 04 2011

I, Pedro Faria and Jose Pedro Silva have been working on a LALR to LL conversor. Our idea was to discover the problems involved in this type of automatic conversion.
Our main goal here was to receive an LALAR(1) grammar and produce an LL(k) one. As a test case we receive an Yacc and produce ANTLR.
For the purpose of testing we used the LogoLISS grammar, a language inspired in LOGO programming language and developed in my university to school grammars and program transformation.

The main Yacc grammar was kind of a big one, here it is:

Liss            : PROGRAM IDENT '{' Body '}'
                ;
Body            : DECLARATIONS Declarations
                     STATEMENTS Statements
                ;
Declarations    : Declaration
                | Declarations Declaration
                ;
Declaration     : Variable_Declaration
                ;
Variable_Declaration : Vars SETA Type ';'
                     ;
Vars        : Var
            | Vars ',' Var
            ;
Var         : IDENT Value_Var
            ;
Value_Var   :
            | '=' Inic_Var
            ;
Type        : INTEGER
            | BOOLEAN
            | ARRAY SIZE NUM
            ;
Inic_Var    : Constant
            | Array_Definition
            ;
Constant    : '(' Sign NUM ')'
            | TRUE
            | FALSE
            ;
Sign        :
            | '+'
            | '-'
            ;
Array_Definition  : '[' Array_Initialization ']'
                  ;
Array_Initialization  : Elem
                      | Array_Initialization ',' Elem
                      ;
Elem        : Sign NUM
            ;
Statements      : Statement
                | Statements Statement
                ;
Statement   : Turtle_Commands
            | Assignment
            | Conditional_Statement
            | Iterative_Statement
            ;
Turtle_Commands  : Step
                 | Rotate
                 | Mode
                 | Dialogue
                 | Location
                 ;
Step     : FORWARD Expression
         | BACKWARD Expression
         ;
Rotate   : RRIGHT
         | RLEFT
         ;
Mode     : PEN UP
         | PEN DOWN
         ;
Dialogue : Say_Statement
         | Ask_Statement
         ;
Location : GOTO NUM ',' NUM
         | WHERE '?'
         ;
Assignment      : Variable '=' Expression
                ;
Variable        : IDENT Array_Acess
                ;
Array_Acess     :
                | '[' Single_Expression ']'
                ;
Expression      : Single_Expression
                | Expression Rel_Op Single_Expression
                ;
Single_Expression  : Term
                   | Single_Expression Add_Op Term
                   ;
Term        : Factor
            | Term Mul_Op Factor
            ;
Factor      : Constant
            | Variable
            | SuccOrPred
            | '(' '!' Expression ')'
            | '(' '+' Expression ')'
            | '(' '-' Expression ')'
            | '(' Expression ')'
            ;
Add_Op      : '+'
            | '-'
            | OU
            ;
Mul_Op      : '*'
            | '/'
            | E
            | MULTMULT
            ;
Rel_Op      : IGUAL
            | DIF
            | '<'             | '>'
            | MENORIGUAL
            | MAIORIGUAL
            | IN
            ;
SuccOrPred  : SuccPred IDENT
            ;
SuccPred    : SUCC
            | PRED
            ;
Say_Statement  : SAY '(' Expression ')'
               ;
Ask_Statement  : ASK '(' STRING ',' Variable ')'
               ;
Conditional_Statement : IfThenElse_Stat
                      ;
Iterative_Statement : While_Stat
                    ;
IfThenElse_Stat     : IF Expression THEN '{' Statements '}' Else_Expression
                    ;
Else_Expression     :
                    | ELSE '{' Statements '}'
                    ;
While_Stat     : WHILE '(' Expression ')' '{' Statements '}'
               ;

This is a toy grammar to help students to understand and get a deep knoledge of some fundamental types of productions in grammars.
The kind of texts this grammar is able to analyse are:

PROGRAM logolissExample {
    DECLARATIONS
    x = (100) , y -> Integer ;
    z -> Boolean ;
    w = TRUE -> Boolean ;
    STATEMENTS
        FORWARD x
        RRIGHT
        y = (100)
        FORWARD e
        RRIGHT
        x = x - (100) + (20)
        FORWARD x + (100)
        RRIGHT
        FORWARD (100)
}

Rules

After some time thinking about this problem we tried to solve the left recursion problem. ANTLR does not know how to handle with left recursion, so we must use the EBNF support to translate this productions.

Grammar rules in BNF provide for concatenation and choice but no specific operation equivalent to the * of regular expressions are provided. In Yacc the only way we get repetition is using the following pattern:

A : A a | a ;

We call this kind of rules a left recursive rule.

So, we discover this two generalized rules to remove left recursion in Yacc grammars (click on the image to expand):

\begin{tabular}{|l|l|c|}\hline\textbf{LALR (BNF)} & \textbf{LL (EBNF)} & Nome da Regra\\\hline\begin{tabular}{ccl}vars & : & var\\& | & vars a b c d e \ldots\end{tabular}&\begin{tabular}{ccl}vars & : & var (a b c d e \ldots)+\end{tabular}&removeLeftRecursionPlusExt\\\hline\begin{tabular}{ccl}vars & : & \\&| & vars a b c d e \ldots\end{tabular}&\begin{tabular}{ccl}vars & : & (a b c d e \ldots)*\end{tabular}&removeLeftRecursionTimesExt\\\hline\end{tabular}

Implementation

So, we start right over the implementation of the conversor. We used Yapp because lately we have been using a lot of Perl, so we are want to get deeper into Perl packages.

We start to implement the definition of a Yacc grammar in Yapp:

productions : production             
            | productions production 
            ;
production  : nonTerminal ':' derivs ';' 
            ;
derivs      : nts            
            | derivs '|' nts 
            |
            ;
nts         : nt     
            | nts nt 
            ;
nt          : terminal    
            | nonTerminal 
            | sep         
            ;
terminal    : STRING_TERMINAL     
            ;
nonTerminal : STRING_NON_TERMINAL
            ;
sep         : SEPARATOR 
            ;

And we move along to represent this information in a structure. We chose an peculiar structure to help us processing it.
This is the structure defined as Haskell types:

type Grammar = [Production]
type Production = HashTable ProductionName [Derivation]
type Derivation = [HashTable NTS Type]
type Type = String
type NTS = String

Easy to find stuff, easy to implement in Perl 🙂
So, this is our Yapp grammar again but this time with semantic actions:

productions : production             { return $_[1]; }
            | productions production {
                                       push @{$_[1]},@{$_[2]};
                                       return $_[1];
                                     }
            ;

production  : nonTerminal ':' derivs ';' { return [ { $_[1] => $_[3] } ]; }
            ;

derivs      : nts            { return [$_[1]]; }
            | derivs '|' nts {
                               push @{$_[1]},$_[3];
                               return $_[1];
                             }
            |                { return [[ { 'epsilon' => 'epsilon' } ]]; }
            ;
nts         : nt     { return $_[1]; }
            | nts nt {
                       push @{$_[1]},@{$_[2]};
                       return $_[1];
                     }
            ;
nt          : terminal    {
                            return [ { $_[1] => 'terminal' } ];
                          }
            | nonTerminal {
                            return [ { $_[1] => 'nonTerminal' } ];
                          }
            | sep         {
                            return [ { $_[1] => 'sep' } ];
                          }
            ;
terminal    : STRING_TERMINAL     {
                                    return $_[1];
                                  }
            ;
nonTerminal : STRING_NON_TERMINAL {
                                    return $_[1];
                                  }
            ;
sep         : SEPARATOR           {
                                    return $_[1];
                                  }
            ;

After we had this structure filled in we applied the two rules shown before and we get rid of left recursion for good.

So, after we process the following grammars with our transformer:

a   : a B C
    | //nothing
    ;
expr : expr "+" expr
     | expr "-" expr
     | expr "*" expr
     ;

We get:

a   : (B C)*
    ;
expr : ( "+" expr)+
     | ( "-" expr)+
     | ( "*" expr)+
     ;

You can see all the code produced (including makefiles) in my github repo.

Other rules

We have noticed that only removing the left recursion makes this grammar work in ANTLR.
We are aware that to fully convert an LALR grammar to LL we need one other step: Left Factoring.

Because our LL system is k-token lookahead we does not have to mind about ambiguous grammars.
Examples like this are not a problem for our destination system.

expr: T A B
    | T C D
    ;

However if our destination grammar was LL(1) we needed to convert this to it’s EBNF form.

Lexer

We also translated the Lex used in flex to ANTLR. The main problem here is that we used only matching functions in Perl and we do not used any grammar for this effect.
The main problem is that the regular expressions in flex are different to regexps in ANTLR.
You can see the Perl code for yourself here.
But you if you want to do a translation from flex to ANTLR better you define a flex grammar.

Advertisements




Permutations in pure functional ActionScript

2 04 2011

I have been working with ActionScript3 and Flex to do apps for Android. I have never used this languages before, so in the beginning I was with my hart open to understand this languages, specially ActionScript3, the only thing I knew about ActionScript is that is used a lot for flash animations and that was it.

At a first glance ActionScript3 seemed to me to be a standard object oriented language, but after I grasp a little bit more and I found that the version 3 has a lot of functional flavor in it.
I started to find more and more examples of it: definition of high-order functions, map, filter and so…

I have to say that I deeply continue to prefer the Haskell notation with concern to high order and the 2 functions, but the point of ActionScript3 was never be totally functional, but incorporate some features of it.

So, outside of work I remember the beautiful definition of permutations in Haskell and I start try to implement the same algorithm in ActionScript3.

perms xs = [x:ps | x <- xs, ps <- perms (xs\\[x])]

For the non Haskell programmers I will explain: We receive the list xs and here use the list comprehension notation, basically we can translate a list comprehension in the form:
[f~x | x \leftarrow list] as a map function this way map~f~list. So, that’s basically the first part of our definition. The second one [x:ps | x \leftarrow list, ps \leftarrow perms(list_2)] is the cartesian product of the two lists.
Finally the definition of the function xs//[x] is simply the exclusion of the element x from the list xs.

So, I first came out with this solution:

public static function perms(xs:Array):Array {
    if(xs.length == 0) {
        return [[]];
    }   
    return xs.map(
        function(x:*,index:uint,array:Array):* {
            var ret:Array = new Array();
            var p:Array = perms( exclude(xs,x) );
            for(var j:uint=0 ; p[j] ; j++) {
                var ps:Array = p[j] as Array;
                ps.unshift(x);
                ret.push(ps);
            }   
            return ret;
        }   
    );  
}

public static function exclude(xs:Array, elem:*):Array {
    return xs.filter(
        function(item:*,index:uint,array:Array):Boolean { return item != elem; }
    );  
}

I had to define the exclude function, because it does not exist in ActionScript3 and I understand that.
The code is not as pretty as you expect to be in Haskell or other functional language, but is pretty much the same, as you ca see in the following definition, where I only use map.

public static function permss(xs:Array):Array {
    if(xs.length == 0) {
        return [[]];
    }
    return xs.map(
        function(x:*,index:uint,array:Array):* {
            var p:Array = permss( exclude(xs,x) );
            return p.map(
                function(y:*,_:uint,__:Array):* {
                    var ps:Array = y as Array;
                    ps.unshift(x);
                    return ps;
                }
            );
        }
    );
}
public static function exclude(xs:Array, elem:*):Array {
    return xs.filter(
        function(item:*,index:uint,array:Array):Boolean { return item != elem; }
    );
}




Correct sorting with Frama-C and some thoughts on Formal Methdos

12 02 2011

A couple of years ago, during my masters on Formal Methods I have been working with automatic provers and I also used Frama-C, this is a tool that allow the user to prove C code directly in the source code, using a special notation in the comments, called ACSL notation.

Frama-C allows you to make two kinds of proofs, security and safety ones. The safety ones are related with arrays index out of bounds access, and so. This kind of proofs are related to the language itself and they are easy to do if you use loop invariants, pre and post conditions.
If you use a high level language, like JAVA you won’t have almost none safety problems.
Because C is too close to machine level code, we can do things that we do not intend (or maybe we do and we use C exactly because it allows this kind of things). For example:

// foo.c file
#include <stdio.h>

int main() {
    char *a = "I like you";
    char *b = "I hate you";
    
    if(&a < &b) a = *(&a + 1); 
    else        a = *(&a - 1); 

    printf("%s\n", a); 
}

As you can see, I never used the b variable for nothing, just have declared it. And the result is:

[ulissesaraujocosta@maclisses:c]-$ gcc -o foo foo.c 
[ulissesaraujocosta@maclisses:c]-$ ./foo
I hate you

This lack of security of language C is one of the reasons we need to write safety statements. Of course this kind of things is why C is so fast and powerful, the person in charge is always the programmer. If you are interested in this kind of tricks and want to understand more about this and smashing the stack and so, feel free to read more posts in my blog about this subject.

The other kind of statements (security ones) are related to the functionality of the program and that’s basically where the problem or the effort is, I will talk about this later on. First let’s see the algorithm and the implementation in C.

Code

The algorithm I use here is just a simple example. I used bubble sort, this is a sort algorithm not very efficient, but it uses none more memory then the needed to store the structure you want to sort.
To get a visual understanding of the algorithm (and to see it inefficiency) check out this youtube video.

This is the implementation of the algorithm:

void swap(int *i, int *j) {
    int tmp = *i;
    *i = *j;
    *j = tmp;
}

void bubbleSort(int *vector, int tam) {
    int j, i;
    j = i = 0;
    for(i=0; i<tam; i++) {
		for(j=0; j<tam-i-1; j++) {
            g_swap = 0;
            if (vector[j] > vector[j+1]) {
                swap(&vector[j],&vector[j+1]);
            }
        }
    }
}

Pre, Post conditions and thinking formally

So, as you can see in the video (or in the code) the algorithm is pretty much simple, we pick the i element cross the array n times and for each element we compare with i, this n times.

We have as pre conditions: The size of the vector must be greater than zero, and all the positions in that vector exists, so in Frama-C we use the valid\_range(vector, i, j), where i and j are indexes of the vector to say that all elements exist.

tam > 0

valid\_range(vector,0,tam-1)

Ans as pos conditions we must ensure that the array is sorted ( I will talk predicate this later).
You may think that this by itself is enough to make a complete proof, but you are wrong. Image that my function clear all the elements in the array and fill the array with \{1,2,..,tam\}, our code will be proved and its wrong!

So, we need to say more… First thing that can pop to your head is OK, we will say that we have the same numbers in the beginning and in the end and you write this:
\forall_a : 0 \leq a < tam : (\exists_b : 0 \leq b < tam : old(vector(b)) \equiv vector(a))

In fact this is closer (not yet right), imagine that you give as input:
\{4,7,9,1,0,3,4\}. If your code returns \{0,1,3,4,7,9\} (we miss the repeated 4) the code will be proved.
So, the solution if to make a Permut predicate and prove for the multi set.
So, this are the post conditions:

sorted(vector,0,tam-1)

Permut\{Old,Here\}(vector,0,tam-1);

Frama-C is so cool because for example at the pos condition if we want to refer to the state in the beginning (before call the function) we use Old and if we want to refer to the moment after the call we heave the Here keyword, remember we are at the post condition, so this wil be executed in the end (so Here means the end of the function call).

Predicates

So, here is the Sorted predicate. Predicates receive a state L and the parameters (just like a function) and they return bool values (true or false). Inside we use regular ACSL notation. Here I define that for an array to be sorted each element must be less or equal to the next one.

/*@ predicate Sorted{L}(int a[], integer l, integer h) =
  @   \forall integer i; l <= i < h ==> a[i] <= a[i+1];
  @*/

The Permut is defined inductively, so we receive two states L1 and L2 and the array a and the range where we want to permute.
We write multiple rules for the permutation, reflection, symmetry, transitivity and finally the most important one, the Swap. So basically here we say that a permutation is a set of successive swaps.

/*@ inductive Permut{L1,L2}(int a[], integer l, integer h) {
  @  case Permut_refl{L}:
  @   \forall int a[], integer l, h; Permut{L,L}(a, l, h) ;
  @  case Permut_sym{L1,L2}:
  @    \forall int a[], integer l, h;
  @      Permut{L1,L2}(a, l, h) ==> Permut{L2,L1}(a, l, h) ;
  @  case Permut_trans{L1,L2,L3}:
  @    \forall int a[], integer l, h;
  @      Permut{L1,L2}(a, l, h) && Permut{L2,L3}(a, l, h) ==>
  @        Permut{L1,L3}(a, l, h) ;
  @  case Permut_swap{L1,L2}:
  @    \forall int a[], integer l, h, i, j;
  @       l <= i <= h && l <= j <= h && Swap{L1,L2}(a, i, j) ==>
  @     Permut{L1,L2}(a, l, h) ;
  @ }
  @
  @ predicate Swap{L1,L2}(int a[], integer i, integer j) =
  @      \at(a[i],L1) == \at(a[j],L2)
  @   && \at(a[j],L1) == \at(a[i],L2)
  @   && \forall integer k; k != i && k != j ==> \at(a[k],L1) == \at(a[k],L2);
  @*/

So, as you can see the bubble sort function itself have 18 lines of code, and in the end with the annotations for the proof we end with 90 lines, but we proved it!

Thoughts

My main point here is to show the thinking we need to have if we want to prove code in general. Pick what language you want, this is the easiest way you will have to prove software written in C. Sometimes if your functions are too complex you may need to prove it manually. The problem is not on the Frama-C side, Frama-C only generates the proof obligations to feed to automatic provers, like Yices, CVC3, Simplify, Z3, Alt-Ergo and so.

My point here is to show the cost of proving software. Proving software, specially if the language is too low level (like C – you need to care about a lot more things) is hard work and is not easy to a programmer without theoretical knowledge.
On the other side, you end up with a piece of software that is proved. Of course this proof is always requirements oriented, ny that I mean: if the requirements are wrong and the program is not doing what you expect the proof is along with that.
I do not stand to proof of all the code on the planet, but the proper utilization of FM (formal methods) tools for critical software.

I steel been using Frama-C since I learned it in 2009, nowadays I use it for small critical functions (because I want, I’m not encouraged to do so) and I have to say that the use of FM in the industry is far. As I told you Frama-C is the easiest automatic proof tool you will find at least that I know.

Talking with Marcelo Sousa about the use of FM in industry, we came to the conclusion that the people that are making this kind of tools and have the FM knowledge don’t make companies. I think if more brilliant people like John Launchbury make companies, definitely FM will be more used.

Source code

Here is all the code together if you want to test it:

// #include <stdio.h>

/*@ predicate Sorted{L}(int a[], integer l, integer h) =
  @   \forall integer i; l <= i < h ==> a[i] <= a[i+1];
  @
  @ predicate Swap{L1,L2}(int a[], integer i, integer j) =
  @      \at(a[i],L1) == \at(a[j],L2)
  @   && \at(a[j],L1) == \at(a[i],L2)
  @   && \forall integer k; k != i && k != j ==> \at(a[k],L1) == \at(a[k],L2);
  @*/

/*@ inductive Permut{L1,L2}(int a[], integer l, integer h) {
  @  case Permut_refl{L}:
  @   \forall int a[], integer l, h; Permut{L,L}(a, l, h) ;
  @  case Permut_sym{L1,L2}:
  @    \forall int a[], integer l, h;
  @      Permut{L1,L2}(a, l, h) ==> Permut{L2,L1}(a, l, h) ;
  @  case Permut_trans{L1,L2,L3}:
  @    \forall int a[], integer l, h;
  @      Permut{L1,L2}(a, l, h) && Permut{L2,L3}(a, l, h) ==>
  @        Permut{L1,L3}(a, l, h) ;
  @  case Permut_swap{L1,L2}:
  @    \forall int a[], integer l, h, i, j;
  @       l <= i <= h && l <= j <= h && Swap{L1,L2}(a, i, j) ==>
  @     Permut{L1,L2}(a, l, h) ;
  @ }
  @*/

/*@ requires \valid(i) && \valid(j);
  @ //assigns *i, *j; //BUG 0000080: Assertion failed in jc_interp_misc.ml
  @ ensures \at(*i,Old) == \at(*j,Here) && \at(*j,Old) == \at(*i,Here);
  @*/
void swap(int *i, int *j) {
    int tmp = *i;
    *i = *j;
    *j = tmp;
}

/*@ requires tam > 0;
  @ requires \valid_range(vector,0,tam-1);
  @ ensures Sorted{Here}(vector, 0, tam-1);
  @ ensures Permut{Old,Here}(vector,0,tam-1);
  @*/
void bubbleSort(int *vector, int tam) {
    int j, i;
    j = i = 0;
    //@ ghost int g_swap = 0;

  /*@ loop invariant 0 <= i < tam;
    @ loop invariant 0 <= g_swap <= 1;
    //last i+1 elements of sequence are sorted
    @ loop invariant Sorted{Here}(vector,tam-i-1,tam-1);
    //and are all greater or equal to the other elements of the sequence.
    @ loop invariant 0 < i < tam ==> \forall int a, b; 0 <= b <= tam-i-1 <= a < tam ==> vector[a] >= vector[b];
    @ loop invariant 0 < i < tam ==> Permut{Pre,Here}(vector,0,tam-1);
    @ loop variant tam-i;
    @*/
    for(i=0; i<tam; i++) {
      //@ ghost g_swap = 0;
      /*@ loop invariant 0 <= j < tam-i;
        @ loop invariant 0 <= g_swap <= 1;
        //The jth+1 element of sequence is greater or equal to the first j+1 elements of sequence.
        @ loop invariant 0 < j < tam-i ==> \forall int a; 0 <= a <= j ==> vector[a] <= vector[j+1];
        @ loop invariant 0 < j < tam-i ==> (g_swap == 1) ==> Permut{Pre,Here}(vector,0,tam-1);
        @ loop variant tam-i-j-1;
        @*/
		for(j=0; j<tam-i-1; j++) {
            g_swap = 0;
            if (vector[j] > vector[j+1]) {
                //@ ghost g_swap = 1;
                swap(&vector[j],&vector[j+1]);
            }
        }
    }
}

/*@ requires \true;
  @ ensures \result == 0;
  @*/
int main(int argc, char *argv[]) {
    int i;
    int v[9] = {8,5,2,6,9,3,0,4,1};

    bubbleSort(v,9);

//     for(i=0; i<9; i++)
//         printf("v[%d]=%d\n",i,v[i]);

    return 0;
}

If you are interested in the presentation me and pedro gave at our University, here it is:





Lines of Code and related lines-oriented-statistics with Perl

6 02 2011

A project I’m involved right now is making a Static code analyzer. The main goal is to produce a RoR front-end webapp with the capability to submit code ad analyze it statically. Our main goal languages are C/C++, but we will attach other tools to do some work for other languages. For Static I mean without the need to run the program. One thing we will never be able to answer with Static analysis is the behavior of the program, but we can answer some things that with dynamic analysis we can not, so there is not such a think like one is better or more complete than the other.

The RoR front-end is almost done and quick contribution I gave was a Perl script that receives a folder as input and analyzes all the source code available inside the folder (recursive). You can find the README.markdown under the same folder.
This analysis is just oriented to quantity of lines of code.

How to use

Well, if you don’t want to read the README.markdown file and experiment the script I will show you the output and the command you need to produce this images.

So, by default you only need to say the input folder and the output prefix name for the images, like so we an say:

[ulissesaraujocosta@maclisses:trab1]-$ perl count.pl -open ../../../Static-Code-Analyzer/ -out work

This will produce the following 3 images (click on the images to enlarge):



By default the script always produce this 3 images: number of files per language, number of lines per language and the ratio between them (the average of lines per file, per language).

We can also see this values in percentage, related to overall project (folder).
Image number 3 will be the same, because does not make sense ratio percentage.

[ulissesaraujocosta@maclisses:trab1]-$ perl count.pl -open ../../../Static-Code-Analyzer/ -out work_percent -percent



And my favorite, produce an image with a overall picture of the percentage use of each language on the project.

[ulissesaraujocosta@maclisses:trab1]-$ perl count.pl -open ../../../Static-Code-Analyzer/ -out work_All -all

With this script you can also generate pie charts, by default it uses bars charts, please read the README.markdown for more information.

How to improve and support more languages?

If you want to improve the script feel free to fork on github and maybe we can discuss more about the script.
To support more languages you just have to add one more entrance in the hashtable and write:

extension => {"nrFiles" => 0, "nrLines" => 0, "comments" => function_to_catch_comments, "nrComments" => 0, "percentageNrFiles" => 0, "percentageNrLines" => 0, "percentageNrComments" => 0}

This is an example of the entrace for C++:

"cpp"  => {"nrFiles" => 0, "nrLines" => 0, "comments" => sub { return shift =~ m/(\*(.|\n|\r)*?\*)|(^[ \t\n]*\/\/.*)/; },    "nrComments" => 0,
                         "percentageNrFiles" => 0, "percentageNrLines" => 0, "percentageNrComments" => 0
                        },

This is a simple presentation I gave about the module I used: GD::Graph





A* search algorithm

1 01 2011

Time to talk about efficiently pathfinding and graph traversal algorithms. The first algorithm related with graphs pathfinding I learned was Dijkstra’s algorithm and I remember the feeling for learning how to find the shortest path (minimal cost to be generic) in a graph, it was amazing to me, so that I learned a few more and did a simple academic GPS core system.

Dijkstra’s algorithm is truly beautiful, but unfortunately the complexity it too high to be considered time efficient.

If you want to go from point A to B Dijkstra’s wil search all the surrounding nodes, as you can see in this image:

So I start to find more efficient implementations for the pathfinding problem and I discover A Star.

A Star algorithm

I find this algorithm in wikipedia, and I will paste it here because there are a few things I want to explain.

function A*(start,goal)
     // The set of nodes already evaluated.
     closedset := the empty set
     // The set of tentative nodes to be evaluated.
     openset := set containing the initial node
    // The map of navigated nodes.
     came_from := the empty map                 
    // Distance from start along optimal path.
     g_score[start] := 0                        
     h_score[start] := heuristic_estimate_of_distance(start, goal)
    // Estimated total distance from start to goal through y.
     f_score[start] := h_score[start]           
     while openset is not empty
         x := the node in openset having the lowest f_score[] value
         if x = goal
             return reconstruct_path(came_from, came_from[goal])
         remove x from openset
         add x to closedset
         foreach y in neighbor_nodes(x)
             if y in closedset
                 continue
             tentative_g_score := g_score[x] + dist_between(x,y)
             if y not in openset
                 add y to openset
                 tentative_is_better := true
             elseif tentative_g_score < g_score[y]
                 tentative_is_better := true
             else
                 tentative_is_better := false
             if tentative_is_better = true
                 came_from[y] := x
                 g_score[y] := tentative_g_score
                 h_score[y] := heuristic_estimate_of_distance(y, goal)
                 f_score[y] := g_score[y] + h_score[y]
     return failure
 
 function reconstruct_path(came_from, current_node)
     if came_from[current_node] is set
         p = reconstruct_path(came_from, came_from[current_node])
         return (p + current_node)
     else
         return current_node

If you are familiarized with Dijkstra’s algorithm you probably noticed a lot of coincidences in the algorithm. You are right!
So, we have some structures to use: the closedset is the set of nodes already evaluated by A Star, openset containing the nodes being evaluated, g\_score being the commulative distance to this node, h\_score being the heuristic result for this node (I will explain this in a minute) and f\_score being the sum of g and h.

A good thing about A Star is the nodes it needs to search until find the Best-First-Search path:

Seems good right? All the juice of this algorithms lies on the heuristic function, I like the result of the Manhattan distance, but you can read this blog post and find out more about this subject.
Basically this heuristic is empirical knowledge, this particular Manhattan distance calculates the distance from point A to point B in a grid.

I tried to use it in a Geometric graph and it worked fine too!

Point of view

I’m particularly interested in optimizing this algorithm as much as possible and will be doing that using C++.
So, I spend 30 minutes observing the code and I was already very familiarized with Dijkstra’s. I think as a Software Engineer you have to find good algorithms to your problem, deeply understand them, but your job is not done after that! After find a good solution and understand it to the point you are able to explain it to a non-CS person you have to *observe* the code, talk with it and, believe me, don’t make the literal implementation of it, It will be slow, or at least it probably could be more efficient!

So, let’s forget about the problem this algorithm solves and try to identify inefficient chunks in the algorithm. The first thing that cames to my mind is: we need to have a bunch of structures to keep a lot of node related information, a lot of vectors, sets and so.

So, let’s identify what I mean by that:

function A*(start,goal)
...
     while openset is not empty
...
         x := the node in openset having the lowest f_score[] value
...
         foreach y in neighbor_nodes(x)
...
                 g_score[y] := tentative_g_score
                 h_score[y] := heuristic_estimate_of_distance(y, goal)
                 f_score[y] := g_score[y] + h_score[y]

With this chunk of code I want to highlight that we are iterating throw all neighbors for each x belongs to openset, then we get the minimum f from openset and change the g,h,f arrays for y.

The first thing hitting me is, I can make openset a MinHeap and I can keep all the information for each *\_score in the node itself, like so I won’t be wasting time in accessing positions in a set and I just make a question to the node object.

So, I start to put all this information in the nodes side and keep only track of the locally created minHeap. This is the result:

LinkedList<Edge*> AStar::findPath(Node* start, Node* goal) {
    LinkedList<Edge*> ret;
    MinHeap<Node*, Comparator> openSet;
    bool tentative_is_better = false;

    float h = heuristic(start,goal);

    start->setStatus(NODE_OPEN);
    start->getOrSetScore().set(NULL, h,0,h);
    openSet.insert(start);

    while(!openSet.isEmpty()) {
        Node x = openSet.getMin();
        AStarScore & xScore = x->getOrSetScore();

        if(x == goal) return process(ret,x);
        openSet.removeMin();
        x->setStatus(NODE_CLOSE);
        ArrayList<Edge> & neighbors = x->getEdges();
        for(int i = 0; i < neighbors.getLength (); i++) {
            Node *y = neighbors [i]->getDstNode();
            AStarScore & yScore = y->getOrSetScore();
            GeoNodeStatus yStatus = y->getStatus();

            if(yStatus == NODE_CLOSE) {
                continue;
            }   

            float tentative_g_score = xScore.g_score + x->getEdge(y)->getCost();

            if(yStatus != NODE_OPEN) {
                y->setStatus(_version,GEONODE_OPEN);
                tentative_is_better = true;
            } else if(tentative_g_score < yScore.g_score) {
                tentative_is_better = true;
            } else {
                tentative_is_better = false;
            }   
            if(tentative_is_better) {
                yScore.parent = x;
                yScore.g_score = tentative_g_score;
                yScore.h_score = heuristic(y,goal);
                yScore.f_score = yScore.g_score + yScore.h_score;
                openSet.insert(y);
            }
        }
    }
    return process(ret,goal);
}

Where process is the function that iterate throw the nodes, starting with goal and go the parent until start node and construct the path in reverse order, from start to goal.

One side note, A star is so close to Dijkstra, that if you make you heuristic function return always zero, A star will work just like Dikjstra (the first image).

Basically the ideas I want to transmit here is the A star algorithm, it implementation and (the most important one) the process of how to look to an algorithm. Summarizing I think a Software Engineer could not just implement algorithms and substitute all the words insert(lista,elem) in the algorithm for list.push\_back(elem) in C++ and so. I think we should look for an algorithm as a valuable aid and if we convert it to code we must improve it. Ultimately copying an algorithm to code is a good opportunity to leave our personal touch in the code.

Acknowledgments

Images from here.





Memory Leak detector

1 12 2010

When you are writing a program and you want it to be able to run for a long period of time or when some procedure is called a x times in the lifetime of running program (let x tend to infinity) maybe is a good opportunity to make sure all your malloc calls have one free call (new,delete in C++).This sounds very natural to any experienced programmer.

Now imagine that you have download some huge chunk of code and you want to call it x times, lets say 100 times per second. Maybe the person who wrote the code, did not think about memory management and maybe you do not have the patience to read the huge piece of code. What you need is a memory leak detector.

If you do not have any of these problems might be interested in knowing how to gain millions of euros on the internet.

Here I will show an example of how to write a simple memory leak detector in C++ for C (malloc,free), is trivial to convert it to verify memory leaks in C++.

The main idea is to keep the line number of each malloc we do and then verify if we did free, we will accomplish this by redefining the malloc and free call, without need to write our own free or malloc, thanks to C pre processor.

First we define an entity, this keep all the information we need about each malloc call.

class Entity {
private:
	unsigned int _address;
	unsigned int _line;
public:
	Entity(unsigned int addr, unsigned int line) {
		this->_address = addr;
		this->_line = line;
	}
	unsigned int & getOrSetAddress() {
		return this->_address;
	}
	unsigned int getLine() {
		return this->_line;
	}
};

typedef std::list<Entity *> Memory;
Memory *memory;

I used the standard C++ lists instead of std::vector because I will need to remove and add elements, without the need to call find in algorithms module, what bugs me.

Now I will define the function to add elements to our std::list

void InsertOnMalloc(unsigned int addr, unsigned int n) {
	if(!memory)
		memory = new Memory;

	memory->insert(memory->begin(), new Entity(addr,n));
}

Now I will define the function to remove elements to our std::list

void RemoveOnFree(unsigned int addr) {
	if(!memory)
		return;
	for(Memory::iterator it = memory->begin(); it != allocList->end(); it++)
	{
		Entity *entity = *it;
		if(entity->getOrSetAddress() == addr) {
			memory->remove();
			break;
		}
	}
}

The idea so far is to call InsertOnMalloc for each malloc call and RemoveOnFree for each free. After this if we have any element in our memory we need to print that out, to see which line we do not call free.

void ShowLeaks() {
	for(Memory::iterator it = memory->begin(); it != memory->end(); it++)
		std::cout << "Must make free to this malloc call: " << (*it)->getLine() << std::endl;
}

Now, lets do the real juice of this leak detector: here I will modify the malloc call to a regular malloc followed by an insertion, and a free call for a regular free call followed by removal on our memory list

#define malloc(p) mallocz((p),__LINE__)
#define free(p) freez((p))

static inline void *  mallocz(unsigned int size, int line) {
	#undef malloc
	void *ptr = (void *)malloc(size);
	#define malloc(p) mallocz((p),__LINE__)
	Insert((unsigned int)ptr, line);
	return(ptr);
};
static inline void  mydelz(void *p) {
	RemoveOnFree((int)p);
	#undef free
	free(p);
	#define free(p) freez((p))
};

Now, if you want to use this “library” in your code, you save it to file filename.h and on you code you just import it and then call ShowLeaks() to show what mallocs you forgot to make free.





Splint the static C code checker

3 05 2009

Splint is this great tool to statically analyze C code. Splint is an evolution of Lint. Lint makes analysis of C code and verifies unused declarations, type inconsistencies, use before definitions, unreachable code, ignored return values, execution path with no return and infinite loops. However, Lint was not a sufficiently powerful tool. Splint was created as a descendant of Lint allowing annotations in C code, so, it do more checks than its predecessor. In this post I will present some verifications that Splint does, and show the philosophy and how to use it. Splint allows annotations on functions, variables, parameters and types.

Undefined Variables

Splint detects instances where the value of a variable is used before it is defined. Anyway, many static analyzers can also do this, it is a very basic check. However, thanks to the splint annotations supports, they can be used to describe what storage must be defined and what storage must be undefined at interface points, this means that all storage reachable from a global variable, parameter to a function, or function return value is defined before and after the function call.

A special case of undefined variables is undefined function parameters. Sometimes we do a certain C function that return values, or change parameters values. The out annotation denotes a pointer to storage that may be undefined. Splint have a great storage model, so does not report an error when a pointer to allocated but undefined storage is passed as an out parameter. If in the body of a function an outparameter is allocated but not bounded to a value Splint reports an error. You can see out as an parameter that will be defined inside a function. The opposite happens to the in annotation, this can be used to denote a parameter that must be completely defined.
Here is a small example of it:

extern void setVal (/*@out@*/ int *x);
extern int getVal (/*@in@*/ int *x);
extern int mysteryVal(int *x);

int dumbfunc(/*@out@*/ int *x, int i) {
    if (i > 3)
11         return *x;
    else if (i > 1)
13         return getVal (x);
    else if (i == 0)
15         return mysteryVal (x);
    else {
18         setVal (x);
19         return *x;
    }
}

And here it is the output of Splint:

> splint usedef.c
usedef.c:11: Value *x used before definition
usedef.c:13: Passed storage x not completely defined
    (*x is undefined): getVal (x)
usedef.c:15: Passed storage x not completely defined
    (*x is undefined): mysteryVal (x)
Finished checking --- 3 code warnings

As you can see, no error is reported for line 18, because the incomplete defined storage x is passed to a function that will define it, so we can return *x with no problems.

Types

Strong type checking often reveals programming errors. Splint can check primitive C types more strictly and flexibly than typical compilers.

C, does not make types verification. Because for C all basic non numerical types are just int’s (Enum, Pointer, Char). Splint supports stricter checking of built-in C types. The char and enum types can be checked as distinct types, and the different numeric types can be type-checked strictly. Also the primitive char type can be type-checked as a distinct type. If char is used as a distinct type, common errors involving assigning ints to chars are detected.

In Splint, you can always turn off some default verifications, and Splint will assume the C compiler strategy to verify the code. If you run Splint with +charint is on, char types are indistinguishable from ints. So you can make cast’s from int to char or from cahr to int. But never forgot that this can lead to errors, as you can imagine if you do (char) 1000 you will not get an expected result because the characters only go up to 256.

Splint reports errors in the use of Enums, if a value that is not an enumerator member is assigned to the enum type or if an enum type is used as an operand to an arithmetic operator. Anyway you can turn this off, activating the enumint flag and then you can use enums and ints types interchangeably.

Memory Management

About half the bugs in typical C programs can be attributed to memory management problems.

Any C programmer who has already made a reasonable size program, has been confronted with unexpected errors related with memory management problems. This happens basically because C is a low level language so, does not have any system like Java’s garbage collector. All the memory management have to be done by the programmer, so errors may append. Some only appear sporadically, and some may only be apparent when compiled on a different platform. Splint is able to detect many memory management errors at compile time, like: using storage that may have been deallocated, memory leaks or returning a pointer to stack-allocated storage. This is possible because Splint makes a memory model when is working on the code, anyway I will not talk about this subject.

Splint can see and report deallocating storage when there are other live references to the same storage and failing to deallocate storage before the last reference to it is lost. The solution here is simple, Splint makes an obligation to release storage and attach this obligation to the reference to which the storage is assigned.

For references not shared Splint uses the only annotation to indicate that a reference is the only pointer to the object it points to, here we also use the null annotation to say that the output of malloc could be NULL:

/* @only@ */ /* @null@ */ void * malloc ( size_t size );

Here is a simple example of C annotated code with memory leaks and use after free variables:

1 extern /*@only@*/ int *glob;
/*@only@*/ int *f (/*@only@*/ int *x, int *y, int *z) /*@globals glob;@*/ {
8    int *m = (int *)
9    malloc (sizeof (int));
11   glob = y;   //Memory leak
12   free (x);
13   *m = *x;   //Use after free
14    return z;   //Memory leak detected
}

Here is the Splint output for the file above:

> splint only.c
only.c:11: Only storage glob (type int *) not released before assignment: glob = y
only.c:1: Storage glob becomes only
only.c:11: Implicitly temp storage y assigned to only: glob = y
only.c:13: Dereference of possibly null pointer m: *m
only.c:8: Storage m may become null
only.c:13: Variable x used after being released
only.c:12: Storage x released
only.c:14: Implicitly temp storage z returned as only: z
only.c:14: Fresh storage m not released before return
only.c:9: Fresh storage m allocated

Splint errors and warnings are very human readable, so you just have to read them to understand.
The first warning says that variable glob was not released (we say, in line 1, that it is only) before the assignment. In line 13 we have a dereference of possibly null pointer, because in line 12 we have done free to x variable, an now we want to use it’s value. As you can see, is very easy to understand the output of Splint.

Splint can also make verifications in stack based references. A memory error occurs if a pointer into stack is live after the function returns. Splint detects errors involving stack references exported from a function through return values or assignments to references reachable from global variables or actual parameters.
No annotations are needed to detect stack reference errors. It is clear from declarations if storage is allocated on the function stack.

Here is an example of stack-allocated storage:

int *glob;

int *f (int **x) {
    int sa[2] = { 0, 1 };
    int loc = 3;
9   glob = &loc;
10  *x = &sa[0];
12  return &loc;
}

And here is the Splint output:

> splint stack.c
stack.c:12: Stack-allocated storage &loc reachable from return value: &loc
stack.c:12: Stack-allocated storage *x reachable from parameter x
stack.c:10: Storage *x becomes stack
stack.c:12: Stack-allocated storage glob reachable from global glob
stack.c:9: Storage glob becomes stack

Control Flow

In order to avoid certain errors, Splint have to understand the control flow of the program, so Splint do some checks related to control flow. Many of these checks are possible because of the extra information that is known in annotations. Without this additional information Splint assumes that all functions return and execution continues normally.

noreturn annotation is used to denote a function that never returns.

extern /* @noreturn@ */ void fatalerror ( char *s);

We also have maynoreturn and alwaysreturns annotations, but Splint must assume that a function returns normally when checking the code and doesn’t verify if a function really returns.

To describe non-returning functions the noreturnwhentrue and noreturnwhenfalse mean that a function never returns if the first argument is true or false.

/* @noreturnwhenfalse@ */ void assert (/* @sef@ */ bool /* @alt int@ */ pred );

The sef annotation denotes a parameter as side effect free, and the alt int indicate that it may be either a Boolean or an integer.

Undefined Behavior

The order which side effect take place in C is not entirely defined by the code.

A sequence point is some point in the C code where it is guaranteed that all side effects of previous evaluations have been performed.
An example of sequence points is:

  • a function call (after the arguments have been evaluated)
  • at the end of a if, while, for or do statement
  • a &&, || and ?

Here is a simple C program that have undefined behavior.

extern int glob;
extern int mystery (void);
extern int modglob (void) /*@globals glob@*/ /*@modifies glob@*/;

int f (int x, int y[]) {
11    int i = x++ * x;
13    y[i] = i++;
14    i += modglob() * glob;
15    i += mystery() * glob;
16    return i;
}

And here is the Splint output:

> splint order.c
order.c:11: Expression has undefined behavior (value of right operand modified by left operand): x++ * x
order.c:13: Expression has undefined behavior (left operand uses i, modified by right operand): y[i] = i++
order.c:14: Expression has undefined behavior (value of right operand modified by left operand): modglob() * glob
order.c:15: Expression has undefined behavior (unconstrained function mystery used in left operand may set global variable glob used in right operand): mystery() * glob

Regard control flow, Splint has more options to check C code. I will not talk about all of them here.

Buffer Sizes

Buffer overflow errors are a particularly dangerous type of bug in C, they are responsible for half of all security attacks. This is happens because C does not perform runtime bound checking (for performance reasons), and so attackers can exploit program bugs to, for example, gain full access to a machine.

Splint models blocks of memory using two properties:
maxSet, maxSet(b) denotes the highest address beyond b that can be safely used as lvalue, for instance:

char buffer[MAXSIZE] we have maxSet(buffer) = MAXSIZE - 1

and maxRead, maxRead(b) denotes the highest index of a buffer that can be safely used as rvalue.

When a buffer is accessed as an lvalue, Splint generates a precondition constraint involving the maxSet property and when a buffer is accessed as an rvalue, Splint generates a precondition constraint involving the maxRead property.

We can use this two properties in buffer sizes annotations, this annotations uses the clause requires and ensures just like JML, VDM and Frama-C. When a function with requires clause is called, the call site must be checked to satisfy the constraints implied by requires.

Here is an example of its use:

void /* @alt char * @*/ strcpy
(/* @out@ */ /* @returned@ */ char *s1 , char *s2)
/* @modifies * s1@ */
/* @requires maxSet (s1) >= maxRead (s2) @*/
/* @ensures maxRead (s1) == maxRead (s2) @*/;

This is a possible annotation for strcpy standard library C function. We say that s1 is the return value of the function (returned), and that the pointer to s1 is the only one (only). We also say that this function modifies s1 and we specify the pre and post conditions.

Splint can also do bound checking. This is more complex than other checks done by Splint, so, memory bound warnings contain extensive information about the unresolved constraint as you can see in this example:

int buf [10];
buf [10] = 3;
setChar .c :5:4: Likely out -of - bounds store : buf [10]
Unable to resolve constraint : requires 9 >= 10
needed to satisfy precondition : requires maxSet ( buf @ setChar .c :5:4) >= 10

The ultimate test: wu-ftpd

wu-ftpd version 2.5.0 was about 20.000 lines of code and took less than four seconds to check all of wu-ftpd on a 1.2-GHz Athlon machine. Splint detected the known flaws as well as finding some previously unknown flaws (!)

One of the problems of these automatic static analysis tools is that it can produce false problems.
Running Splint on wu-ftpd without adding annotations produced 166 warnings for potential out-of-bounds writes. After adding 66 annotations, it produced 101 warnings: 25 of these indicated real problems and 76 were false.

Pros and Cons

Pros:

  • Lightweight static analysis detects software vulnerabilities
  • Splint definitely improves code quality
  • Suitable for real programs…

Cons:

  • …although it produces more warning messages that lead to confusion
  • It won’t eliminate all security risks
  • Hasn’t been developed since 2007, they need new volunteers

Conclusions

No tool will eliminate all security risks, lightweight static analysis tools (Splint) play an important role in identifying security vulnerabilities.

Here is the presentation about that subject:

Outroduction

All the examples provide here is from the Splint manual.