Lex/yacc

31 07 2008

Intro

I have wanted to write about this for a long time. I finally have the time to write about parser generators (generators of syntactic analyzers).

When I was studying yacc I noticed that there was a lot of information available, more than one would expect for a tool dating from the 1970s. Unfortunately there was no quick introduction for someone who already knew the theory behind this kind of tool. As I said, yacc is a very old tool, made at a time when compilers were needed and before the theory of the field was as developed as it is today. So it is normal for small errors to turn into disasters that take a long time to resolve.

This post is for those who already know what grammars, lexical analyzers, etc. are, but have never touched Lex/yacc, and for those who are stuck with these tools (I hope).

In this post I will explain Lex/yacc using an example program that I wrote. I will not bore you with massive amounts of code (as I usually do), because Lex/yacc is a simple tool to use. With just 60 lines of explained code you will see how great it is.

The problem

My idea was to build a simple site with all the information about all the cities in Portugal. Afterwards it occurred to me to do the same for all parishes (freguesia in Portuguese), counties (concelho in Portuguese) and districts (distrito in Portuguese) of Portugal and their relationships. A district has many counties and a county has many parishes.

Simple! What I needed now was the names of all parishes, counties and districts of Portugal. I eventually found three files with all this information and more on the website of the Portuguese postal company. Better yet, these files contained the relationships between the divisions. I had to write a small program in C that takes those three files and creates a new file, which I then use as input to Lex/yacc.

In order to have all the available information on each place in Portugal I decided to use Wikipedia. This is the Achilles heel of the program, because unfortunately the Wikipedia entries on Portuguese parishes are not uniform and many parishes have no entry at all. But I was still able to get good results.

The structure of the new file generated is:

District1>WikipediaEntry_for_District1
        Parish1_of_Countie1>WikipediaEntry_for_Parish1_of_Countie1,
        Parish2_of_Countie1>WikipediaEntry_for_Parish2_of_Countie1,
        Parish3_of_Countie1>WikipediaEntry_for_Parish3_of_Countie1,
        .
        .
        .
        ParishN_of_Countie1>WikipediaEntry_for_ParishN_of_Countie1
        !Countie1>WikipediaEntry_for_Countie1;
        Parish1_of_Countie2>WikipediaEntry_for_Parish1_of_Countie2,
        Parish2_of_Countie2>WikipediaEntry_for_Parish2_of_Countie2,
        Parish3_of_Countie2>WikipediaEntry_for_Parish3_of_Countie2,
        .
        .
        .
        ParishN_of_Countie2>WikipediaEntry_for_ParishN_of_Countie2
        !Countie2>WikipediaEntry_for_Countie2;
                .
                .
                .
        !CountieN>WikipediaEntry_for_CountieN|

District2>WikipediaEntry_for_District2
        Parish1_of_Countie1>WikipediaEntry_for_Parish1_of_Countie1,
        Parish2_of_Countie1>WikipediaEntry_for_Parish2_of_Countie1,
        Parish3_of_Countie1>WikipediaEntry_for_Parish3_of_Countie1,
        .
        .
        .
        ParishN_of_Countie1>WikipediaEntry_for_ParishN_of_Countie1
        !Countie1>WikipediaEntry_for_Countie1;
        Parish1_of_Countie2>WikipediaEntry_for_Parish1_of_Countie2,
        Parish2_of_Countie2>WikipediaEntry_for_Parish2_of_Countie2,
        Parish3_of_Countie2>WikipediaEntry_for_Parish3_of_Countie2,
        .
        .
        .
        ParishN_of_Countie2>WikipediaEntry_for_ParishN_of_Countie2
        !Countie2>WikipediaEntry_for_Countie2;
                .
                .
                .
        !CountieN>WikipediaEntry_for_CountieN|
.
.
.

It has about 43000 lines, the number of parishes in Portugal.
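To make the format concrete, here is a rough sketch of how a C program could emit one county block. The generator I actually used is not shown in this post, so `Entry` and `write_county` are hypothetical names invented for the illustration, and the tab indentation is an assumption too.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record: a place name plus its Wikipedia entry. */
typedef struct {
    const char *name;
    const char *wiki;
} Entry;

/*
 * Write one county block into buf: its parishes separated by ',' and then
 * the "!County>Entry" line. When `last` is non-zero the block ends with '|'
 * (it closes the district), otherwise with ';' (another county follows).
 * Returns the number of characters written, or -1 if buf is too small.
 */
int write_county(char *buf, size_t cap, const Entry *county,
                 const Entry *parishes, size_t n, int last)
{
    size_t used = 0;
    assert(buf != NULL && county != NULL && parishes != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Every parish but the last one is followed by a comma. */
        int w = snprintf(buf + used, cap - used, "\t%s>%s%s\n",
                         parishes[i].name, parishes[i].wiki,
                         i + 1 < n ? "," : "");
        if (w < 0 || (size_t)w >= cap - used)
            return -1;
        used += (size_t)w;
    }
    int w = snprintf(buf + used, cap - used, "\t!%s>%s%c\n",
                     county->name, county->wiki, last ? '|' : ';');
    if (w < 0 || (size_t)w >= cap - used)
        return -1;
    return (int)(used + (size_t)w);
}
```

Calling it once per county, after printing the `District>Entry` line, would produce a file with the structure shown above.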

Yacc

A yacc file is just like a lex file: it is divided into three parts:

DECLARATIONS
%%
GRAMMAR
%%
FUNCTIONS

I will explain later on what goes in each of these three parts.

The solution

I started by writing the following grammar to describe my new file:

OrGeo     -> Districts

Districts -> District '|'
          | Districts District '|'

District  -> IdD Link Counties

Counties -> Countie
          | Counties ';' Countie

Countie  -> Parishes '!' IdC Link

Parishes    -> Parish
          | Parishes ',' Parish

Link      -> '>' l

IdD       -> id

IdC       -> id

Parish     -> IdL Link

IdL       -> id

As id is a name (of a District, Countie or Parish), we declare it in yacc as a pointer to chars (vals). To do that we create a union in the first part of the yacc file (DECLARATIONS) and associate vals with id and l:

%union {
        char *vals;
}

%token <vals> id l

Yacc uses a union because it lets you declare as many associations as you want. In the lex file we refer to that union as yylval.

The %token line is the same as writing:

%token <vals> id
%token <vals> l

We now go to the lex file and add the rules that the lexical analyzer will match when it finds text.

[ a-zA-ZÀ-ú-'.()/'`0-9]+  { yylval.vals = strdup(yytext);
                                    return id; }
[a-zA-ZÀ-ú-'.()/'`0-9_:]+ { yylval.vals = strdup(yytext);
                                    return l; }

Here we are saying that when lex finds text that matches one of these regular expressions, it returns it to yacc as an id or as an l. That is how we find the names of our places and their links.

As you can see, we have special symbols in our grammar (!>,;|), so we need to tell lex to return them to yacc, where we need them:

[!>,;|]                           { return yytext[0]; }

I will also tell lex to ignore all \n and \t characters:

[\t\n]                            { ; }
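If you have never seen what lex generates from rules like these, the hand-written C function below approximates the same behaviour. It is only a sketch under simplifying assumptions: `next_token`, `TOK_TEXT` and `TOK_END` are names made up for this illustration, and unlike the real rules it does not tell id and l apart, it just returns any run of ordinary characters as text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Punctuation is returned as its own character code, like yytext[0]. */
enum { TOK_END = 0, TOK_TEXT = 256 };

/*
 * Skip tabs and newlines, return one of ! > , ; | as itself, and copy any
 * other run of characters into text (truncated to cap - 1 characters).
 * *cursor is advanced past the token that was read.
 */
int next_token(const char **cursor, char *text, size_t cap)
{
    const char *p = *cursor;
    size_t len = 0;
    assert(cap > 0);
    while (*p == '\t' || *p == '\n')       /* the "ignore" rule */
        p++;
    if (*p == '\0') {
        *cursor = p;
        return TOK_END;
    }
    if (strchr("!>,;|", *p) != NULL) {     /* the punctuation rule */
        *cursor = p + 1;
        return (unsigned char)*p;
    }
    while (*p != '\0' && strchr("!>,;|\t\n", *p) == NULL) {
        if (len + 1 < cap)
            text[len++] = *p;
        p++;
    }
    text[len] = '\0';
    *cursor = p;
    return TOK_TEXT;
}
```

The text buffer plays the role of yylval.vals here; the real lexer hands yacc a strdup'ed copy instead.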

Making our grammar powerful

Now our grammar will undergo some adjustments: we will tell yacc what to do when it recognizes each of the alternatives of a rule:

OrGeo     : Districts { d = $1; }
          ;
Districts : District '|' { $$ = $1; }
          | Districts District '|' { $$ = catDistricts($1,$2); }
          ;
District  : IdD Link Counties { $$ = addDistrict($1, $2, $3); }
          ;
Counties : Countie { $$ = $1; }
          | Counties ';' Countie { $$ = catCounties($1, $3); }
          ;
Countie  : Parishes '!' IdC Link { $$ = addCountie($1, $3, $4); }
          ;
Parishes    : Parish { $$ = $1; }
          | Parishes ',' Parish { $$ = catParishes($1, $3); }
          ;
Link      : '>' l { $$ = $2; }
          ;
IdD       : id { $$ = $1; }
          ;
IdC       : id { $$ = $1; }
          ;
Parish     : IdL Link { $$ = addParish($1, $2); }
          ;
IdL       : id { $$ = $1; }
          ;

Here we tell yacc how to behave when it goes through a given derivation of a rule.
We can also tell yacc that a rule returns a value of a given type, for example:

%{
#include <stdio.h>
#include <string.h>
#include "district.h"

District *d;
%}

%union {
        char *vals;
        Parish *valf;
        Countie *valc;
        District *vald;
}

%type <vals> Link IdL IdD IdC
%type <valf> Parish Parishes
%type <valc> Countie Counties
%type <vald> District Districts

To return something from a rule we assign to $$, which stands for IdL in

IdL       : id

and we use $1 to refer to that id, $2 to the second symbol on the right-hand side, and so on. So $$ = $1 means IdL = id.

The functions catDistricts, addDistrict, addParish, addCountie, catCounties and catParishes are just functions that create linked lists and append new elements to an already existing linked list.
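As a rough idea of what two of these helpers might look like, here is a minimal sketch for the parish list. The real declarations live in district.h, which is not reproduced in this post, so the `Parish` layout below is an assumption and `freeParishes` is an extra helper added only for the illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed layout; the real one is in district.h. */
typedef struct Parish {
    char *name;
    char *link;
    struct Parish *next;
} Parish;

/* Build a one-element list. Takes ownership of the two heap strings. */
Parish *addParish(char *name, char *link)
{
    Parish *p = malloc(sizeof *p);
    assert(p != NULL);
    p->name = name;
    p->link = link;
    p->next = NULL;
    return p;
}

/* Append `tail` to the end of `list` and return the head of the result. */
Parish *catParishes(Parish *list, Parish *tail)
{
    Parish *last = list;
    if (list == NULL)
        return tail;
    while (last->next != NULL)
        last = last->next;
    last->next = tail;
    return list;
}

/* Release a whole list, including the strings it owns. */
void freeParishes(Parish *list)
{
    while (list != NULL) {
        Parish *next = list->next;
        free(list->name);
        free(list->link);
        free(list);
        list = next;
    }
}
```

With helpers like these, the action `$$ = catParishes($1, $3)` grows the list by one parish per reduction. Walking to the end on every append is quadratic in the length of the list, which should be acceptable for the parishes of a single county; keeping a tail pointer would avoid it.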

The result is an HTML page with nested linked lists; here is the result.

Notes

All the code, including the code that generates the new file fed to yacc, is available here; feel free to use it.




Type inference

30 07 2008

Intro

Type inference is the ability of a programming language to deduce the data types of all functions of a program. It is a feature present in many strongly typed languages, such as Haskell, where writing the signatures of functions is not mandatory. This is great because it speeds up writing code, and it adds safety: if the inference algorithm fails, it means we have a type error in our code. All of this happens at compile time.

As I said in the previous post, one of the upgrades to the Pointfree calculator is type inference. After reading some scientific articles about the Damas-Milner algorithm, also known as algorithm W, I began to imagine a way to implement it in Java, which is the implementation language of the Pointfree calculator. I started doing some sketches on paper and, after talking with professor José Nuno Oliveira, I realized that the algorithm isn’t that hard.

Reminders

Definition of (either in Haskell):

NOTE: + in types means Either in Haskell



Definition of :

Type signature of Left and Right:


Talking same language

I will use a different notation to represent the domain and codomain of functions in order to help the explanation of the algorithm.

For function we have the type:

I will write that as:

Remember the definition of , which receives two functions, f and g. Because the notation is pointfree, we also write the domain and codomain of the function in front of it, as we do for f and g.
In fact, the type of is represented as:

I will also use the symbol , to say that type a unifies with type b, which means, informally, that .
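Unification is the step that the rest of the walkthrough leans on, so here is a small sketch of it. It assumes a tiny type language with only type variables, Either and function arrows; the `Type` struct and the names `resolve`, `occurs` and `unify` are invented for this illustration, and it is written in C only as an example (the calculator itself is written in Java). It is plain first-order unification, not the full W algorithm.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { T_VAR, T_EITHER, T_FUN } Kind;

typedef struct Type {
    Kind kind;
    struct Type *left, *right;  /* children of T_EITHER and T_FUN */
    struct Type *bound;         /* for T_VAR: the type it was unified with */
} Type;

/* Follow variable bindings until an unbound variable or a constructor. */
Type *resolve(Type *t)
{
    assert(t != NULL);
    while (t->kind == T_VAR && t->bound != NULL)
        t = t->bound;
    return t;
}

/* Occurs check: does the unbound variable v appear inside t? */
int occurs(Type *v, Type *t)
{
    t = resolve(t);
    if (t == v)
        return 1;
    if (t->kind == T_VAR)
        return 0;
    return occurs(v, t->left) || occurs(v, t->right);
}

/* Unify a and b by binding variables; 1 on success, 0 on a type error. */
int unify(Type *a, Type *b)
{
    a = resolve(a);
    b = resolve(b);
    if (a == b)
        return 1;
    if (a->kind == T_VAR) {
        if (occurs(a, b))
            return 0;           /* would build an infinite type */
        a->bound = b;
        return 1;
    }
    if (b->kind == T_VAR)
        return unify(b, a);
    if (a->kind != b->kind)
        return 0;               /* e.g. Either against a function type */
    return unify(a->left, b->left) && unify(a->right, b->right);
}
```

Unifying two trees with the same constructor unifies their children pairwise, which is the "both type trees have the same structure" situation. Note that a failed call may leave some variables already bound; a real implementation would need to undo them or work on a copy.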

Let’s infer!

I will explain the algorithm to infer the type of function f:

The first step of the algorithm is to assign polymorphic types to all functions, so I will call the first type and the last

Because has type , we conclude ;
Also, because has type , we can conclude ;
The same goes for , which has type , so we can conclude and , and we have:

By the definition of : , the domain of f is equal to the codomain of g, so we can conclude . Since we are replacing a type that is used in the codomain of the first Right, we must also conclude , so:

As I explained before, the function has the following type: , so:
and ;
Because has the type: , we get and :

Because the definition of is , we need the same codomain in both functions, so we conclude . As both type trees have the same structure, we can conclude even more: , so:

And now we have the function, just with the needed types to simplify:

.

Now we just need to unify: and ,

.

We infer the type for function , .
Or, if you prefer, in Haskell:

f :: Either (Either a b) c -> Either a (Either b c)




Pointfree Calculator

29 07 2008

Intro

This semester I had to do this project with my friend, João Moura, under the supervision of professor Alcino Cunha.

The project was to write a program that automates the process of proving an equality of functions written in the pointfree style. Not completely automatic, but interactive. The program loads a file with rules; the user enters an equality of functions in pointfree style and can navigate through its terms to find a sub-expression to which they want to apply a particular rule. On the side, all the rules that can be applied to the selected sub-expression are shown.

What is pointfree?

Pointfree is a style of programming that helps us to compose functions. Even more, it helps us later to prove equality between functions.

How can someone prove equality between functions? We can do so, because we have rules. Let’s see an analogy with algebraic calculus.







We have proved that .

Pointfree calculus as it is done now, on paper, is boring and repetitive. The proof is this image of someone trying to work through one.

 

Abstract representation

Let swap be a function that exchanges the elements of a pair:

swap (a,b) = (b,a) --in pointwise
swap = split snd fst --in pointfree
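For readers who do not know Haskell, the two definitions can be mimicked in C. This is only a sketch: `Pair`, `Proj`, `swap_pointwise` and `swap_pointfree` are names made up for the illustration, pairs are restricted to ints, and since C has no function values built by composition the argument still has to be passed explicitly. It only shows that split snd fst computes the same pair as the pointwise definition.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int first, second; } Pair;
typedef int (*Proj)(Pair);

int fst(Pair p) { return p.first; }
int snd(Pair p) { return p.second; }

/* split f g x = (f x, g x): apply both functions to the same argument. */
Pair split(Proj f, Proj g, Pair x)
{
    assert(f != NULL && g != NULL);
    Pair r = { f(x), g(x) };
    return r;
}

/* Pointwise: the components of the pair are named and rearranged. */
Pair swap_pointwise(Pair p)
{
    Pair r = { p.second, p.first };
    return r;
}

/* Pointfree flavour: swap is built only from split, snd and fst. */
Pair swap_pointfree(Pair p)
{
    return split(snd, fst, p);
}
```

Both functions return the same pair for any input, for example (2, 1) for the pair (1, 2).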

And we want to prove that: . This is clearly obvious, but I will use this example just to show you the representation that we use to see these proofs.

So, here we have the representation of :

If you do an in-order traversal of the tree, it makes sense.

As you can see, the operator has arity n. Because we want this operator to stay flat, we never want to use the law:
.

The function f is a variable, while swap is a constant whose definition we know.

Imagine now that we load the following file of rules:





Path

In order to know which sub-expression of a term is selected, we have the Path, which behaves as follows:

If we select 4 in the following expression , the Path becomes: .
If we select 2, the Path becomes: .

The list in front of the Path selects a range of elements in an operator with arity n. Thus, if we want to select in , the Path becomes: .

Proof of

This process is pretty much always the same, but I suggest that you follow it with some attention.

In this first tree I will show the Path already selected.

Here I will show the rule that will apply in our expression, and the instantiation of that rule to our selected sub-expression.

  • Selected function:
  • Path:
  • Rule:
  • Instantiation of the rule:

So, the resulting tree will be:

Resulting function: .

Proof of

  • Selected function:
  • Path:
  • Rule:
  • Instantiation of the rule:

So, the resulting tree will be:

Resulting function: .

Proof of

  • Selected function:
  • Path:
  • Rule:
  • Instantiation of the rule:

So, the resulting tree will be:

Resulting function: .

Proof of

  • Selected function:
  • Path:
  • Rule:
  • Instantiation of the rule:

So, the resulting tree will be:

Resulting function: .

Proof of

  • Selected function:
  • Path:
  • Rule:
  • Instantiation of the rule:

So, the resulting tree will be:

Resulting function: .

Proof of

  • Selected function:
  • Path:
  • Rule:
  • Instantiation of the rule:

So, the resulting tree will be:

Resulting function: .

Proof of

  • Selected function:
  • Path:
  • Rule:
  • Instantiation of the rule:

So, the resulting tree will be:

Resulting function: .

Proof of

  • Selected function:
  • Path:
  • Rule:
  • Instantiation of the rule:

So, the resulting tree will be:

Resulting function: .

From the rule of equality that we have is true and therefore it is true

Presentation

Time to show the software interface.

The program’s interface is divided into 3 parts:

Input

Here the user can enter what they want to prove, for example:



and so on…

When you hit enter, you start having fun calculating 🙂

Rules that may apply

Here you can navigate the rules by pressing PgUp or PgDn, and press Enter to apply the selected rule to the selected sub-expression.

Proof state

Here you can browse the expression, imagining it as the trees shown earlier.

  • up/down – to navigate between levels
  • left/right – to navigate between children
  • backspace – to undo a step in the proof

We can also save proofs to XML and load them back, and also save proofs to PDF.

Conclusion

João and I are still improving the pointfree calculator, and next year it will be used in one course here at Universidade do Minho.
In the next stage we will implement type inference in the pointfree calculator, to make it more powerful. A lot of things have to be done, and next month we will start on them, including making the first public release of the software.

If you understand Portuguese and want to see the presentation that we gave at the Department of Informatics of the University of Minho: