Lex/yacc

31 07 2008

Intro

I have wanted to write about this for a long time. I finally have the time to write about syntactic analyzer generators.

When I was studying yacc I noticed that there was not much information available, which is not what one would expect of a tool that is about 40 years old. Unfortunately there was no quick tutorial for someone who already knew the theory behind this kind of tool. As I said, yacc is a very old tool, created when compilers were needed and before the whole theory of the area that we know today existed. So it is normal for small errors to turn into disasters that take a long time to resolve.

This post is for those who already know what grammars, lexical analyzers, etc. are but have never touched Lex/yacc, and for those who are stuck with these tools (I hope).

In this post I will explain Lex/yacc using an example program that I wrote. I will not bore you with massive amounts of code (like I usually do), because Lex/yacc is a simple tool to use. With just 60 lines of explained code you will see how great it is.

The problem

My idea was to build a simple site with information about all the cities in Portugal. Then it occurred to me to do the same for all the parishes (Freguesia in Portuguese), counties (Concelho in Portuguese) and districts (Distrito in Portuguese) of Portugal, and the relationships between them. A district has many counties and a county has many parishes.

Simple! What I needed now was the names of all the parishes, counties and districts of Portugal. I eventually found three files with all this information and more on the website of the Portuguese post office company. Better yet, these files contained the relationships between the divisions. I had to write a small program in C that takes those three files and creates a new file, which I then use with Lex/yacc.

To get information on each place in Portugal I decided to use Wikipedia. This is the Achilles heel of the program, because unfortunately the Wikipedia entries on Portuguese parishes are not uniform and many parishes have no entry at all. Even so, I was able to get good results.

The structure of the new file generated is:

District1>WikipediaEntry_for_District1
        Parish1_of_Countie1>WikipediaEntry_for_Parish1_of_Countie1,
        Parish2_of_Countie1>WikipediaEntry_for_Parish2_of_Countie1,
        Parish3_of_Countie1>WikipediaEntry_for_Parish3_of_Countie1,
        .
        .
        .
        ParishN_of_Countie1>WikipediaEntry_for_ParishN_of_Countie1
        !Countie1>WikipediaEntry_for_Countie1;
        Parish1_of_Countie2>WikipediaEntry_for_Parish1_of_Countie2,
        Parish2_of_Countie2>WikipediaEntry_for_Parish2_of_Countie2,
        Parish3_of_Countie2>WikipediaEntry_for_Parish3_of_Countie2,
        .
        .
        .
        ParishN_of_Countie2>WikipediaEntry_for_ParishN_of_Countie2
        !Countie2>WikipediaEntry_for_Countie2;
                .
                .
                .
        !CountieN>WikipediaEntry_for_CountieN|

District2>WikipediaEntry_for_District2
        Parish1_of_Countie1>WikipediaEntry_for_Parish1_of_Countie1,
        Parish2_of_Countie1>WikipediaEntry_for_Parish2_of_Countie1,
        Parish3_of_Countie1>WikipediaEntry_for_Parish3_of_Countie1,
        .
        .
        .
        ParishN_of_Countie1>WikipediaEntry_for_ParishN_of_Countie1
        !Countie1>WikipediaEntry_for_Countie1;
        Parish1_of_Countie2>WikipediaEntry_for_Parish1_of_Countie2,
        Parish2_of_Countie2>WikipediaEntry_for_Parish2_of_Countie2,
        Parish3_of_Countie2>WikipediaEntry_for_Parish3_of_Countie2,
        .
        .
        .
        ParishN_of_Countie2>WikipediaEntry_for_ParishN_of_Countie2
        !Countie2>WikipediaEntry_for_Countie2;
                .
                .
                .
        !CountieN>WikipediaEntry_for_CountieN|
.
.
.

It has about 43000 lines, the number of parishes in Portugal.
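To make the layout concrete, here is a minimal sketch in C of a writer for this format. It is not the original converter: the `Entry` and `CountyRec` types and the `write_district` function are names I am assuming for illustration, and the input is taken from memory instead of the three post office files.

```c
#include <stdio.h>

/* Hypothetical in-memory model: a name and its Wikipedia entry. */
typedef struct { const char *name, *link; } Entry;

/* Hypothetical record: one county and its parishes. */
typedef struct {
    Entry county;
    const Entry *parishes;
    size_t n_parishes;
} CountyRec;

/* Write one district: parishes are separated by ',', counties by ';',
   and the last county of the district is closed by '|'. */
static void write_district(FILE *out, const Entry *district,
                           const CountyRec *counties, size_t n_counties)
{
    fprintf(out, "%s>%s\n", district->name, district->link);
    for (size_t c = 0; c < n_counties; c++) {
        for (size_t p = 0; p < counties[c].n_parishes; p++) {
            const Entry *e = &counties[c].parishes[p];
            /* no ',' after the last parish of a county */
            fprintf(out, "\t%s>%s%s\n", e->name, e->link,
                    p + 1 < counties[c].n_parishes ? "," : "");
        }
        fprintf(out, "\t!%s>%s%c\n", counties[c].county.name,
                counties[c].county.link,
                c + 1 < n_counties ? ';' : '|');
    }
}
```

Calling `write_district(stdout, ...)` once per district would produce a file with the shape shown above.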

Yacc

A yacc file is just like a lex file; it is divided into three parts:

DECLARATIONS
%%
GRAMMAR
%%
FUNCTIONS

I will explain later on what to put in each of those three parts.

The solution

I started by writing the following grammar to describe my new file:

OrGeo     -> Districts

Districts -> District '|'
          | Districts District '|'

District  -> IdD Link Counties

Counties  -> Countie
          | Counties ';' Countie

Countie   -> Parishes '!' IdC Link

Parishes  -> Parish
          | Parishes ',' Parish

Link      -> '>' l

IdD       -> id

IdC       -> id

Parish    -> IdL Link

IdL       -> id

Since id is a name (of a District, Countie or Parish), we declare it in yacc as a pointer to char (vals). To do that we create a union in the first part of the yacc file (DECLARATIONS) and associate its vals member with the tokens id and l:

%union {
        char *vals;
}

%token <vals> id l

Yacc uses a union because you can declare as many associations as you want. In the lex file we refer to that union as yylval.

The %token line is the same as writing:

%token <vals> id
%token <vals> l

Now we go to the lex file and add the rules that the lexical analyzer applies when it finds text:

[ a-zA-ZÀ-ú\-'.()/'`0-9]+  { yylval.vals = strdup(yytext);
                             return id; }
[a-zA-ZÀ-ú\-'.()/'`0-9_:]+ { yylval.vals = strdup(yytext);
                             return l; }

Here we are saying that when lex finds text that matches one of these regular expressions it returns it to yacc as an id or as an l; that is how we recognize the names of our places and their links.
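The copy made in those actions matters because the scanner reuses its text buffer for every token. The sketch below imitates that in plain C, under some assumptions: `scan_text` and `scan_value` are stand-ins I made up for yytext and yylval, `match` plays the role of a lex action, and `copy_text` is a portable equivalent of strdup.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the union that yacc generates from %union. */
typedef union { char *vals; } SemanticValue;

static char scan_text[64];        /* stand-in for yytext (reused buffer) */
static SemanticValue scan_value;  /* stand-in for yylval */

/* Portable equivalent of strdup: heap copy of a C string. */
static char *copy_text(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}

/* Imitates one lex action: the matched text lands in the shared
   buffer, and the action stores its own copy in the union. */
static void match(const char *lexeme)
{
    strncpy(scan_text, lexeme, sizeof scan_text - 1);
    scan_text[sizeof scan_text - 1] = '\0';
    scan_value.vals = copy_text(scan_text);
}
```

After two calls to `match`, the pointer saved from the first call still holds the first name, while the shared buffer only holds the second one. Storing yytext directly, without the copy, would leave every id pointing at whatever text was scanned last.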

As you can see, our grammar has special symbols (!>,;|), so we need to tell lex to return them to yacc, which is where we need them:

[!>,;|]                           { return yytext[0]; }

I will also tell lex to ignore all \n and \t characters:

[\t\n]                            { ; }
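Taken together, the lex rules behave roughly like the hand-written scanner below. This is only an approximation in C, not what lex generates: it is ASCII-only, it treats any run of non-special characters as a name, and it classifies the run as a link when it contains '_' or ':'. The `TOK_ID` and `TOK_L` values are hypothetical; yacc assigns the real codes (all above 255) in the header it generates.

```c
#include <string.h>

/* Hypothetical token codes; single characters are returned as themselves. */
enum { TOK_EOF = 0, TOK_ID = 258, TOK_L = 259 };

/* Read the next token starting at *cursor. For TOK_ID and TOK_L the
   matched text is copied into text (capacity cap, which must be >= 1). */
static int next_token(const char **cursor, char *text, size_t cap)
{
    const char *p = *cursor;

    while (*p == '\t' || *p == '\n')      /* rule [\t\n]: ignore */
        p++;
    if (*p == '\0') {
        *cursor = p;
        return TOK_EOF;
    }
    if (strchr("!>,;|", *p) != NULL) {    /* rule [!>,;|]: return the char */
        *cursor = p + 1;
        return (unsigned char)*p;
    }

    size_t n = 0;
    int is_link = 0;
    while (*p != '\0' && strchr("!>,;|\t\n", *p) == NULL) {
        if (*p == '_' || *p == ':')       /* only the l rule allows these */
            is_link = 1;
        if (n + 1 < cap)
            text[n++] = *p;
        p++;
    }
    text[n] = '\0';
    *cursor = p;
    return is_link ? TOK_L : TOK_ID;
}
```

Returning the punctuation characters as their own character codes is what lets the grammar refer to them directly as '|', ';', ',', '!' and '>'.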

Making our grammar powerful

Now our grammar needs some adjustments: we will tell yacc what to do when it recognizes a given derivation of a rule:

OrGeo     : Districts { d = $1; }
          ;
Districts : District '|' { $$ = $1; }
          | Districts District '|' { $$ = catDistricts($1,$2); }
          ;
District  : IdD Link Counties { $$ = addDistrict($1, $2, $3); }
          ;
Counties  : Countie { $$ = $1; }
          | Counties ';' Countie { $$ = catCounties($1, $3); }
          ;
Countie   : Parishes '!' IdC Link { $$ = addCountie($1, $3, $4); }
          ;
Parishes  : Parish { $$ = $1; }
          | Parishes ',' Parish { $$ = catParishes($1, $3); }
          ;
Link      : '>' l { $$ = $2; }
          ;
IdD       : id { $$ = $1; }
          ;
IdC       : id { $$ = $1; }
          ;
Parish    : IdL Link { $$ = addParish($1, $2); }
          ;
IdL       : id { $$ = $1; }
          ;

Here we tell yacc how to behave when it recognizes a given derivation of a rule.
We can also tell yacc which data type each rule returns, for example:

%{
#include <stdio.h>   /* assumed header */
#include <string.h>  /* assumed header */
#include "district.h"

District *d;
%}

%union {
        char *vals;
        Parish *valf;
        Countie *valc;
        District *vald;
}

%type <vals> Link IdL IdD IdC
%type <valf> Parish Parishes
%type <valc> Countie Counties
%type <vald> District Districts

To return something from a rule we refer to the rule's own value as $$, which means IdL in

IdL       : id

and we use $1 to refer to that id, $2 for the second symbol on the right-hand side, and so on. So when we write $$ = $1 we mean IdL = id.
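One way to picture $$ and $1 is as slots on a stack of values that the parser keeps while it reads tokens. The sketch below is a simplified model in C, not yacc's actual implementation (a real parser also keeps a stack of states); `ValueStack`, `push` and `reduce_copy` are names I am assuming for illustration.

```c
#include <stddef.h>

typedef union { char *vals; } Value;   /* mirrors the %union member used here */

typedef struct {
    Value items[16];
    size_t top;                         /* number of values on the stack */
} ValueStack;

/* "Shift": a token's value is pushed when the token is read. */
static void push(ValueStack *s, Value v)
{
    s->items[s->top++] = v;
}

/* "Reduce" a rule whose right-hand side has len symbols, with an action
   of the form $$ = $pick (pick is 1-based, so pick = 2 means $$ = $2). */
static void reduce_copy(ValueStack *s, size_t len, size_t pick)
{
    Value result = s->items[s->top - len + (pick - 1)];  /* read $pick */
    s->top -= len;                                       /* pop the right-hand side */
    push(s, result);                                     /* $$ takes their place */
}
```

For `Link : '>' l { $$ = $2; }` the stack holds the values of '>' and l; `reduce_copy(&s, 2, 2)` replaces both with the value of l, which is now the value of Link. For `IdL : id { $$ = $1; }` the call would be `reduce_copy(&s, 1, 1)`.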

The functions catDistricts, addDistrict, addParish, addCountie, catCounties and catParishes just create linked lists and append new elements to an existing linked list.
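As an illustration, the parish pair of those functions could look like the sketch below. The real definitions live in district.h, which is not shown in this post, so the layout of `Parish` and the bodies of `addParish` and `catParishes` here are assumptions of mine; only the names and the argument order come from the grammar actions.

```c
#include <stdlib.h>

/* Assumed layout of a list node; the real one is in district.h. */
typedef struct Parish {
    char *name;
    char *link;
    struct Parish *next;
} Parish;

/* Create a one-element list from a name and its Wikipedia entry. */
static Parish *addParish(char *name, char *link)
{
    Parish *p = malloc(sizeof *p);
    if (p == NULL)
        return NULL;
    p->name = name;
    p->link = link;
    p->next = NULL;
    return p;
}

/* Append item at the end of list and return the head of the list. */
static Parish *catParishes(Parish *list, Parish *item)
{
    if (list == NULL)
        return item;
    Parish *tail = list;
    while (tail->next != NULL)   /* walk to the last node */
        tail = tail->next;
    tail->next = item;
    return list;
}
```

Walking to the tail on every append is quadratic in the number of parishes of a county; keeping a tail pointer would avoid that, but for lists of this size the simple version is probably enough.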

The result is an HTML page with nested linked lists; here is the result.

Notes

All the code used with yacc, including the code that generates the new file, is available here; feel free to use it.

