Lex/yacc

31 07 2008

Intro

I have wanted to write about this for a long time. I finally have the time to write about parser generators.

When I was studying yacc I noticed that there was a lot of information available, which is not something one would expect of a 40-year-old tool. Unfortunately there was no quick introduction for someone who already knew all the theory behind this kind of tool. As I said, yacc is a very old tool, made back when compilers were first needed, before all the theory of the area that we know today existed. So it is normal that small errors turned into disasters that could take a long time to resolve.

This post is for those who already know what grammars, lexical analyzers and so on are, but who have never touched Lex/yacc, or for those who are stuck with these things (I hope).

In this post I will explain Lex/yacc to you, with an example from a program that I wrote. I will not bore you with massive amounts of code (like I usually do), because Lex/yacc is a simple tool to use. With just 60 lines of explained code you will see how great it is.

The problem

My idea was to build a simple site with all the information about all the cities in Portugal. Later I decided to do the same not just for cities but for all parishes (Freguesia in Portuguese), counties (Concelho in Portuguese) and districts (Distrito in Portuguese) of Portugal, and their relationships. A district has many counties and a county has many parishes.

Simple! What I needed now were the names of all the parishes, counties and districts of Portugal. I eventually found three files with all this information, and more, on the website of the Portuguese postal company. Better yet, these files had the relationship between each division. I wrote a small program in C that takes those three files and creates a new file, which is the one I use with Lex/yacc.

In order to have all the available information on each locality of Portugal I decided to use Wikipedia. This is the Achilles heel of the program, because unfortunately the Wikipedia entries on the Portuguese parishes are not uniform and many of the parishes have no entry at all. But I was still able to get good results.

The structure of the new file generated is:

District1>WikipediaEntry_for_District1
        Parish1_of_Countie1>WikipediaEntry_for_Parish1_of_Countie1,
        Parish2_of_Countie1>WikipediaEntry_for_Parish2_of_Countie1,
        Parish3_of_Countie1>WikipediaEntry_for_Parish3_of_Countie1,
        .
        .
        .
        ParishN_of_Countie1>WikipediaEntry_for_ParishN_of_Countie1
        !Countie1>WikipediaEntry_for_Countie1;
        Parish1_of_Countie2>WikipediaEntry_for_Parish1_of_Countie2,
        Parish2_of_Countie2>WikipediaEntry_for_Parish2_of_Countie2,
        Parish3_of_Countie2>WikipediaEntry_for_Parish3_of_Countie2,
        .
        .
        .
        ParishN_of_Countie2>WikipediaEntry_for_ParishN_of_Countie2
        !Countie2>WikipediaEntry_for_Countie2;
                .
                .
                .
        !CountieN>WikipediaEntry_for_CountieN|

District2>WikipediaEntry_for_District2
        Parish1_of_Countie1>WikipediaEntry_for_Parish1_of_Countie1,
        Parish2_of_Countie1>WikipediaEntry_for_Parish2_of_Countie1,
        Parish3_of_Countie1>WikipediaEntry_for_Parish3_of_Countie1,
        .
        .
        .
        ParishN_of_Countie1>WikipediaEntry_for_ParishN_of_Countie1
        !Countie1>WikipediaEntry_for_Countie1;
        Parish1_of_Countie2>WikipediaEntry_for_Parish1_of_Countie2,
        Parish2_of_Countie2>WikipediaEntry_for_Parish2_of_Countie2,
        Parish3_of_Countie2>WikipediaEntry_for_Parish3_of_Countie2,
        .
        .
        .
        ParishN_of_Countie2>WikipediaEntry_for_ParishN_of_Countie2
        !Countie2>WikipediaEntry_for_Countie2;
                .
                .
                .
        !CountieN>WikipediaEntry_for_CountieN|
.
.
.

It has about 43000 lines, the number of parishes in Portugal.

Yacc

A yacc file, just like a lex file, is divided into three parts:

DECLARATIONS
%%
GRAMMAR
%%
FUNCTIONS

I will explain, later on, what to put in each of those three parts.

The solution

I started by writing the following grammar to describe my new file:

OrGeo     -> Districts

Districts -> District '|'
          | Districts District '|'

District  -> IdD Link Counties

Counties -> Countie
          | Counties ';' Countie

Countie  -> Parishes '!' IdC Link

Parishes    -> Parish
          | Parishes ',' Parish

Link      -> '>' l

IdD       -> id

IdC       -> id

Parish     -> IdL Link

IdL       -> id

Since id is a name (of a District, Countie or Parish), we will declare it in yacc as a pointer to char (vals). To do that we create a union in the first part of the yacc file (DECLARATIONS), and add an association between vals and id and l:

%union {
        char *vals;
}

%token <vals> id l

Yacc uses a union because you can declare as many associations as you want. We refer to that union in lex as yylval.

This is the same as using:

%token <vals> id
%token <vals> l

We now go to the lex file and add the rules that the lexical analyzer will apply when it finds text.

[ a-zA-ZÀ-ú-'.()/'`0-9]+  { yylval.vals = strdup(yytext);
                                    return id; }
[a-zA-ZÀ-ú-'.()/'`0-9_:]+ { yylval.vals = strdup(yytext);
                                    return l; }

Here we are saying that when lex finds text that matches the first regular expression it returns it to yacc as an id; that is how we capture the names of our localities. The second rule does the same for the links (l).

As you can see we have special symbols in our grammar (!>,;|), so we need to tell lex to return them to yacc, where we need them:

[!>,;|]                           { return yytext[0]; }

I will also tell lex to ignore all \n and \t characters:

[\t\n]                          { ; }
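Putting the lex side together, the whole lex file could look something like the sketch below. This is my reconstruction, not the author's file: it assumes yacc was run with -d so that the token ids and the yylval union come from y.tab.h, and string.h is needed for strdup.

%{
#include <string.h>   /* strdup */
#include "y.tab.h"    /* token ids and yylval union, generated by yacc -d */
%}
%%
[ a-zA-ZÀ-ú-'.()/'`0-9]+  { yylval.vals = strdup(yytext); return id; }
[a-zA-ZÀ-ú-'.()/'`0-9_:]+ { yylval.vals = strdup(yytext); return l; }
[!>,;|]                   { return yytext[0]; }
[\t\n]                    { ; }
%%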

Making our grammar powerful

Now our grammar gets some adjustments; we will tell yacc what to do when it reduces each production of a rule:

OrGeo     : Districts { d = $1; }
          ;
Districts : District '|' { $$ = $1; }
          | Districts District '|' { $$ = catDistricts($1,$2); }
          ;
District  : IdD Link Counties { $$ = addDistrict($1, $2, $3); }
          ;
Counties : Countie { $$ = $1; }
          | Counties ';' Countie { $$ = catCounties($1, $3); }
          ;
Countie  : Parishes '!' IdC Link { $$ = addCountie($1, $3, $4); }
          ;
Parishes    : Parish { $$ = $1; }
          | Parishes ',' Parish { $$ = catParishes($1, $3); }
          ;
Link      : '>' l { $$ = $2; }
          ;
IdD       : id { $$ = $1; }
          ;
IdC       : id { $$ = $1; }
          ;
Parish     : IdL Link { $$ = addParish($1, $2); }
          ;
IdL       : id { $$ = $1; }
          ;

Here we tell yacc how to behave when it goes through a certain production of a rule.
We can also tell yacc that a rule returns a given data type, for example:

%{
#include <stdio.h>
#include <stdlib.h>
#include "district.h"

District *d;
%}

%union {
        char *vals;
        Parish *valf;
        Countie *valc;
        District *vald;
}

%type <vals> Link IdL IdD IdC
%type <valf> Parish Parishes
%type <valc> Countie Counties
%type <vald> District Districts

To return something from a rule we refer to that rule as $$; that means IdL in

IdL       : id

and we refer to that id as $1, and so on. That way we can say $$ = $1, which means IdL = id.

The functions catDistricts, addDistrict, addParish, addCountie, catCounties and catParishes just create linked lists and append new elements to an already existing linked list.
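The header district.h is not shown in the post; here is a minimal sketch of what the types and prototypes behind those functions could look like. The field names and signatures are my guess, inferred from how the grammar actions use them:

/* district.h -- hypothetical sketch */
typedef struct parish {
        char *name, *link;
        struct parish *next;
} Parish;

typedef struct countie {
        char *name, *link;
        Parish *parishes;
        struct countie *next;
} Countie;

typedef struct district {
        char *name, *link;
        Countie *counties;
        struct district *next;
} District;

Parish   *addParish(char *name, char *link);              /* Parish : IdL Link */
Parish   *catParishes(Parish *list, Parish *p);           /* append to list */
Countie  *addCountie(Parish *ps, char *name, char *link); /* Countie : Parishes '!' IdC Link */
Countie  *catCounties(Countie *list, Countie *c);
District *addDistrict(char *name, char *link, Countie *cs);
District *catDistricts(District *list, District *d);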

The result is an HTML page with nested lists; here is the result.
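The post never shows the third part (FUNCTIONS) of the yacc file; a minimal sketch of what usually goes there is an error handler plus a main that starts the parse and then walks the d list (the actual HTML traversal is left out):

%%
int yyerror(char *s) {
        fprintf(stderr, "parse error: %s\n", s);
        return 0;
}

int main() {
        yyparse();      /* on success, d holds the list of districts */
        /* traverse d here and emit the nested HTML lists */
        return 0;
}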

Notes

All the code, including the code that generates the new file used by yacc, is available here; feel free to use it.





Lexical analysis Wikipedia

13 04 2008

Intro

This year I started to learn language processing. I started with regular expressions, and in the past few days I began to study Flex, since with regular expressions alone we can't create text filters.
The first job I did was a kind of dictionary: take a source of words and a faithful translation of each, and put all the information together in one document.

The problem was finding a good document with many Portuguese words and their translations into English.

With this post I want to share what I learned from this work.

Wikipedia XML structure

I started by picking up the latest dump of Wikipedia-PT and “deciphering” the XML in which it is stored. The structure is something like this:

<page>
    ...
</page>
<page>
    ...
</page>

And each page tag, expanded, has this structure:

<page>
    <title>TITLE</title>
    <id>PAGE_ID_NUMBER</id>
    <revision>
      <id>ID_REVISION_NUMBER</id>
      <timestamp>TIMESTAMP</timestamp>
      <contributor>
        <username>USERNAME</username>
        <id>ID_CONTRIBUTOR</id>
      </contributor>
      <comment>COMMENT</comment>
      <text xml:space="preserve">WIKIPEDIA_ENTRY</text>
    </revision>
  </page>

So, here we have the variables: TITLE, PAGE_ID_NUMBER, ID_REVISION_NUMBER, TIMESTAMP, USERNAME, ID_CONTRIBUTOR, COMMENT and WIKIPEDIA_ENTRY.

TITLE is the Portuguese word, because the Wikipedia dump I downloaded is the Portuguese one.
But so far, no English word.

Let's expand WIKIPEDIA_ENTRY:

<text xml:space="preserve">
    ...
    [[categoria:CATEGORY]]
    ...
    [[en:ENGLISH]]
    ...
</text>

Here we have the ENGLISH variable, which is the word in English corresponding to TITLE. I also want the CATEGORY variable, which indicates to which category this entry belongs.

As some entries in the Wikipedia have multiple category lines, I am also interested in keeping them all.
I want the output of my program to be something like this:

PT - TITLE1
EN - ENGLISH1
Categoria - CATEGORY11
            CATEGORY12
            ...
            CATEGORY1j
...
PT - TITLEi
EN - ENGLISHi
Categoria - CATEGORYi1
            CATEGORYi2
            ...
            CATEGORYij
...

Some entries in the Portuguese Wikipedia do not have a corresponding English version, so I do not want those entries.

Lexing

A Lex file has this layout:

definitions
%%
rules
%%
user code

Let’s focus on the rules part.

Rules have the following structure:

%%
REGEX    code
REGEX    code
%%

I know regular expressions (REGEX) now, so let's start building that thing!

I came to realize that the TITLE part may contain not only entry names; it also has the Wikipedia contributor pages, among other things that I am not interested in saving.
I started by making a list of all those page prefixes:
{Wikipedia,Usuário,Discussão,Ajuda,Anexo,MediaWiki,Categoria}

So I have to, somehow, throw away all the pages with the following structure:

  <page>
    <title>["Wikipedia""Usuario""Discussao""Ajuda""Anexo""MediaWiki""Categoria"]:TITLE</title>
    <id>ID_PAGE</id>
    <revision>
      <id>ID_REVISION</id>
      <timestamp>TIMESTAMP</timestamp>
      <contributor>
        <username>USERNAME</username>
        <id>ID_CONTRIBUTOR</id>
      </contributor>
      <comment>COMMENT</comment>
      <text xml:space="preserve">ENTRY</text>
    </revision>
  </page>

After a dozen lines I began to understand that I had to somehow “explain” to Lex the structure of the Wikipedia XML file. That way it would be easier.

I started to read the Flex manual and found Start Conditions, a very clever way to treat a block of information.

Languages like C, HTML and XML are structured in blocks, so Start Conditions may be the way to easily get information out of them.

Start Conditions

If you have a block of code that looks like this:

<title>TITLE</title>

Our block starts with the “<title>” string, then has a name (which I want to keep), and then ends with the “</title>” string.

So in Lex, we must use Start Conditions and produce the following code:

%x title_sc
anything .|[\n\t\r]
%%
"<title>"            BEGIN(title_sc);
<title_sc>[^<]+      {printf("title=%s\n",yytext);}
<title_sc>"</title>" BEGIN(INITIAL);
{anything}           {;} /* do nothing*/
%%
main() {
        yylex();
}

The %x title_sc declaration declares an exclusive state; that means that while we are inside title_sc, Flex won't look at rules outside of it until BEGIN(INITIAL).

In the definitions part we can declare names like anything .|[\n\t\r] to use in the rules part as {anything}.

The BEGIN(title_sc) statement makes Lex jump to that start condition, and it then matches only the rules it finds there.

We can rewrite the above code like this:

%x title_sc
anything .|[\n\t\r]
%%
"<title>"            BEGIN(title_sc);
<title_sc>{
        [^<]+        {printf("title=%s\n",yytext);}
        "</title>"   BEGIN(INITIAL);
}
{anything}           {;} /* do nothing*/
%%
main() {
        yylex();
}
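To try this out, something like the following should work (assuming the file is called title.l, and that the dump file name is just a placeholder; -lfl links flex's support library, which provides the default yywrap):

flex title.l
cc lex.yy.c -o titles -lfl
./titles < ptwiki-dump.xml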

When Lex finds a BEGIN(INITIAL) statement it jumps back to the initial state, so it would seem we can never use blocks inside other blocks (like XML does).

Of course that’s not true.

Start Conditions inside Start Conditions

Lex has a brilliant way to deal with that: it uses a stack to store the state we are in!

The idea is something like this; imagine a mathematical expression:

(1+2)-(3*(4/5))

I can refer to:

  • the 2 as [1,2]
  • the 3 as [2,1]
  • the 5 as [2,2,2]
  • and so on…

It is all about keeping the path, our current position in the tree.

So now we replace BEGIN(state) with yy_push_state(state), and to go back to the previous block we call yy_pop_state().
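To make the idea concrete, here is a tiny sketch of mine (not from the original post) that only tracks parenthesis depth with the state stack; note that yy_push_state/yy_pop_state require %option stack:

%option stack noyywrap
%x PAR
%%
"("        { yy_push_state(PAR); printf("push\n"); }
<PAR>"("   { yy_push_state(PAR); printf("push\n"); }
<PAR>")"   { yy_pop_state(); printf("pop\n"); }
<PAR>.|\n  ;
.|\n       ;
%%
int main() { yylex(); return 0; }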

With that I can now read structures like this one:

  <page>
    <title>TITLE</title>
      ...
      <text xml:space="preserve">
        ...
        [[Categoria:CATEGORY]]
        ...
        [[en:ENGLISH]]
        ...
      </text>
  </page>

And to do so, I wrote this Lex code:

%x PAGE TITLE TEXT CATEGORIA EN
%option stack
anything .|[\n\t\r]
notPage ["Wikipedia""Usuário""Discussão""Ajuda""Anexo""MediaWiki""Categoria"]
%%
"<page>"                        yy_push_state(PAGE);
<PAGE>{
        "<title>"{notPagina}    yy_pop_state(); // not a valid page
        "<title>"               yy_push_state(TITLE);
        "<text"[^>]+">"         yy_push_state(TEXT);
        "</page>"               yy_pop_state();
        {anything}              /* do nothing */
}
 
<TEXT>{
        "[["[cC]"ategoria:"     yy_push_state(CATEGORIA);
        "[[en:"                 yy_push_state(EN);
        "</text>"               yy_pop_state();
        {anything}              /* do nothing */
}
 
<TITLE>{
        [^<]+                   {
                                        i=0;
                                        imprime(cat, pt, en);
                                        limpa(cat);
                                        pt=NULL; en=NULL;
                                        pt=strdup(yytext);
                                }
        "</title>"              yy_pop_state();
        {anything}              /* do nothing */
}
 
<EN>{
        [^\]]+                  en=strdup(yytext);
        [\]]+                   yy_pop_state();
        "]"\n{anything}         /* do nothing */
}
 
<CATEGORIA>{
        [ \#\!\*\|]+            yy_pop_state();
        [^\]\|\n]+              {
                                        cat[i]=strdup(yytext);
                                        i++;
                                }
        [\]]+                   yy_pop_state();
        "]"\n{anything}         /* do nothing */
}
{anything}                      /* do nothing */
%%
int main() {
        yylex();
        return 0;
}

As you can see, we are all the time matching a rule that makes lex jump to another state, until a terminal rule (in this case, a variable assignment) or a pop.

To see all the code go here(.lex).

If you understand Portuguese and are interested in more information you can go here(.html).

References

flex: The Fast Lexical Analyzer





Parsing expressions

30 09 2007

Let's create a function that takes a String, which represents a mathematical expression, and turns it into a certain data type (Exp o).

The String we will pass to the function to be processed will be something like:

"((1>=a) || (4==x))" e até "(2 -( ((1+3) * (4-x)) / 45))"

Let's start at the beginning:

{-# OPTIONS -fglasgow-exts #-}
module ParserExp where
import Char

We choose our data type to represent, first, the operators:

data Ops = Add | Mul | Sim | Sub | Div
             | OR_ | AND_ | NOT_ | GE_
             | GT_ | LE_ | LT_ | EQ_ | NE_
        deriving Eq

We have just defined 14 mathematical operators: addition, multiplication, symmetric (unary minus), subtraction, division, logical or, logical and, negation, greater than or equal, greater than, less than or equal, less than, equal and not equal.

Let's now define our expressions as an n-ary tree:

data Exp o = Const Int | Var String | Op o [Exp o]

Now we can see that the chosen data type fits perfectly with what we want; look:

"((1>=a) || (4==x))" pode ser representado comoOp OR_ [Op GE_ [Const 1, Var "a"], Op EQ_ [Const 4, Var "x"]]

We have to create a class so that we know the arity of each operator:

class Opt o where
        arity :: o -> Int
instance Opt Ops where
        arity Add = 2
        arity Mul = 2
        arity Sim = 1
        arity Sub = 2
        arity Div = 2
        arity OR_  = 2
        arity AND_ = 2
        arity NOT_ = 1
        arity GE_  = 2
        arity GT_  = 2
        arity LE_  = 2
        arity LT_  = 2
        arity EQ_  = 2
        arity NE_  = 2

Next we write the Show instances for our data types, and we end up with:

instance Show Ops where
        show Add = "+"
        show Mul = "*"
        show Sim = "Sim"
        show Sub = "-"
        show Div = "/"
        show OR_  = "||"
        show AND_ = "&&"
        show NOT_ = "!"
        show GE_  = ">="
        show GT_  = ">"
        show LE_  = "<="
        show LT_  = "<"
        show EQ_  = "=="
        show NE_  = "!="

instance (Opt o, Show o) => Show (Exp o) where
        show (Const n) = show n
        show (Var s)   = s
        show (Op o l)
                | arity o == 2 = "(" ++ (show $ head l) ++ show o ++ (show $ last l) ++ ")"
                | arity o == 1 = "(" ++ show o ++ (show $ head l) ++ ")"

Well… let's finally get to what really interests us: our Parser!

There is a type in Haskell that was designed with parsing in mind, ReadS, which has the following definition:

type ReadS a = String -> [(a,String)]

In other words, it is easy to see that this type has to be the signature of our function, because only that way can we process a String, always producing a list of tuples in which the first component of the tuple has our desired type (the n-ary tree) and the second has the rest of the String that has not been processed yet.

Remember that our data type is (Exp Ops) and not just Exp.
So we have:

readsExp :: ReadS (Exp Ops)
readsExp s =
        [((leOp "~" [a]),p4) |  ("(",p1)  <- lex s,
                                       ("~",p2) <- lex p1,
                                       (a,p3)    <- readsExp p2,
                                       (")",p4)  <- lex p3] ++
        [((leOp op [a,b]),p5) | ("(",p1)  <- lex s,
                                       (a,p2)    <- readsExp p1,
                                       (op,p3)  <- lex p2,
                                       op == "+"  || op == "*"  || op == "/"  ||
                                       op == "-"  || op == "||" || op == "&&" ||
                                       op == "==" || op == "!=" || op == "=" || op == "!"  || op == ">"  ||
                                       op == "<",
                                       (b,p4)   <- readsExp p3,
                                       (")",p5) <- lex p4 ] ++
        [((Const ((read a)::Int)),sx) | (a,sx) <- lex s, all isDigit a] ++
        [((Var a),sx)                 | (a,sx) <- lex s, all isAlpha a]
        where
                leOp :: String -> [Exp Ops] -> Exp Ops
                leOp o = Op (read o::Ops)

Believe it or not, our parser comes down to these 21 lines (with exaggerated indentation).

Here we have to give credit to list comprehensions and then to the lex function that comes in the Prelude. The power of this function is incredible.
I advise you to read more about this function.
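One detail: leOp calls read o :: Ops, so Ops also needs a Read instance, which this post does not show (it is presumably in the full code linked below). A minimal sketch of one possibility, mirroring the Show instance plus "~" for Sim:

instance Read Ops where
        readsPrec _ s = [ (o, r) | (tok, r) <- lex s
                                 , Just o   <- [lookup tok table] ]
                where table = [ ("+",Add), ("*",Mul), ("~",Sim), ("-",Sub)
                              , ("/",Div), ("||",OR_), ("&&",AND_), ("!",NOT_)
                              , (">=",GE_), (">",GT_), ("<=",LE_), ("<",LT_)
                              , ("==",EQ_), ("!=",NE_) ]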

And finally we have the Read instance for Exp Ops:

instance Read (Exp Ops) where
        readsPrec _ s = readsExp s

From now on, whenever we want to parse an expression, we use:

parsing :: String -> Exp Ops
parsing = read

By using the parsing function we make sure that we always get an Exp Ops, which is what we want.
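A quick sanity check in GHCi (assuming the Read Ops instance sketched above) would look something like:

*ParserExp> parsing "(2 -( ((1+3) * (4-x)) / 45))"
(2-(((1+3)*(4-x))/45))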

This parser is quite simple and, as such, does not detect errors or malformed input in the String it receives.

That's all!

code here