From LALR(1) to LL(k)

2 04 2011

I, Pedro Faria and Jose Pedro Silva have been working on a LALR to LL conversor. Our idea was to discover the problems involved in this type of automatic conversion.
Our main goal here was to receive an LALAR(1) grammar and produce an LL(k) one. As a test case we receive an Yacc and produce ANTLR.
For the purpose of testing we used the LogoLISS grammar, a language inspired in LOGO programming language and developed in my university to school grammars and program transformation.

The main Yacc grammar was kind of a big one, here it is:

Liss            : PROGRAM IDENT '{' Body '}'
                ;
Body            : DECLARATIONS Declarations
                     STATEMENTS Statements
                ;
Declarations    : Declaration
                | Declarations Declaration
                ;
Declaration     : Variable_Declaration
                ;
Variable_Declaration : Vars SETA Type ';'
                     ;
Vars        : Var
            | Vars ',' Var
            ;
Var         : IDENT Value_Var
            ;
Value_Var   :
            | '=' Inic_Var
            ;
Type        : INTEGER
            | BOOLEAN
            | ARRAY SIZE NUM
            ;
Inic_Var    : Constant
            | Array_Definition
            ;
Constant    : '(' Sign NUM ')'
            | TRUE
            | FALSE
            ;
Sign        :
            | '+'
            | '-'
            ;
Array_Definition  : '[' Array_Initialization ']'
                  ;
Array_Initialization  : Elem
                      | Array_Initialization ',' Elem
                      ;
Elem        : Sign NUM
            ;
Statements      : Statement
                | Statements Statement
                ;
Statement   : Turtle_Commands
            | Assignment
            | Conditional_Statement
            | Iterative_Statement
            ;
Turtle_Commands  : Step
                 | Rotate
                 | Mode
                 | Dialogue
                 | Location
                 ;
Step     : FORWARD Expression
         | BACKWARD Expression
         ;
Rotate   : RRIGHT
         | RLEFT
         ;
Mode     : PEN UP
         | PEN DOWN
         ;
Dialogue : Say_Statement
         | Ask_Statement
         ;
Location : GOTO NUM ',' NUM
         | WHERE '?'
         ;
Assignment      : Variable '=' Expression
                ;
Variable        : IDENT Array_Acess
                ;
Array_Acess     :
                | '[' Single_Expression ']'
                ;
Expression      : Single_Expression
                | Expression Rel_Op Single_Expression
                ;
Single_Expression  : Term
                   | Single_Expression Add_Op Term
                   ;
Term        : Factor
            | Term Mul_Op Factor
            ;
Factor      : Constant
            | Variable
            | SuccOrPred
            | '(' '!' Expression ')'
            | '(' '+' Expression ')'
            | '(' '-' Expression ')'
            | '(' Expression ')'
            ;
Add_Op      : '+'
            | '-'
            | OU
            ;
Mul_Op      : '*'
            | '/'
            | E
            | MULTMULT
            ;
Rel_Op      : IGUAL
            | DIF
            | '<'             | '>'
            | MENORIGUAL
            | MAIORIGUAL
            | IN
            ;
SuccOrPred  : SuccPred IDENT
            ;
SuccPred    : SUCC
            | PRED
            ;
Say_Statement  : SAY '(' Expression ')'
               ;
Ask_Statement  : ASK '(' STRING ',' Variable ')'
               ;
Conditional_Statement : IfThenElse_Stat
                      ;
Iterative_Statement : While_Stat
                    ;
IfThenElse_Stat     : IF Expression THEN '{' Statements '}' Else_Expression
                    ;
Else_Expression     :
                    | ELSE '{' Statements '}'
                    ;
While_Stat     : WHILE '(' Expression ')' '{' Statements '}'
               ;

This is a toy grammar to help students to understand and get a deep knoledge of some fundamental types of productions in grammars.
The kind of texts this grammar is able to analyse are:

PROGRAM logolissExample {
    DECLARATIONS
    x = (100) , y -> Integer ;
    z -> Boolean ;
    w = TRUE -> Boolean ;
    STATEMENTS
        FORWARD x
        RRIGHT
        y = (100)
        FORWARD e
        RRIGHT
        x = x - (100) + (20)
        FORWARD x + (100)
        RRIGHT
        FORWARD (100)
}

Rules

After some time thinking about this problem we tried to solve the left recursion problem. ANTLR does not know how to handle with left recursion, so we must use the EBNF support to translate this productions.

Grammar rules in BNF provide for concatenation and choice but no specific operation equivalent to the * of regular expressions are provided. In Yacc the only way we get repetition is using the following pattern:

A : A a | a ;

We call this kind of rules a left recursive rule.

So, we discover this two generalized rules to remove left recursion in Yacc grammars (click on the image to expand):

\begin{tabular}{|l|l|c|}\hline\textbf{LALR (BNF)} & \textbf{LL (EBNF)} & Nome da Regra\\\hline\begin{tabular}{ccl}vars & : & var\\& | & vars a b c d e \ldots\end{tabular}&\begin{tabular}{ccl}vars & : & var (a b c d e \ldots)+\end{tabular}&removeLeftRecursionPlusExt\\\hline\begin{tabular}{ccl}vars & : & \\&| & vars a b c d e \ldots\end{tabular}&\begin{tabular}{ccl}vars & : & (a b c d e \ldots)*\end{tabular}&removeLeftRecursionTimesExt\\\hline\end{tabular}

Implementation

So, we start right over the implementation of the conversor. We used Yapp because lately we have been using a lot of Perl, so we are want to get deeper into Perl packages.

We start to implement the definition of a Yacc grammar in Yapp:

productions : production             
            | productions production 
            ;
production  : nonTerminal ':' derivs ';' 
            ;
derivs      : nts            
            | derivs '|' nts 
            |
            ;
nts         : nt     
            | nts nt 
            ;
nt          : terminal    
            | nonTerminal 
            | sep         
            ;
terminal    : STRING_TERMINAL     
            ;
nonTerminal : STRING_NON_TERMINAL
            ;
sep         : SEPARATOR 
            ;

And we move along to represent this information in a structure. We chose an peculiar structure to help us processing it.
This is the structure defined as Haskell types:

type Grammar = [Production]
type Production = HashTable ProductionName [Derivation]
type Derivation = [HashTable NTS Type]
type Type = String
type NTS = String

Easy to find stuff, easy to implement in Perl 🙂
So, this is our Yapp grammar again but this time with semantic actions:

productions : production             { return $_[1]; }
            | productions production {
                                       push @{$_[1]},@{$_[2]};
                                       return $_[1];
                                     }
            ;

production  : nonTerminal ':' derivs ';' { return [ { $_[1] => $_[3] } ]; }
            ;

derivs      : nts            { return [$_[1]]; }
            | derivs '|' nts {
                               push @{$_[1]},$_[3];
                               return $_[1];
                             }
            |                { return [[ { 'epsilon' => 'epsilon' } ]]; }
            ;
nts         : nt     { return $_[1]; }
            | nts nt {
                       push @{$_[1]},@{$_[2]};
                       return $_[1];
                     }
            ;
nt          : terminal    {
                            return [ { $_[1] => 'terminal' } ];
                          }
            | nonTerminal {
                            return [ { $_[1] => 'nonTerminal' } ];
                          }
            | sep         {
                            return [ { $_[1] => 'sep' } ];
                          }
            ;
terminal    : STRING_TERMINAL     {
                                    return $_[1];
                                  }
            ;
nonTerminal : STRING_NON_TERMINAL {
                                    return $_[1];
                                  }
            ;
sep         : SEPARATOR           {
                                    return $_[1];
                                  }
            ;

After we had this structure filled in we applied the two rules shown before and we get rid of left recursion for good.

So, after we process the following grammars with our transformer:

a   : a B C
    | //nothing
    ;
expr : expr "+" expr
     | expr "-" expr
     | expr "*" expr
     ;

We get:

a   : (B C)*
    ;
expr : ( "+" expr)+
     | ( "-" expr)+
     | ( "*" expr)+
     ;

You can see all the code produced (including makefiles) in my github repo.

Other rules

We have noticed that only removing the left recursion makes this grammar work in ANTLR.
We are aware that to fully convert an LALR grammar to LL we need one other step: Left Factoring.

Because our LL system is k-token lookahead we does not have to mind about ambiguous grammars.
Examples like this are not a problem for our destination system.

expr: T A B
    | T C D
    ;

However if our destination grammar was LL(1) we needed to convert this to it’s EBNF form.

Lexer

We also translated the Lex used in flex to ANTLR. The main problem here is that we used only matching functions in Perl and we do not used any grammar for this effect.
The main problem is that the regular expressions in flex are different to regexps in ANTLR.
You can see the Perl code for yourself here.
But you if you want to do a translation from flex to ANTLR better you define a flex grammar.





Lex/yacc

31 07 2008

Intro

Since long ago that I wanted to write about this. Finally I now have the time to write about syntactic analyzers generators.

When I was studding yacc noticed that there was much information available, not something one would expect of a tool with 40 years. Unfortunately there was no quick learning for someone who already knew the whole theory behind this kind of tools. As I said, the yacc is a very old tool, made when was needed compilers. Before there the whole theory of that area that we know today. So it is normal small errors transform themselves into disasters that may take a long time to be resolved.

This post is for those who already know what grammars are, lexical analyzers, etc., but that never touch in Lex/yacc or for those who are stuck with this things (I hope).

In this post I will explain to you Lex/yacc, with an example of a program that I did. I will not boring you with massive lines of code (like I usually do), because Lex/yacc is just a simple tool to use. With just 60 lines of explained code you will see how great that is.

The problem

My idea was to do a simple site with all the information about all the cities in Portugal. After reminded me to so but to all parishes (Freguesia in Portuguese), counties(Concelho in Portuguese) and districts(Distrito in Portuguese) of Portugal and their relationship. A district has many counties and one county has many parishes.

Simple! What I needed now was all the names of all parishes, counties and districts of Portugal, I eventually found three files with all this information and more in the Portuguese company of post office website. Better yet, these files had the relationship between each division. I have to do a small program in C that take those three files and create a new file, which I use on the Lex / yacc.

In order to have all available information on each local od Portugal I decided to use Wikipedia. This is the Achilles heel of the program, because unfortunately the entries of wikipedia on the Portuguese parishes are not uniform and many of the parishes or have entry. But still able to have good results.

The structure of the new file generated is:

District1>WikipediaEntry_for_District1
        Parish1_of_Countie1>WikipediaEntry_for_Parish1_of_Countie1,
        Parish2_of_Countie1>WikipediaEntry_for_Parish2_of_Countie1,
        Parish3_of_Countie1>WikipediaEntry_for_Parish3_of_Countie1,
        .
        .
        .
        ParishN_of_Countie1>WikipediaEntry_for_ParishN_of_Countie1
        !Countie1>WikipediaEntry_for_Countie1;
        Parish1_of_Countie2>WikipediaEntry_for_Parish1_of_Countie2,
        Parish2_of_Countie2>WikipediaEntry_for_Parish2_of_Countie2,
        Parish3_of_Countie2>WikipediaEntry_for_Parish3_of_Countie2,
        .
        .
        .
        ParishN_of_Countie1>WikipediaEntry_for_ParishN_of_Countie2
        !Countie2>WikipediaEntry_for_Countie2;
                .
                .
                .
        !CountieN>WikipediaEntry_for_CountieN|

District2>WikipediaEntry_for_District2
        Parish1_of_Countie1>WikipediaEntry_for_Parish1_of_Countie1,
        Parish2_of_Countie1>WikipediaEntry_for_Parish2_of_Countie1,
        Parish3_of_Countie1>WikipediaEntry_for_Parish3_of_Countie1,
        .
        .
        .
        ParishN_of_Countie1>WikipediaEntry_for_ParishN_of_Countie1
        !Countie1>WikipediaEntry_for_Countie1;
        Parish1_of_Countie2>WikipediaEntry_for_Parish1_of_Countie2,
        Parish2_of_Countie2>WikipediaEntry_for_Parish2_of_Countie2,
        Parish3_of_Countie2>WikipediaEntry_for_Parish3_of_Countie2,
        .
        .
        .
        ParishN_of_Countie1>WikipediaEntry_for_ParishN_of_Countie2
        !Countie2>WikipediaEntry_for_Countie2;
                .
                .
                .
        !CountieN>WikipediaEntry_for_CountieN|
.
.
.

It have about 43000 lines, the number of parishes in Portugal.

Yacc

An Yacc file is just like an lex file, it is divided in three parts:

DECLARATIONS
%%
GRAMMAR
%%
FUNCTIONS

I Will explain later on, what to put in all those three parts.

The solution

I start to write the following grammar to describe my new file:

OrGeo     -> Districts

Districts -> District '|'
          | Districts District '|'

Distrito  -> IdD Link Counties

Counties -> Countie
          | Counties ';' Countie

Countie  -> Parishes '!' IdC Link

Parishes    -> Parish
          | Parishes ',' Parish

Link      -> '>' l

IdD       -> id

IdC       -> id

Parish     -> IdL Link

IdL       -> id

As id is a name (of the District, Countie or Parish), we will declare it in yacc as a pointer to chars (vals). To do that we create a union like in the first part of yacc file (DECLARATIONS), and add a association with vals with id and l:

%union {
        char *vals;
}

%token  id l

Yacc use that union because you can declare as many associations as you want. We must refer to that union in the lex as yylval.

This is the same as use:

%token  id
%token  l

We now go to lex file and add the rule that the lexical analyzer will meet when find text.

[ a-zA-ZÀ-ú-'.()/'`0-9]+  { yylval.vals = strdup(yytext);
                                    return id; }
[a-zA-ZÀ-ú-'.()/'`0-9_:]+ { yylval.vals = strdup(yytext);
                                    return l; }

Here we are saying that when lex find some text that fills in that regular expression it will return to yacc as an id, so that way we find the names of our cities, or links.

As you can see we have special symbols in our grammar, (!>,;|), so we need to say to lex to return them to yacc, where we need them:

[!>,;|]                           { return yytext[0]; }

I also will say to lex ignore all n and t characters:

[tn]                            { ; }

Making our grammar powerfull

Now our grammar will suffer some adjustments; we will say yacc what to do when it was in some of the derivations of some rule:

OrGeo     : Districts { d = $1; }
          ;
Districts : District '|' { $$ = $1; }
          | Districts District '|' { $$ = catDistricts($1,$2); }
          ;
District  : IdD Link Counties { $$ = addDistrict($1, $2, $3); }
          ;
Counties : Countie { $$ = $1; }
          | Counties ';' Countie { $$ = catCounties($1, $3); }
          ;
Countie  : Parishes '!' IdC Link { $$ = addCountie($1, $3, $4); }
          ;
Parishes    : Parish { $$ = $1; }
          | Parishes ',' Parish { $$ = catParishes($1, $3); }
          ;
Link      : '>' l { $$ = $2; }
          ;
IdD       : id { $$ = $1; }
          ;
IdC       : id { $$ = $1; }
          ;
Parish     : IdL Link { $$ = addParish($1, $2); }
          ;
IdL       : id { $$ = $1; }
          ;

Here we tell yacc how to behave when pass certain derivation of a rule.
We can tell yacc that some rule can return a data type, for example:

%{
#include 
#include 
#include "district.h"

District *d;
%}

%union {
        char *vals;
        Parish *valf;
        Countie *valc;
        District *vald;
}

%type   Link IdL IdD IdC
%type   Parish Parishes
%type   Countie Counties
%type   District Districts

To return something in a rule we refer to that rule as $$, that mean IdL in

IdL       : id

and $1 to refer to that id, and so on. So, that way we can say that $$ = $1, that means, IdL = id.

Funtions catDistricts, addDistrict, addParish, addCountie, catCounties and catParishes, are just functions to create linked list’s and append new elemt’s in one already existent linked list.

The result is a html page with nested liked lists, here is the result.

Notes

All the code, including the code to generate the new file, used to yacc is available here, fell free to use it.