Lexical analysis Wikipedia

13 04 2008

Intro

This year I started to learn processing languages. I started by regular expressions and past few days I began to study the Flex, as with regular expressions we can’t create text filters.
The first job I did was a kind of a dictionary. Getting a source of words and a faithful translation, add all the information in one document.

The problem was finding a good document with many Portuguese words and their translations into English.

With this post I want to teach what I learned from this work.

Wikipedia XML structure

I started by picking up the last dump of Wikipedia-PT and “decipher” the XML where it is stored. The structure is something like that:

<page>
    ...
</page>
<page>
    ...
</page>

And each page tag, expanded, have this structure:

<page>
    <title>TITLE</title>
    <id>PAGE_ID_NUMBER</id>
    <revision>
      <id>ID_REVISION_NUMBER</id>
      <timestamp>TIMESTAMP</timestamp>
      <contributor>
        <username>USERNAME</username>
        <id>ID_CONTRIBUTOR</id>
      </contributor>
      <comment>COMMENT</comment>
      <text xml:space="preserve">WIKIPEDIA_ENTRY</text>
    </revision>
  </page>

So, here we have the variables: TITLE, PAGE_ID_NUMBER, ID_REVISION_NUMBER, TIMESTAMP, USERNAME, ID_CONTRIBUTOR, COMMENT and WIKIPEDIA_ENTRY.

TITLE is the Portuguese word, because the Wikipedia I’ve downloaded is in Portuguese.
But so far, no English word.

Lets expand WIKIPEDIA_ENTRY:

<text xml:space="preserve">
    ...
    [[categoria:CATEGORY]]
    ...
    [[en:ENGLISH]]
    ...
</text>

Here we have the ENGLISH variable that is the corresponding word in English to TITLE. I also want the CATEGORY variable that indicates from what category this entry belong.

As some entries in the Wikipedia have multiple categories lines I am also interested in keep them all.
I want that the output of my program became something like that:

PT - TITLE1
EN - ENGLISH1
Categoria - CATEGORY11
            CATEGORY12
            ...
            CATEGORY1j
...
PT - TITLEi
EN - ENGLISHi
Categoria - CATEGORYi1
            CATEGORYi2
            ...
            CATEGORYij
...

Some entries in the Portuguese Wikipedia do not have the English correspondent version so I will not want those entries.

Lexing

A Lex file have this aspect:

definitions
%%
rules
%%
user code

Let’s focus in rules part:

Rules have the following structure;

%%
REGEX    code
REGEX    code
%%

I know Regular Expressions (REGEX) now, so let’s start to build that thing!

I got me to realize that the part of the TITLE may have not only the entries name, also has the Wikipedia contributors pages, among other things that do not interest me to save.
I start to do a list of all those pages:
{Wikipedia,Usuário,Discussão,Ajuda,Anexo,MediaWiki,Categoria}

So, I have to, somehow, throw away all the pages with the following structure:

  <page>
    <title>["Wikipedia""Usuario""Discussao""Ajuda""Anexo""MediaWiki""Categoria"]:TITLE</title>
    <id>ID_PAGE</id>
    <revision>
      <id>ID_REVISION</id>
      <timestamp>TIMESTAMP</timestamp>
      <contributor>
        <username>USERNAME</username>
        <id>ID_CONTRIBUTOR</id>
      </contributor>
      <comment>COMMENT</comment>
      <text xml:space="preserve">ENTRY</text>
    </revision>
  </page>

After a dozen of lines I start to understand that I have to some how “explain” Lex the structure of the Wikipedia XML file. That way will be easier.

I start to read the Flex manual and I find the Start Conditions, a very clever way to treat a block of information.

Languages like C, HTML and XML are structured by blocks, so Start Conditions may be the way to easily get the information from those.

Start Conditions

If you have a block of code that have this aspect:

<title>TITLE</title>

Our block start with “” and then have a name (that I want) until ‘>’ and then ended with “” string.

So in Lex, we must use Start Conditions and produce the following code:

%x title_sc
anything .|[\n\t\r]
%%
"<title>"            BEGIN(title_sc);
<title_sc>[^>]+      {printf("title=%s\n",yytext);}
<title_sc>"</title>" BEGIN(INITIAL);
{anything}           {;} /* do nothing*/
%%
main() {
        yylex();
}

The %x title_sc declaration is to declare a state exclusive, that means; while we are inside code_sc Flex won’t look rules outside of that, until BEGIN(INITAIL).

In definitions part we can declare variables anything .|[\n\t\r] to use in rules part as {anything}.

The BEGIN(title_sc) statement makes Lex jump to the first line of rule and it start to match the rules that finds there.

We can rewrite the above code like that:

%x title_sc
anything .|[\n\t\r]
%%
"<title>"            BEGIN(title_sc);
<title_sc>{
        [^>]+        {printf("title=%s\n",yytext);}
        "</title>"   BEGIN(INITIAL);
}
{anything}           {;} /* do nothing*/
%%
main() {
        yylex();
}

When Lex find BEGIN(INITIAL) statement it jumps to the first BEGIN(XXX), so we can never be able to use block’s inside other block’s (like XML does).

Of course that’s not true.

Start Conditions inside Start Conditions

Lex have a brilliant way to deal with that. It uses Stack’s to store the state were we are!

The idea is something like that, imagine a mathematical expression:

(1+2)-(3*(4/5))

I can say:

  • 2 or [1,2]
  • 3 or [2,1]
  • 5 or [2,2,2]
  • and so on…

That’s all about keeping the path, our actual position in the tree.

So, now we replace the BEGIN(state) to yy_push_state(state), and to go to the previously block I say yy_pop_state().

With that now I can read structures like that one:

  <page>
    <title>TITLE</title>
      ...
      <text xml:space="preserve">
        ...
        [[Categoria:CATEGORY]]
        ...
        [[en:ENGLISH]]
        ...
      </text>
  </page>

And to do so, I write that Lex code:

%x PAGE TITLE TEXT CATEGORIA EN
%option stack
anything .|[\n\t\r]
notPage ["Wikipedia""Usuário""Discussão""Ajuda""Anexo""MediaWiki""Categoria"]
%%
"<page>"                        yy_push_state(PAGE);
<PAGE>{
        "<title>"{notPagina}    yy_pop_state(); // not a valid page
        "<title>"               yy_push_state(TITLE);
        "<text"[^>]+">"         yy_push_state(TEXT);
        "</page>"               yy_pop_state();
        {anything}              /* do nothing */
}
 
<TEXT>{
        "[["[cC]"ategoria:"     yy_push_state(CATEGORIA);
        "[[en:"                 yy_push_state(EN);
        "</text>"               yy_pop_state();
        {anything}              /* do nothing */
}
 
<TITLE>{
        [^<]+                   {
                                        i=0;
                                        imprime(cat, pt, en);
                                        limpa(cat);
                                        pt=NULL; en=NULL;
                                        pt=strdup(yytext);
                                }
        "</title>"              yy_pop_state();
        {anything}              /* do nothing */
}
 
<EN>{
        [^\]]+                  en=strdup(yytext);
        [\]]+                   yy_pop_state();
        "]"\n{anything}         /* do nothing */
}
 
<CATEGORIA>{
        [ \#\!\*\|]+            yy_pop_state();
        [^\]\|\n]+              {
                                        cat[i]=strdup(yytext);
                                        i++;
                                }
        [\]]+                   yy_pop_state();
        "]"\n{anything}         /* do nothing */
}
{anything}                      /* do nothing */
%%
int main() {
        yylex();
        return 0;
}

As you can see we are all the time matching a rule that makes lex jump to another state until one terminal rule (in this case variables affectations) or until pop.

To see all the code go here(.lex).

If you understand Portuguese and feel interested in more information you can go here(.html).

References

flex: The Fast Lexical Analyzer








Follow

Get every new post delivered to your Inbox.

Join 185 other followers