Top 10 programs – Haskell version

20 04 2008

Last week I saw on Tom Moertel’s blog the command to show the 10 most used programs:

history | awk '{print $2}' |sort | uniq -c | sort -rn | head

I didn’t post my top 10 right away because I was hoping to find time to do a Haskell version 🙂

Based on Joachim Breitner’s .bashrc I came up with this idea:

history | hmapw 'drop 1 . take 2' | hmapl List.sort | uniq -c | hmapl 'take 10 . List.reverse . List.sort'

But this is kind of cheating, because I’m still using the Unix uniq.

So, I changed Joachim Breitner’s .bashrc snippet to this:

...
if which ghc > /dev/null
then
        function ust { ghc-6.8.1 -e "interact ($*)" ~/Ust.hs ; }
        function ustmapl { ust "unlines.($*).lines" ; }
        function ustmapw { ustmapl "map (unwords.($*).words)" ; }

        function hmap { ghc-6.8.1 -e "interact ($*)" ;  }
        function hmapl { hmap  "unlines.($*).lines" ; }
        function hmapw { hmapl "map (unwords.($*).words)" ; }
fi
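For example, hmapl 'map reverse' ends up running ghc -e with the expression interact (unlines.(map reverse).lines). The wrapping that hmapl builds around the user code can be sketched as a pure function (mapl is my name for it, not part of the .bashrc):

```haskell
-- the wrapping hmapl builds around the user's code (mapl is my name for it):
-- split stdin into lines, apply the function, glue the lines back together
mapl :: ([String] -> [String]) -> String -> String
mapl f = unlines . f . lines

main :: IO ()
main = putStr (mapl (map reverse) "abc\ndef\n")
-- prints each input line reversed: "cba" then "fed"
```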

And based on the Haskell wiki’s Simple Unix tools I wrote Ust.hs:

module Ust(tail10, pick, uniq_c) where

import Control.Monad.Instances
import Data.List

-- @http://www.haskell.org/haskellwiki/Simple_unix_tools
-- return the last ten lines of a file
tail10  = drop =<< subtract 10 . length

pick n = (:[]) . (!!n)

uniq_c l = [ nl (tam l) i s | (s,i) <- uniq_c' l]

tam = maximum . map snd . uniq_c'

uniq_c' [] = []
uniq_c' (h:t) = let (list,rest) = span (==h) t
                    n = length list + 1
                in (h,n) : uniq_c' rest

nl tam n line = let l = length $ show n
                    l_tam = length $ show tam
                    pad = replicate (l_tam-l) ' '
                in pad ++ show n ++ " " ++ line

And here it is, the Haskell version:

history | ustmapw 'pick 1' | ustmapl 'reverse . tail10 . sort . uniq_c . sort'

Explanation

pick 1 ["a","b","c"] = ["b"]
reverse [1,2,3] = [3,2,1]
-- 'tail10' returns the last 10 elements of a list
-- 'sort' you know...
-- uniq_c = uniq -c (from unix)

ustmapl converts a String (with lots of ‘\n’) into the list of Strings separated by ‘\n’.
ustmapw converts a list of Strings into a list of lists of Strings, split on ‘\t’ or ‘ ’.
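To make the counting step concrete, here is a standalone sketch that repeats uniq_c' from Ust.hs and runs the whole count-sort-take idea on plain pairs instead of padded strings (topN is my name, not part of Ust.hs):

```haskell
import Data.List (sort, sortBy)
import Data.Ord (comparing)

-- run-length counting of adjacent equal elements, as uniq_c' in Ust.hs
uniq_c' :: Eq a => [a] -> [(a, Int)]
uniq_c' [] = []
uniq_c' (h:t) = let (same, rest) = span (== h) t
                in (h, length same + 1) : uniq_c' rest

-- the count-sort-take idea on plain pairs (topN is my name, not in Ust.hs)
topN :: Ord a => Int -> [a] -> [(a, Int)]
topN n = take n . sortBy (flip (comparing snd)) . uniq_c' . sort

main :: IO ()
main = print (topN 2 ["make", "cd", "make", "ls", "make", "cd"])
-- prints [("make",3),("cd",2)]
```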

Composition in Haskell works the same way as mathematical composition.
In the Unix console the pipe reads left to right, while (.) reads right to left, so the stages of the Haskell composition appear in the reverse order of the equivalent pipeline.
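A tiny sketch of that reversal, using stand-in stages (upper-casing and sorting) rather than the ones from the post:

```haskell
import Data.Char (toUpper)
import Data.List (sort)

-- shell:   ... | tr a-z A-Z | sort
-- haskell: the same stages, written right to left
shellLike :: [String] -> [String]
shellLike = sort . map (map toUpper)

main :: IO ()
main = print (shellLike ["banana", "apple"])
-- upper-cases first, then sorts: ["APPLE","BANANA"]
```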

Btw, here is my top10:


623 make
433 sudo
380 cd
247 xpdf
176 ./game
175 ./client
129 man
117 ls
76 ./server
11 history





Lexical analysis of Wikipedia

13 04 2008

Intro

This year I started learning language processing. I began with regular expressions, and in the past few days I started to study Flex, since regular expressions alone are not enough to build text filters.
The first job I did was a kind of dictionary: given a source of words and a faithful translation, gather all the information into one document.

The problem was finding a good document with many Portuguese words and their translations into English.

With this post I want to share what I learned from this work.

Wikipedia XML structure

I started by picking up the latest dump of the Portuguese Wikipedia and “deciphering” the XML in which it is stored. The structure is something like this:

<page>
    ...
</page>
<page>
    ...
</page>

And each page tag, expanded, has this structure:

<page>
    <title>TITLE</title>
    <id>PAGE_ID_NUMBER</id>
    <revision>
      <id>ID_REVISION_NUMBER</id>
      <timestamp>TIMESTAMP</timestamp>
      <contributor>
        <username>USERNAME</username>
        <id>ID_CONTRIBUTOR</id>
      </contributor>
      <comment>COMMENT</comment>
      <text xml:space="preserve">WIKIPEDIA_ENTRY</text>
    </revision>
  </page>

So, here we have the variables: TITLE, PAGE_ID_NUMBER, ID_REVISION_NUMBER, TIMESTAMP, USERNAME, ID_CONTRIBUTOR, COMMENT and WIKIPEDIA_ENTRY.

TITLE is the Portuguese word, because the Wikipedia I downloaded is the Portuguese one.
But so far, no English word.

Let’s expand WIKIPEDIA_ENTRY:

<text xml:space="preserve">
    ...
    [[categoria:CATEGORY]]
    ...
    [[en:ENGLISH]]
    ...
</text>

Here we have the ENGLISH variable, which is the English word corresponding to TITLE. I also want the CATEGORY variable, which indicates which category this entry belongs to.

As some entries in the Wikipedia have multiple category lines, I am also interested in keeping them all.
I want the output of my program to look something like this:

PT - TITLE1
EN - ENGLISH1
Categoria - CATEGORY11
            CATEGORY12
            ...
            CATEGORY1j
...
PT - TITLEi
EN - ENGLISHi
Categoria - CATEGORYi1
            CATEGORYi2
            ...
            CATEGORYij
...

Some entries in the Portuguese Wikipedia do not have a corresponding English version, so I do not want those entries.

Lexing

A Lex file has this layout:

definitions
%%
rules
%%
user code

Let’s focus on the rules part.

Rules have the following structure:

%%
REGEX    code
REGEX    code
%%

Now that I know regular expressions (REGEX), let’s start building that thing!

It took me a while to realize that the TITLE part may contain not only the entries’ names; it also has the Wikipedia contributors’ pages, among other things that I have no interest in saving.
I started by making a list of all those pages:
{Wikipedia, Usuário, Discussão, Ajuda, Anexo, MediaWiki, Categoria}

So, I have to, somehow, throw away all the pages with the following structure:

  <page>
    <title>["Wikipedia""Usuario""Discussao""Ajuda""Anexo""MediaWiki""Categoria"]:TITLE</title>
    <id>ID_PAGE</id>
    <revision>
      <id>ID_REVISION</id>
      <timestamp>TIMESTAMP</timestamp>
      <contributor>
        <username>USERNAME</username>
        <id>ID_CONTRIBUTOR</id>
      </contributor>
      <comment>COMMENT</comment>
      <text xml:space="preserve">ENTRY</text>
    </revision>
  </page>

After a dozen lines I began to understand that I had to somehow “explain” to Lex the structure of the Wikipedia XML file. That way it would be easier.

I started to read the Flex manual and found Start Conditions, a very clever way to handle a block of information.

Languages like C, HTML and XML are structured in blocks, so Start Conditions may be the way to easily extract information from them.

Start Conditions

If you have a block of code that looks like this:

<title>TITLE</title>

our block starts with the “<title>” string, then has a name (which I want) running up to the closing tag, and ends with the “</title>” string.

So in Lex, we must use Start Conditions and produce the following code:

%x title_sc
anything .|[\n\t\r]
%%
"<title>"            BEGIN(title_sc);
<title_sc>[^<]+      {printf("title=%s\n",yytext);}
<title_sc>"</title>" BEGIN(INITIAL);
{anything}           {;} /* do nothing*/
%%
int main() {
        yylex();
        return 0;
}

The %x title_sc declaration declares an exclusive state; that means that while we are inside title_sc, Flex won’t look at rules outside of it, until BEGIN(INITIAL).

In the definitions part we can declare names, like anything .|[\n\t\r], to use in the rules part as {anything}.

The BEGIN(title_sc) statement makes Lex enter the title_sc state, where it starts matching the rules it finds there.

We can rewrite the above code like this:

%x title_sc
anything .|[\n\t\r]
%%
"<title>"            BEGIN(title_sc);
<title_sc>{
        [^<]+        {printf("title=%s\n",yytext);}
        "</title>"   BEGIN(INITIAL);
}
{anything}           {;} /* do nothing*/
%%
int main() {
        yylex();
        return 0;
}

When Lex finds a BEGIN(INITIAL) statement it jumps straight back to the initial state, so it seems we can never use blocks inside other blocks (like XML does).

Of course that’s not true.

Start Conditions inside Start Conditions

Lex has a brilliant way to deal with that: it uses a stack to store the state we are in!

The idea is something like this; imagine a mathematical expression:

(1+2)-(3*(4/5))

I can say:

  • 2 is at [1,2]
  • 3 is at [2,1]
  • 5 is at [2,2,2]
  • and so on…

That’s all about keeping the path: our current position in the tree.

So now we replace BEGIN(state) with yy_push_state(state), and to go back to the previous block we call yy_pop_state().
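The push/pop discipline is just a stack of states. A minimal Haskell sketch of the idea (the names are mine; this is not how flex is implemented internally):

```haskell
-- yy_push_state enters a new state remembering the current one;
-- yy_pop_state returns to whatever state we were in before
type State = String

pushState :: State -> [State] -> [State]
pushState s stack = s : stack

popState :: [State] -> [State]
popState (_:rest) = rest
popState []       = []  -- popping an empty stack: nothing to return to

main :: IO ()
main = print (popState (pushState "TEXT" (pushState "PAGE" ["INITIAL"])))
-- after entering PAGE, then TEXT, one pop brings us back to PAGE
```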

With that, I can now read structures like this one:

  <page>
    <title>TITLE</title>
      ...
      <text xml:space="preserve">
        ...
        [[Categoria:CATEGORY]]
        ...
        [[en:ENGLISH]]
        ...
      </text>
  </page>

And to do so, I wrote this Lex code:

%x PAGE TITLE TEXT CATEGORIA EN
%option stack
anything .|[\n\t\r]
notPage ("Wikipedia"|"Usuário"|"Discussão"|"Ajuda"|"Anexo"|"MediaWiki"|"Categoria")
%%
"<page>"                        yy_push_state(PAGE);
<PAGE>{
        "<title>"{notPage}      yy_pop_state(); // not a valid page
        "<title>"               yy_push_state(TITLE);
        "<text"[^>]+">"         yy_push_state(TEXT);
        "</page>"               yy_pop_state();
        {anything}              /* do nothing */
}
 
<TEXT>{
        "[["[cC]"ategoria:"     yy_push_state(CATEGORIA);
        "[[en:"                 yy_push_state(EN);
        "</text>"               yy_pop_state();
        {anything}              /* do nothing */
}
 
<TITLE>{
        [^<]+                   {
                                        i=0;
                                        imprime(cat, pt, en);
                                        limpa(cat);
                                        pt=NULL; en=NULL;
                                        pt=strdup(yytext);
                                }
        "</title>"              yy_pop_state();
        {anything}              /* do nothing */
}
 
<EN>{
        [^\]]+                  en=strdup(yytext);
        [\]]+                   yy_pop_state();
        "]"\n{anything}         /* do nothing */
}
 
<CATEGORIA>{
        [ \#\!\*\|]+            yy_pop_state();
        [^\]\|\n]+              {
                                        cat[i]=strdup(yytext);
                                        i++;
                                }
        [\]]+                   yy_pop_state();
        "]"\n{anything}         /* do nothing */
}
{anything}                      /* do nothing */
%%
int main() {
        yylex();
        return 0;
}

As you can see, we are matching, all the time, rules that make Lex jump to another state, until a terminal rule (in this case, the variable assignments) or a pop.

To see all the code go here(.lex).

If you understand Portuguese and feel interested in more information you can go here(.html).

References

flex: The Fast Lexical Analyzer





Dave Matthews Band

13 04 2008


It’s the only band where I like ALL the songs. What a great band! I strongly advise anyone who doesn’t know them to listen to as much as they can of the live albums. There are 9 of them, of which I highlight The Central Park Concert from 2003. If you can, get the DVD and watch these gentlemen on stage. Brutal!





Life after people – Documentary

12 04 2008

Last week, after finishing an assignment for university, I was stumbling around and found an excellent documentary to watch. Life After People is a documentary from the History Channel that tries to answer the question:

What would happen to planet Earth if the human race were to suddenly disappear forever?

The idea is just great! Probably everyone has thought about this at some point, and maybe didn’t have the imagination or the knowledge to get a picture as clear as the ones in this documentary.

The documentary has a lot of speculative facts, and is a little bit Americanized, but the idea of watching everything we created being destroyed by the land we live on is fascinating.

It reminds us that our planet is alive, lives by itself, and in some respects needs help. We are small, but at the same time we can cause a lot of changes on our planet.

Well, I really liked this imaginative documentary; here is the “trailer”:

Here is the official site of the documentary.





Eduroam on Linux

12 04 2008

I have written about this topic before. However, I currently use an approach that I like better than the one I explained back then, and I would like to share it with anyone who is interested.

On Linux there is a file that stores the network interface configurations, /etc/network/interfaces, so let’s start by editing it:

# vi /etc/network/interfaces

In this file we add the following lines at the end:

iface eduroam inet dhcp
wpa-driver wext
wpa-ssid eduroam
wpa-eap TTLS
wpa-identity <IDENTIDADE>
wpa-phase2 auth=PAP
wpa-group TKIP WEP104
wpa-pairwise TKIP WEP104
wpa-key-mgmt WPA-EAP
wpa-password <PASSWORD>
wpa-ca_cert <PATH_PARA_O_CERTIFICADO>
wpa-ap_scan 1

All these settings are exactly the same as the ones I mentioned in the old post, which lived in the wpa_supplicant.conf file.

But this is why I now prefer this configuration to the previous one.

Now all we need to do is:

$ sudo ifdown eduroam

To bring down the configuration block shown above.

$ sudo ifup eduroam

To connect to the eduroam network.

🙂