Top 10 programs – Haskell version

20 04 2008

Last week I saw on Tom Moertel's blog the command to show the 10 most-used programs:

`history | awk '{print $2}' | sort | uniq -c | sort -rn | head`
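To see what each stage contributes, here is the same pipeline run on a few fabricated history lines (the commands are made up for the example):

```shell
# Fake `history` output: a number followed by the command name.
# awk picks the command, sort groups repeats, uniq -c counts them,
# sort -rn orders by count (descending), head keeps the top 10.
printf '  1  make\n  2  ls\n  3  make\n  4  cd\n  5  make\n  6  ls\n' \
    | awk '{print $2}' | sort | uniq -c | sort -rn | head
```

This prints `make`, `ls` and `cd` with counts 3, 2 and 1, the counts right-aligned by `uniq -c`.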

I did not post my top 10 then because I was hoping to have time to do a Haskell version 🙂

Based on Joachim Breitner's .bashrc I came up with this idea:

`history | hmapw 'drop 1 . take 2' | hmapl List.sort | uniq -c | hmapl 'take 10 . List.reverse . List.sort'`

But this is kind of cheating, because I'm still using the Unix `uniq`.

So, I changed Joachim Breitner's .bashrc to this:

```
# ...
if which ghc > /dev/null
then
    function ust     { ghc-6.8.1 -e "interact ($*)" ~/Ust.hs ; }
    function ustmapl { ust "unlines.($*).lines" ; }
    function ustmapw { ustmapl "map (unwords.($*).words)" ; }

    function hmap  { ghc-6.8.1 -e "interact ($*)" ; }
    function hmapl { hmap "unlines.($*).lines" ; }
    function hmapw { hmapl "map (unwords.($*).words)" ; }
fi
```

And, based on Unix Simple Tools, I wrote Ust.hs:

```
module Ust (tail10, pick, uniq_c) where

import Data.List

-- return the last ten lines of a file
tail10 = drop =<< subtract 10 . length

-- pick the n-th word of a line (as a singleton list)
pick n = (:[]) . (!!n)

-- like the Unix `uniq -c`: count runs of equal lines
uniq_c l = [ nl (tam l) i s | (s,i) <- uniq_c' l ]

-- the largest count, used to right-align the numbers
tam = maximum . map snd . uniq_c'

uniq_c' [] = []
uniq_c' (h:t) = let (list,rest) = span (==h) t
                    n = length list + 1
                in (h,n) : uniq_c' rest

-- right-align the count before the line, like `uniq -c` does
nl tam n line = let l     = length $ show n
                    l_tam = length $ show tam
                    n'    = replicate (l_tam - l) " "
                in concat n' ++ show n ++ " " ++ line
```

And here it is, the Haskell version:

`history | ustmapw 'pick 1' | ustmapl 'reverse . tail10 . sort . uniq_c . sort'`

Explanation

```
pick 1 ["a","b","c"] = ["b"]
reverse [1,2,3] = [3,2,1]
-- 'tail10' returns the last 10 elements of a list
-- 'sort' you know...
-- uniq_c = uniq -c (from Unix)
```

`ustmapl` converts a String (with a lot of '\n's) into the list of Strings separated by '\n'.
`ustmapw` converts a list of Strings into a list of lists of Strings separated by '\t' or ' '.

Composition in Haskell works the same way as mathematical composition.
In the Unix console the pipe reads left to right, so we must reverse the order of the arguments of the composition (.) to get the same effect.
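A quick shell illustration of that order reversal, using made-up data:

```shell
# The pipe applies the stages left to right: first sort, then reverse
# (tac), then keep 3.  As a Haskell composition it would read in the
# opposite order: take 3 . reverse . sort
printf 'b\nd\na\nc\ne\n' | sort | tac | head -n 3
# prints: e, d, c (one per line)
```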

Btw, here is my top10:

```
 623 make
 433 sudo
 380 cd
 247 xpdf
 176 ./game
 175 ./client
 129 man
 117 ls
  76 ./server
  11 history
```

13 04 2008

Intro

This year I started to learn language processing. I started with regular expressions, and in the past few days I began to study Flex, since with regular expressions alone we can't create text filters.
The first job I did was a kind of dictionary: taking a source of words and a faithful translation, and putting all the information into one document.

The problem was finding a good document with many Portuguese words and their translations into English.

With this post I want to share what I learned from this work.

Wikipedia XML structure

I started by picking up the latest dump of the Portuguese Wikipedia and "deciphering" the XML in which it is stored. The structure is something like this:

```
<page>
...
</page>
<page>
...
</page>
```

And each page tag, expanded, has this structure:

```
<page>
  <title>TITLE</title>
  <id>PAGE_ID_NUMBER</id>
  <revision>
    <id>ID_REVISION_NUMBER</id>
    <timestamp>TIMESTAMP</timestamp>
    <contributor>
      <username>USERNAME</username>
      <id>ID_CONTRIBUTOR</id>
    </contributor>
    <comment>COMMENT</comment>
    <text xml:space="preserve">WIKIPEDIA_ENTRY</text>
  </revision>
</page>
```

So, here we have the variables: TITLE, PAGE_ID_NUMBER, ID_REVISION_NUMBER, TIMESTAMP, USERNAME, ID_CONTRIBUTOR, COMMENT and WIKIPEDIA_ENTRY.

TITLE is the Portuguese word, because the Wikipedia I’ve downloaded is in Portuguese.
But so far, no English word.

Let's expand WIKIPEDIA_ENTRY:

```
<text xml:space="preserve">
...
[[Categoria:CATEGORY]]
...
[[en:ENGLISH]]
...
</text>
```

Here we have the ENGLISH variable, which is the English word corresponding to TITLE. I also want the CATEGORY variable, which indicates which category this entry belongs to.

As some entries in the Wikipedia have multiple category lines, I am also interested in keeping them all.
I want the output of my program to look something like this:

```
PT - TITLE1
EN - ENGLISH1
Categoria - CATEGORY11
            CATEGORY12
            ...
            CATEGORY1j
...
PT - TITLEi
EN - ENGLISHi
Categoria - CATEGORYi1
            CATEGORYi2
            ...
            CATEGORYij
...
```

Some entries in the Portuguese Wikipedia do not have a corresponding English version, so I do not want those entries.
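Before writing any Lex, the desired extraction can be sketched in plain shell. Given a fabricated fragment of a page (the words below are invented for the example), sed pulls out the three pieces of information:

```shell
# Hypothetical page fragment; extract TITLE, [[en:...]] and [[Categoria:...]].
printf '<title>Lisboa</title>\n[[en:Lisbon]]\n[[Categoria:Cidades]]\n' \
    | sed -n -e 's/<title>\([^<]*\)<\/title>/PT - \1/p' \
             -e 's/\[\[en:\([^]]*\)\]\]/EN - \1/p' \
             -e 's/\[\[Categoria:\([^]]*\)\]\]/Categoria - \1/p'
# prints:
# PT - Lisboa
# EN - Lisbon
# Categoria - Cidades
```

The Lex program will do the same job in a stateful, more robust way; the sed sketch only handles one field per line.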

Lexing

A Lex file has this structure:

```
definitions
%%
rules
%%
user code
```

Let's focus on the rules part.

Rules have the following structure:

```
%%
REGEX    code
REGEX    code
%%
```

I know Regular Expressions (REGEX) now, so let’s start to build that thing!

I came to realize that TITLE may contain not only the entries' names: it also has the Wikipedia contributors' pages, among other things that I am not interested in saving.
I made a list of all those page prefixes:
{Wikipedia, Usuário, Discussão, Ajuda, Anexo, MediaWiki, Categoria}

So I have to, somehow, throw away all the pages with the following structure (the prefix written as a Lex-style alternation):

```
<page>
  <title>("Wikipedia"|"Usuário"|"Discussão"|"Ajuda"|"Anexo"|"MediaWiki"|"Categoria"):TITLE</title>
  <id>ID_PAGE</id>
  <revision>
    <id>ID_REVISION</id>
    <timestamp>TIMESTAMP</timestamp>
    <contributor>
      <id>ID_CONTRIBUTOR</id>
    </contributor>
    <comment>COMMENT</comment>
    <text xml:space="preserve">ENTRY</text>
  </revision>
</page>
```
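That filtering can be previewed with grep before encoding it in Lex; the sample titles below are invented:

```shell
# Keep only real entries; drop titles that start with an excluded prefix.
printf '<title>Algoritmo</title>\n<title>Wikipedia:Esplanada</title>\n<title>Ajuda:Guia</title>\n' \
    | grep -Ev '<title>(Wikipedia|Usuário|Discussão|Ajuda|Anexo|MediaWiki|Categoria):'
# prints: <title>Algoritmo</title>
```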

After a dozen lines I started to understand that I had to somehow "explain" to Lex the structure of the Wikipedia XML file. That way it would be easier.

I started to read the Flex manual and found Start Conditions, a very clever way to treat a block of information.

Languages like C, HTML and XML are structured in blocks, so Start Conditions may be the way to easily extract information from them.

Start Conditions

If you have a block of code that looks like this:

```
<title>TITLE</title>
```

Our block starts with the string "<title>", then has a name (which I want) up to the '<' character, and ends with the string "</title>".

So in Lex, we must use Start Conditions and produce the following code:

```
%x title_sc
anything .|[\n\t\r]
%%
"<title>"            BEGIN(title_sc);
<title_sc>[^<]+      {printf("title=%s\n",yytext);}
<title_sc>"</title>" BEGIN(INITIAL);
{anything}           {;} /* do nothing */
%%
int main() {
    yylex();
}
```
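For comparison, the output these rules produce (a `title=...` line per title element) can be mimicked with sed on an invented sample input:

```shell
# Same effect as the Lex rules above: emit "title=..." per <title> element.
printf '<page><title>Algoritmo</title></page>\n' \
    | sed -n 's/.*<title>\([^<]*\)<\/title>.*/title=\1/p'
# prints: title=Algoritmo
```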

The `%x title_sc` declaration declares an exclusive state; that means that while we are inside `title_sc`, Flex won't look at rules outside of it, until `BEGIN(INITIAL)`.

In the definitions part we can declare names like `anything .|[\n\t\r]` to use in the rules part as `{anything}`.

The `BEGIN(title_sc)` statement makes Lex enter the `title_sc` state, where it starts to match the rules defined for that state.

We can rewrite the above code like this:

```
%x title_sc
anything .|[\n\t\r]
%%
"<title>"            BEGIN(title_sc);
<title_sc>{
  [^<]+        {printf("title=%s\n",yytext);}
  "</title>"   BEGIN(INITIAL);
}
{anything}           {;} /* do nothing */
%%
int main() {
    yylex();
}
```

When Lex finds a `BEGIN(INITIAL)` statement it jumps back to the initial state, so it seems we will never be able to use blocks inside other blocks (like XML does).

Of course, that's not true.

Start Conditions inside Start Conditions

Lex has a brilliant way to deal with that: it uses a stack to store the state we are in!

The idea is something like this. Imagine a mathematical expression:

$(1+2)-(3*(4/5))$

I can refer to each number by its path in the expression tree:

• 2 is at [1,2]
• 3 is at [2,1]
• 5 is at [2,2,2]
• and so on…

It's all about keeping the path, our current position in the tree.

So now we replace `BEGIN(state)` with `yy_push_state(state)`, and to go back to the previous state we call `yy_pop_state()`.

With that, now I can read structures like this one:

```
<page>
  <title>TITLE</title>
  ...
  <text xml:space="preserve">
  ...
  [[Categoria:CATEGORY]]
  ...
  [[en:ENGLISH]]
  ...
  </text>
</page>
```

And to do so, I wrote this Lex code:

```
%x PAGE TITLE TEXT CATEGORIA EN
%option stack
%{
#include <string.h>
/* pt, en and cat hold the current entry; imprime() (print) and
   limpa() (clean) are defined in the full source linked below. */
char *pt=NULL, *en=NULL;
char *cat[100];
int i=0;
%}
anything .|[\n\t\r]
notPage ("Wikipedia"|"Usuário"|"Discussão"|"Ajuda"|"Anexo"|"MediaWiki"|"Categoria")
%%
"<page>"                  yy_push_state(PAGE);
<PAGE>{
  "<title>"{notPage}      yy_pop_state(); /* not a valid page */
  "<title>"               yy_push_state(TITLE);
  "<text"[^>]+">"         yy_push_state(TEXT);
  "</page>"               yy_pop_state();
  {anything}              /* do nothing */
}

<TEXT>{
  "[["[cC]"ategoria:"     yy_push_state(CATEGORIA);
  "[[en:"                 yy_push_state(EN);
  "</text>"               yy_pop_state();
  {anything}              /* do nothing */
}

<TITLE>{
  [^<]+                   {
                            i=0;
                            imprime(cat, pt, en);
                            limpa(cat);
                            pt=NULL; en=NULL;
                            pt=strdup(yytext);
                          }
  "</title>"              yy_pop_state();
  {anything}              /* do nothing */
}

<EN>{
  [^\]]+                  en=strdup(yytext);
  [\]]+                   yy_pop_state();
  "]"\n{anything}         /* do nothing */
}

<CATEGORIA>{
  [ \#\!\*\|]+            yy_pop_state();
  [^\]\|\n]+              {
                            cat[i]=strdup(yytext);
                            i++;
                          }
  [\]]+                   yy_pop_state();
  "]"\n{anything}         /* do nothing */
}
{anything}                /* do nothing */
%%
int main() {
    yylex();
    return 0;
}
```

As you can see, we are all the time matching a rule that makes Lex jump to another state, until a terminal rule (in this case, the variable assignments) or a pop.

To see all the code, go here (.lex).

If you understand Portuguese and are interested in more information, you can go here (.html).

References

flex: The Fast Lexical Analyzer

Dave Matthews Band

13 04 2008

It's the only band of which I like ALL the songs. What a great band! I strongly advise anyone who doesn't know them to listen to as much as they can of the live albums. There are 9 of them, of which I highlight The Central Park Concert, from 2003. If you can, get the DVD and watch these gentlemen on stage. Brutal!

Life after people – Documentary

12 04 2008

Last week, after finishing a university assignment, I was stumbling around and found an excellent documentary to watch. Life After People is a documentary from the History Channel that tries to answer the question:

What would happen to planet earth if the human race were to suddenly disappear forever?

The idea is just great! Probably everyone at some point has thought about that, and maybe didn't have the imagination or the knowledge to get a picture as clear as the ones in this documentary.

The documentary has a lot of speculative facts, and is just a little bit Americanized, but the idea of seeing all the things we created being destroyed by the land we live on is fascinating.

It reminds us that our planet is alive, lives by itself, and in some respects needs help. We are small, but at the same time we can make a lot of changes to our planet.

Well, I really liked this imaginative documentary; here is the "trailer":

Here is the official site of the documentary.

Eduroam on Linux

12 04 2008

I have written about this topic before. However, I currently use an approach that I like better than the one I explained previously, and I would like to share it with anyone who is interested.

On Linux there is a file to store the network interface configurations, /etc/network/interfaces, so let's start by editing it:

# vi /etc/network/interfaces

In this file we add the following lines at the end:

iface eduroam inet dhcp
wpa-driver wext
wpa-ssid eduroam
wpa-eap TTLS
wpa-phase2 auth=PAP
wpa-group TKIP WEP104
wpa-pairwise TKIP WEP104
wpa-key-mgmt WPA-EAP
wpa-ap_scan 1

All these settings are exactly the same as the ones I mentioned in the old post, which were in the wpa_supplicant.conf file.

And here is why I prefer this configuration to the previous one.

Now we only need to do:

$ sudo ifdown eduroam

To bring down the configuration block shown above.

$ sudo ifup eduroam

To connect to the eduroam network.

🙂