Capture and dissect network traffic

1 04 2009

Currently I am doing research at the University of Minho in the group of distributed systems, with duration of one year. My job is to find a way to identify specific links between a user and a distributed system. The general idea is to draw a map of services in a distributed system. This post only refers to the first milestone.

The proposal was to make such a system using Snort.


Snort is a Network intrusion detection system, that means with Snort you can detect malicious activity in your network. We can detect many types of network attacks. We can identify DoS, DDoS attacks, port scans, cracking attempts, and much more.

Snort can operate in two different ways. We can set up Snort to run in passive mode, putting it to listen in promiscuous mode. That is, because Ethernet network switches send traffic to all computers connected to itself, we get traffic addressed to other machines on the network. To do this we only need to connect to the network and turn Snort on in our machine, no one knows that we are recording every traffic (including traffic destined for other computers).

Snort may also run in active mode. This “active” is not able to modify the data channel, but to be able to be installed in a network, a router for example and reap more information than in passive mode. Thus it makes sense to use the capacity of rules that Snort supports, to filter the traffic that it read.

To do this, Snort capture all packets that pass the network and interprets each. As the rules we have defined Snort tries to find these patterns in each packet, or each set of packets and take certain actions for each of them.

For example, if a large number of TCP requests reach a particular host, to a large number of ports in a short space of time we probably are the target of a port scan. NIDS like Snort know how to find these patterns and alerting the network administrator.


Our aim was to use Snort to capture all traffic into passive mode.

root@pig:# snort -u snort -g snort -D -d -l /var/log/snort -c /etcsnort/snort.debian.conf -i eth0

We are saving the logs in binary (tcpdump format), for that I use the “-d -l /dir/” flags. I prefer to save all the packets into binary because is more easier to parse, than the structure of files and directories that Snort creates by default.

I started by trying to use some language that advised me to try to do the parsing of the file created by snort. Initially started to use python, but only find a tcpdump parser and could not get more than one file translated in tcpdump to hexadecimal.
After that I tried to use Haskell and I was amazed!

Haskell and packet parsing

House is a Haskell Operative System done by The Programatica Project.

This is a system than can serve as a platform for exploring various ideas relating to low-level and system-level programming in a high-level functional language.

And indeed helped me a lot in doing my job. This project have already done a lot of parsers for network packets. It implements the Ethernet, IPv4, IPv6, TCP, UDP, ICMP, ARP and I think is all.

The libpcap (tcpdump parser) is already implemented in Haskell too, so is very simple to parse a complete packet:

getPacket :: [Word8] -> InPacket
getPacket bytes =  toInPack $ listArray (0,Prelude.length bytes-1) $ bytes

-- Ethernet | IP | TCP | X
getPacketTCP :: [Word8] -> Maybe (NE.Packet (NI4.Packet (NT.Packet InPacket)))
getPacketTCP bytes = doParse $ getPacket bytes :: Maybe (NE.Packet (NI4.Packet (NT.Packet InPacket)))

As you can see is too easy to have a compete structure of a packet parsed with this libraries. The problem is that they don’t have already implemented a application packet parser. So, according to that image:

This is the level of depth we can go with this libraries. What is very good, but not perfect for me :S

My supervisor told me to start searching a new tool to do this job. I was sad because I could not do everything in Haskell. But it is already promised that I will continue this project in Haskell. You can see the git repo here.

I find tshark, a great tool to dissect and analyze data inside tcpdump files.

The power of tshark

tshark is the terminal based Wireshark, with it we can do everything we do with wireshark.

Show all communications with the IP

root@pig:# tshark -R "ip.addr ==" -r snort.log
7750 6079.816123 -> SSHv2 Client: Key Exchange Init
7751 6079.816151 -> TCP ssh > 51919 [ACK] Seq=37 Ack=825 Win=7424 Len=0 TSV=131877388 TSER=1789588
7752 6079.816528 -> SSHv2 Server: Key Exchange Init
7753 6079.817450 -> TCP 51919 > ssh [ACK] Seq=825 Ack=741 Win=7264 Len=0 TSV=1789588 TSER=131877389
7754 6079.817649 -> SSHv2 Client: Diffie-Hellman GEX Request
7755 6079.820784 -> SSHv2 Server: Diffie-Hellman Key Exchange Reply
7756 6079.829495 -> SSHv2 Client: Diffie-Hellman GEX Init
7757 6079.857490 -> SSHv2 Server: Diffie-Hellman GEX Reply
7758 6079.884000 -> SSHv2 Client: New Keys
7759 6079.922576 -> TCP ssh > 51919 [ACK] Seq=1613 Ack=1009 Win=8960 Len=0 TSV=131877415 TSER=1789605

Show with a triple: (time, code http, http content size), separated by ‘,’ and between quotation marks.

root@pig:# tshark -r snort.log -R http.response -T fields -E header=y -E separator=',' -E quote=d -e frame.time_relative -e http.response.code -e http.content_length

Show a tuple of arity 4 with: (time, source ip, destination ip, tcp packet size).

root@pig:# tshark -r snort.log -R "tcp.len>0" -T fields -e frame.time_relative -e ip.src -e ip.dst -e tcp.len
551.751252000  48
551.751377000   144
551.961545000  48
551.961715000   208
552.682260000  48
552.683955000   1448
552.683961000   1448
552.683967000   512
555.156301000  48
555.158474000   1448
555.158481000   1400
556.021205000  48
556.021405000   160
558.874202000  48
558.876027000   1448

Show with a triple: (source ip, destination ip, port of destination ip).

root@pig:# tshark -r snort.log -Tfields  -e ip.src -e ip.dst -e tcp.dstport
...   37602   37602  22   37602  22  22   37602   37602   37602  22  22  22  22   37602   37602


Hierarchy of protocols

root@pig:# tshark -r snort.log -q -z io,phs
frame                                    frames:7780 bytes:1111485
  eth                                    frames:7780 bytes:1111485
    ip                                   frames:3992 bytes:848025
      tcp                                frames:3908 bytes:830990
        ssh                              frames:2153 bytes:456686
        http                             frames:55 bytes:19029
          http                           frames:5 bytes:3559
            http                         frames:3 bytes:2781
              http                       frames:2 bytes:2234
                http                     frames:2 bytes:2234
          data-text-lines                frames:10 bytes:5356
        tcp.segments                     frames:3 bytes:1117
          http                           frames:3 bytes:1117
            media                        frames:3 bytes:1117
      udp                                frames:84 bytes:17035
        nbdgm                            frames:50 bytes:12525
          smb                            frames:50 bytes:12525
            mailslot                     frames:50 bytes:12525
              browser                    frames:50 bytes:12525
        dns                              frames:34 bytes:4510
    llc                                  frames:3142 bytes:224934
      stp                                frames:3040 bytes:182400
      cdp                                frames:102 bytes:42534
    loop                                 frames:608 bytes:36480
      data                               frames:608 bytes:36480
    arp                                  frames:38 bytes:2046


We use: -z conv,TYPE,FILTER

TYPE could be:

  • eth,
  • tr,
  • fc,
  • fddi,
  • ip,
  • ipx,
  • tcp,
  • udp

And the filters are used to restrict the statistics.

root@pig:# tshark -r snort.log -q -z conv,ip,tcp.port==80
IPv4 Conversations
                                |           | |    Total    |
                                |Frames Bytes | |Frames Bytes | |Frames Bytes | 141    13091    202   259651    343   272742     22     6858     28     4784     50    11642


We use: -z io,stat,INT,FILTER,…,FILTER

root@pig:# tshark -r snort.log -q -z io,stat,300,'not (tcp.port=22)'
IO Statistics
Interval: 300.000 secs
Column #0:
                |   Column #0
Time            |frames|  bytes
000.000-300.000    2161    543979
300.000-600.000    1671    264877
600.000-900.000     508     46224
900.000-1200.000     185     12885
1200.000-1500.000     201     14607
1500.000-1800.000     187     13386
1800.000-2100.000     189     13887
2100.000-2400.000     187     13386
2400.000-2700.000     189     13887
2700.000-3000.000     187     13386
3000.000-3300.000     185     12885
3300.000-3600.000     189     13887
3600.000-3900.000     210     15546
3900.000-4200.000     189     13887
4200.000-4500.000     187     13386
4500.000-4800.000     185     12885
4800.000-5100.000     189     13887


With tshark we could do everything we want to know what is inside a network packet. The trick is to understand the statistics that tshark generate, and know how to ask it.

Now my work will get a machine to run Snort in an active mode and begin to understand how to use Snort to do all this work of collecting information.

If you feel interested and understand Portuguese, see the presentation:


SSH Login Attempts

11 01 2009

Back with honeypot news! We have launched our honeypot for 5 weeks, and now we have results to show you. In this post I will show you the attempts that attackers make to get into our ssh honeypot server.

The ssh honeypot was fustigated during these 5 weeks. Several attempts were made, about 78227, but no one successful.

Here is the graphic for usernames attempts:

And here is the graphic for password attempts:

Future Work

We will show all the rest of information that we capture on our honeypot in the future. We have discovered great stuff.
I have also done a nice program to generate statistics in Haskell using HaskellCharts, I will talk about that later too.

That’s all for now!

Secure connections to MySQL

18 11 2008

Together with Pedro Pereira we decided to investigate how MySQL make secure connections with clients. This is the first milestone of our msc in Cryptography.
It was proposed that we investigate the internal authentication process that MySQL do using X.509 certificates format.

This post gives a short introduction to tools and methods we use, Public-key cryptography, Certificates, OpenSSL, MySQL and VirtualBox.

We use the VirtualBox to install mysql, to avoid installing it in our OS. So, all the commands showed here have to maked in this virtual machine.

Configuring VirtualBox

As we said before, we installed MySQL in a virtual machine, so we decided access the virtual machine by ssh and remote connections to

NAT vs Port forward

By default the network connection in VirtualBox is made by Network Address Translation (NAT), i.e. each package that is sent by the guest machine is modified so that it appears to come from the host machine. Thus it is very easy to guest machine to connect with the entire network (including Internet), but never could start a connection from host machine to guest machine, since the interface of the guest
is hidden by the host machine.

To resolve this issue, and can access from host machine to the guest by ssh and the MySQL we decided to use the Port forward system that VirtualBox offers.
We have the guest machine running a ssh service accepting connections on port 22. Our goal is to make each package reaches a certain TCP port (eg 2222) on the host machine to to be redirected to TCP port 22 in guest machine.

The command that allows us to do this in VirtualBox is: VBoxManage. We make this with following commands, in which would be the name we gave to our guest machine:

shell> VBoxManage setextradata 
        "VBoxInternal/Devices/pcnet/0/LUN#0/Config/ssh/HostPort" 2222
shell> VBoxManage setextradata 
        "VBoxInternal/Devices/pcnet/0/LUN#0/Config/ssh/GuestPort" 22
shell> VBoxManage setextradata 
        "VBoxInternal/Devices/pcnet/0/LUN#0/Config/ssh/Protocol" TCP

From now every time we want to connect by ssh to the guest machine only run the following command in a shell:

shell> ssh -l  -p 2222 localhost

Similarly the same happens with MySQL connections. We want all packages targeted to port 3333 on host machine is redirected to the port 3306 of guest machine . So being able to access the MySQL that is installed on the guest machine:

shell> VBoxManage setextradata 
        "VBoxInternal/Devices/pcnet/0/LUN#0/Config/mysql/HostPort" 3333
shell> VBoxManage setextradata 
        "VBoxInternal/Devices/pcnet/0/LUN#0/Config/mysql/GuestPort" 3306
shell> VBoxManage setextradata 
        "VBoxInternal/Devices/pcnet/0/LUN#0/Config/mysql/Protocol" TCP

Public-key cryptography

The cryptography asymmetric system can be explained with the following analogy: a mailbox,
is accessible to the public through its address (public key), then anyone can send a
message for this box. Just who has the key to the box is the only who can read the messages (private key).

We only guarantee that any person can send encrypted messages to the owner of the mailbox. However we could not guarantee the identity of who recieve the message (the private key may have been compromised). We also can not guarantee the identity of the person who sent (Later we will see that the use of certificates resolve this problem).

For Bob send a message to Alice, he uses her public key to encrypt the message. This cryptogram is sent to Alice that decrypt with her private key.
Later, the Alice responds to Bob, encrypting the message with his public key.

In a different kind of use of public key can have a scenario in which Alice communicate with Bob, encrypts the message with her private key (digital signature) and encrypt it again with Bob’s public key. Thus, on the other side of the channel, Bob uses his private key and
subsequent Alice’s public key, thus obtaining the original clear text.

The cryptogram generated is an example of a symmetric cipher and is more robust than the previous scenario where only one key is used each time. Imagine now a case in which a third malicious actor, publish your public key and claim to be Alice. Thus it is likely someone who cheat and can read some of the messages intended for Alice. Although we have secure connections/strongly encrypted messages, there is still no guarantee the identity of any of the actors in the process of communication. In this context, we use X.509 certificates


An X.509 certificate of public key is an electronic document that can be compared to Identity card. However, instead of attaching a photo to the name of the person, it combines the key to their own (identity). But the certificate may not be issued by the concerned stakeholders because any one could falsify one certificate and claiming a false identity.

There we need that there is an entity (Certification Authority) trusted by both sides to ensure the identities of both. The CA ‘s sign (encrypt) the certificates with theirs private keys allowing validity to who decrypt the signature with their public key CA’s.

But if go to the top of hierarchy in the chain of certificates we will face a problem: who signs the CA certificate? The bottom line is we have always to belive in a entity, now users no longer communicate among themselves but the CA’s do that. The CA ‘s can sign their certificates, an example of a self-signed certificate that normally is a root Certificate.

Installing MySQL

The process we used here, was tested in a machine with Ubuntu 7.10 and 8.04:

shell> apt -get install mysql-server-5.0 mysql-client-5.0

The OpenSSL came compiled by default in the .deb package, but if we have to compile it we only would have to specify the following in the process of setting up the Makefile:

shell> ./ configure --with - openssl

Now, that we have instaled MySQL, we can go into it typing:

shell> mysql -h SERVER -u root -p

This way we got an uncrypted connection to the server, to obtain an encrypted you must add the option –ssl. This option when introduced on the server side means that the server will allow secure connections, in client-side allows to connect to the server via a secure connection. But this option alone is not enough, it is also necessary to introduce –ssl-ca and possibly the –ssl-cert and –ssl-key.
We have to enter with this flags if we not set the appropriate paths of certificates and their keys in the file /etc/mysql/my.conf.

But we’ll see below in more detail how to use these options, now just want to add a user “user” with the password “passwd” in the database “dBASE” located in “” demanding an SSL connection:

mysql > GRANT ALL PRIVILEGES ON dbase .* TO ’user’@’’
       IDENTIFIED BY ’passwd ’
       REQUIRE SUBJECT ’/CN=user ’
       AND ISSUER ’/CN=CA ’

The CIPHER part means the ciphers used for encryption and you should pick up the ciphers stronger because MySQL can use weaker ciphers.

Now, we get out of the MySQL administration program to demonstrate how to generate keys and certificates.

Generate certificates

We will demonstrate how to create a fictitious CA, generate certificates of potential clients/servers and pointed through the private key of CA, just like real in the process. First we create a tree of folders to contain the structuring of certificates:

shell> mkdir -m 755 

The CA folder represents the folder of our certification authority, the private folder will hold private keys; certs folder will have the clients/servers certificates, the newcerts is a required folder for the OpenSSL to store decrypted certificates, whose names will be their serial numbers; finally crl folder will keep the list of revoked certificates.
Now copy the default OpenSSL configuration file to our CA folder:

shell> cp /etc/ssl/openssl .cnf ~/teste/CA/myopenssl .cnf

and we change permission, allowing only the user can read and write:

shell> chmod 600 ~/teste/CA/myopenssl .cnf

We need to create two files, one will be the OpenSSl database:

shell> touch ~/teste/CA/index.txt

and the other, containing the serial numbers of each certificate. We don’t have anyone, so we put “01” in that file:

shell> echo '01' > ~/teste/CA/serial

Now run all commands in the folder ~/test/CA because is there we have the OpenSSL configuration file. The next step is to generate the self-signed CA certificate: generate the CA private key of 2048 bits (Today, less than 2048 bits is no longer considered completely safe).

shell> openssl genrsa -out private/ca-privkey.key 2048

if we want to check the contents of the key:

shell> openssl rsa -text -in private/ca-privkey.key

and if just generate a public key from private key:

shell> openssl rsa -pubout -in private/ca-privkey.key -out ca-publkey.key

Now we generate the certificate (valid for 365 days) and their public key and through private key we signed it:

shell> openssl req -config myopenssl.cnf -new -x509 -extensions v3_ca
        -key private/ca-privkey.key -out certs/ca-cert.crt -days 365

Note that the “Common Name” (CN) is the identifier that distinguishes the entity/person therefore has to be well written. In this case
CN = CA.

Now, if we want to verify the content of the certificate:

shell> openssl x509 -in certs/ca-cert.crt -noout -text

The private key must be stored under very strong permissions, only the root should be able to read it:

shell> chmod 400 private/ca-privkey.key

Then we change the OpenSSL configuration file (myopenssl.cnf) so that we have this information:

[ CA_default ]
dir              = .
certs            = $dir/certs
crl_dir          = $dir/crl
database         = $dir/index.txt
# unique_subject = no
new_certs_dir    = $dir/newcerts
certificate      = $dir/certs/myca.crt
serial           = $dir/serial
# crlnumber      = $dir/crlnumber
crl              = $dir/crl.pem
private_key      = $dir/private/myca.key
RANDFILE         = $dir/private/.rand
x509_extensions  = usr_cert

Now we can produce the client/server certificate:
we generate the private key and certificate request with the public key:

shell> openssl req -config myopenssl.cnf -new -newkey rsa:2048
        -nodes -keyout private/privkey.key -out cert-req.csr

Then we change the permissions of the new key as before. Note that the “Common Name” (CN) is the identifier that distinguishes a person/entity therefore has to be well written.
In this case CN = user.

we can verify the content of the request:

shell> openssl req -in cert-req.csr -noout -text

And with this command we sign the certificate:

shell> openssl ca -config myopenssl.cnf -cert certs/ca-cert.csr
        -keyfile private/ca-privkey.key -out certs/cert.crt
        -infiles cert-req.csr

This last command creates two additional files on certs folder. The cert.crt (signed certificate) and newcerts/01.pem (decrypted certificate). Naturally we would have to repeat the process for similar entity (client/server).
Right now we’re ready to connect with MySQL.

Connecting to MySQL

The cryptographic methods discussed in the first part of this port are situated in a context of communication. However there are many situations where we need to ensure a secure connection. One of those situations: you may want to connect to a remote database.

When accessing to a remote database anyone with access to the same network can inspect all traffic or worse, change it while passing between the client and server. We can however, use the option –compress on the client side to compress the traffic but still unencrypted and unsafe.
But as we said earlier, MySQL supports encrypted connections through the use of libraries of OpenSSL. Here we can see the MySQL Makefile’s SSL section:

Ln 318: openssl_includes = @openssl_includes@
Ln 319: openssl_libs = @openssl_libs@

So any kind of encryption/maintenance of certificates in MySQL is controlled by the functions that are part of the OpenSSL API.

Configuring SSL in MySQL

To ensure the authenticity can be assured we add the following lines to /etc/mysql/my.conf:

[ client ]
ssl -ca=/home/user/teste/certs/ca-cert.crt
ssl -cert =/home/user/teste/certs/cert.crt       #(client)
ssl -key =/home/user/teste/private/privkey.key   #(client)
[ mysqld ]
ssl -ca=/home/user/teste/certs/ca-cert.crt
ssl -cert =/home/user/teste/certs/cert.crt       #(server)
ssl -key =/home/user/teste/private/privkey.key   #(server)

Consider the initial situation in the role of client, we can access to the server, but now in a secure way. Then:

shell> mysql -h SERVER -u USER -p --ssl

If everything went well we now can connect via a secure connection and authenticated using X.509 certificates.

mysql > show variables like '%ssl%';
| Variable_name | Value                                  |
| have_openssl  | YES                                    |
| have_ssl      | YES                                    |
| ssl_ca        | /home/user/test/certs/ca-cert.crt      |
| ssl_capath    |                                        |
| ssl_cert      | /home/user/test/certs/server-cert.crt  |
| ssl_cipher    |                                        |
| ssl_key       | /home/user/test/private/server-key.key |
7 rows in set (0.11 sec)

As a final note, of this part, we mention that the whole process of this part refers to only one user, to another we must repeat everything, of course.

SSL Program

As extra, we decide to implement a simple program that use SSL connections in JAVA.

We found that the MySQL Connector/J supports some properties that are useful to establish SSL connections.

The property useSSL tells the server that we use a secure connection.
In this case the user ssluser was created with the command GRANT … REQUIRE SSL, ensuring that
can only connect by SSL.

import com.mysql.jdbc.*;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Properties;
import java.util.logging.Level;
import java.util.logging.Logger;

public class Main {

    public static void main(String[] args) {
        Connection conn = null;

        try {
            String userName = "ssluser";
            String password = "password";


            String url = "jdbc:mysql://localhost:3333/mysql" //port 3306 of guest machine
                    + "?useSSL=true";

            conn = (Connection) DriverManager.getConnection(url, userName, password);
            Statement stmt = (Statement) conn.createStatement();

            ResultSet rs = stmt.executeQuery("select User,Host,ssl_type from mysql.user;");

            while ( {
                System.out.print(rs.getString(1) + " ");
                System.out.print(rs.getString(2) + " ");
                System.out.println(rs.getString(3) + " ");

        } catch (SQLException e) {
            System.out.println("SQLException: " + e.getMessage());
            System.out.println("SQLState: " +  e.getSQLState());
            System.out.println("VendorError: " + e.getErrorCode());
        } catch (Exception ex) {
        } finally {
            if (conn != null) {
                try {
                    System.out.println("Database connection terminated");
                } catch (Exception e) {  }

This simple program run well, it print the above table to stdout.

We wanted to implement the same application using certificates, but not, we have errors for which no solution yet found. The documentation, unfortunately not worked for us.

Anyway, as a great experience find everything we describe in this post. We learned a lot about cryptography …

Top 10 programs – Haskell version

20 04 2008

Last week I saw on Tom Moertel’s Blog the command to show the 10 most used programs:

history | awk '{print $2}' |sort | uniq -c | sort -rn | head

I do not show my top10 because I was expecting to have time to do a Haskell version 🙂

Based on .bashrc of Joachim Breitners I came up with this idea:

history | hmapw 'drop 1 . take 2' | hmapl List.sort | uniq -c | hmapl 'take 10 . List.reverse . List.sort'

But this is a kind of cheating, because I’m using the unix uniq.

So, I changed Joachim Breitners .bashrc to this:

if which ghc > /dev/null
        function ust { ghc-6.8.1 "-e interact ($*)" ~/Ust.hs ; }
        function ustmapl { ust "unlines.($*).lines" ; }
        function ustmapw { ustmapl "map (unwords.($*).words)" ; }

        function hmap { ghc-6.8.1 "-e interact ($*)" ;  }
        function hmapl { hmap  "unlines.($*).lines" ; }
        function hmapw { hmapl "map (unwords.($*).words)" ; }

And based on and Unix Simple Tools I did Ust.hs:

module Ust(tail10, pick, uniq_c) where

import Control.Monad.Instances
import Data.List

-- @
-- return the last ten lines of a file
tail10  = drop =<< subtract 10 . length

pick n = (:[]) . (!!n)

uniq_c l = [ nl (tam l) i s | (s,i) <- uniq_c' l]

tam = maximum . map snd . uniq_c'

uniq_c' [] = []
uniq_c' (h:t) = let (list,rest) = span (==h) t
                    n = length list + 1
                in (h,n) : uniq_c' rest

nl tam n line = let l = length $ show n
                    l_tam = length $ show tam
                    n' = replicate (l_tam-l) " "
                in concat n' ++ show n ++ " " ++ line

And here it is, the Haskell version:

history | sutmapw 'pick 1' | sutmapl 'reverse . tail10 . sort . uniq_c . sort'


pick 1 ["a","b","c"] = "b"
reverse [1,2,3] = [3,2,1]
-- 'tail10' return the 10 last elements of a list
-- 'sort' you know...
-- uniq_c = uniq -c (from unix)

sutmapl convert a String (with a lot of ‘n’) to the list of Strings separated by ‘n’.
sutmapw converts a list of String in a list of lists of Strings separated by ‘t’ or ‘ ‘.

The composition in Haskell works in the same way as the mathematical composition.
In the Unix console the pipe acts like (;), so we must reverse the parameters of the composition (.) to be equal to (;).

Btw, here is my top10:

623 make
433 sudo
380 cd
247 xpdf
176 ./game
175 ./client
129 man
117 ls
76 ./server
11 history

Lexical analysis Wikipedia

13 04 2008


This year I started to learn processing languages. I started by regular expressions and past few days I began to study the Flex, as with regular expressions we can’t create text filters.
The first job I did was a kind of a dictionary. Getting a source of words and a faithful translation, add all the information in one document.

The problem was finding a good document with many Portuguese words and their translations into English.

With this post I want to teach what I learned from this work.

Wikipedia XML structure

I started by picking up the last dump of Wikipedia-PT and “decipher” the XML where it is stored. The structure is something like that:


And each page tag, expanded, have this structure:

      <text xml:space="preserve">WIKIPEDIA_ENTRY</text>


TITLE is the Portuguese word, because the Wikipedia I’ve downloaded is in Portuguese.
But so far, no English word.


<text xml:space="preserve">

Here we have the ENGLISH variable that is the corresponding word in English to TITLE. I also want the CATEGORY variable that indicates from what category this entry belong.

As some entries in the Wikipedia have multiple categories lines I am also interested in keep them all.
I want that the output of my program became something like that:

Categoria - CATEGORY11
Categoria - CATEGORYi1

Some entries in the Portuguese Wikipedia do not have the English correspondent version so I will not want those entries.


A Lex file have this aspect:

user code

Let’s focus in rules part:

Rules have the following structure;

REGEX    code
REGEX    code

I know Regular Expressions (REGEX) now, so let’s start to build that thing!

I got me to realize that the part of the TITLE may have not only the entries name, also has the Wikipedia contributors pages, among other things that do not interest me to save.
I start to do a list of all those pages:

So, I have to, somehow, throw away all the pages with the following structure:

      <text xml:space="preserve">ENTRY</text>

After a dozen of lines I start to understand that I have to some how “explain” Lex the structure of the Wikipedia XML file. That way will be easier.

I start to read the Flex manual and I find the Start Conditions, a very clever way to treat a block of information.

Languages like C, HTML and XML are structured by blocks, so Start Conditions may be the way to easily get the information from those.

Start Conditions

If you have a block of code that have this aspect:


Our block start with “” and then have a name (that I want) until ‘>’ and then ended with “” string.

So in Lex, we must use Start Conditions and produce the following code:

%x title_sc
anything .|[\n\t\r]
"<title>"            BEGIN(title_sc);
<title_sc>[^>]+      {printf("title=%s\n",yytext);}
<title_sc>"</title>" BEGIN(INITIAL);
{anything}           {;} /* do nothing*/
main() {

The %x title_sc declaration is to declare a state exclusive, that means; while we are inside code_sc Flex won’t look rules outside of that, until BEGIN(INITAIL).

In definitions part we can declare variables anything .|[\n\t\r] to use in rules part as {anything}.

The BEGIN(title_sc) statement makes Lex jump to the first line of rule and it start to match the rules that finds there.

We can rewrite the above code like that:

%x title_sc
anything .|[\n\t\r]
"<title>"            BEGIN(title_sc);
        [^>]+        {printf("title=%s\n",yytext);}
        "</title>"   BEGIN(INITIAL);
{anything}           {;} /* do nothing*/
main() {

When Lex find BEGIN(INITIAL) statement it jumps to the first BEGIN(XXX), so we can never be able to use block’s inside other block’s (like XML does).

Of course that’s not true.

Start Conditions inside Start Conditions

Lex have a brilliant way to deal with that. It uses Stack’s to store the state were we are!

The idea is something like that, imagine a mathematical expression:


I can say:

  • 2 or [1,2]
  • 3 or [2,1]
  • 5 or [2,2,2]
  • and so on…

That’s all about keeping the path, our actual position in the tree.

So, now we replace the BEGIN(state) to yy_push_state(state), and to go to the previously block I say yy_pop_state().

With that now I can read structures like that one:

      <text xml:space="preserve">

And to do so, I write that Lex code:

%option stack
anything .|[\n\t\r]
notPage ["Wikipedia""Usuário""Discussão""Ajuda""Anexo""MediaWiki""Categoria"]
"<page>"                        yy_push_state(PAGE);
        "<title>"{notPagina}    yy_pop_state(); // not a valid page
        "<title>"               yy_push_state(TITLE);
        "<text"[^>]+">"         yy_push_state(TEXT);
        "</page>"               yy_pop_state();
        {anything}              /* do nothing */
        "[["[cC]"ategoria:"     yy_push_state(CATEGORIA);
        "[[en:"                 yy_push_state(EN);
        "</text>"               yy_pop_state();
        {anything}              /* do nothing */
        [^<]+                   {
                                        imprime(cat, pt, en);
                                        pt=NULL; en=NULL;
        "</title>"              yy_pop_state();
        {anything}              /* do nothing */
        [^\]]+                  en=strdup(yytext);
        [\]]+                   yy_pop_state();
        "]"\n{anything}         /* do nothing */
        [ \#\!\*\|]+            yy_pop_state();
        [^\]\|\n]+              {
        [\]]+                   yy_pop_state();
        "]"\n{anything}         /* do nothing */
{anything}                      /* do nothing */
int main() {
        return 0;

As you can see we are all the time matching a rule that makes lex jump to another state until one terminal rule (in this case variables affectations) or until pop.

To see all the code go here(.lex).

If you understand Portuguese and feel interested in more information you can go here(.html).


flex: The Fast Lexical Analyzer

Configurar eduroam wpa_supplicant

9 11 2007

Como muita gente vem ter aqui ao blog por pesquisar “eduroam linux uminho” ou coisas parecidas, aqui vai o método que eu usei.

Configurações para a rede eduroam na UMinho por wpa_supplicant para Linux, funciona em Ubuntu{7.04, 7.10}, não testei em mais nenhuma distro, mas certamente funciona em qualquer Linux:

Sacar estes 3 ficheiros.

# wget -r --no-parent -R "*.html*"

# cd

Copiar o ficheiro cacert-scom.cer, interfaces e wpa_supplicant.conf para os devidos sitios:

# cp cacert-scom.cer /etc/ssl/certs/
# cp interfaces /etc/network/
# cp wpa_supplicant.conf /etc/wpa_supplicant/

No ficheiro wpa_supplicant.conf substitui-se ALUNO pelo número de aluno e PASSWORD pela respectiva.

Está feito! De seguida apenas temos que nos autenticar na rede, desta forma, onde INTERFACE é, na maioria dos casos eth1.

# iwconfig INTERFACE essid eduroam enc open
# wpa_supplicant -Dwext -i INTERFACE -c /etc/wpa_supplicant/wpa_supplicant.conf

depois de receber uma mensagem de sucesso de autenticação apenas temos que pedir um IP;

# dhclient INTERFACE -r
# dhclient INTERFACE

Agora sim… com isto já devemos de ter net na Universidade do Minho.

Ainda existe um outro método, que está comentado no ficheiro interfaces, mas esse método serve-se do acima descrito.

As configurações para Windows estão bem mais facilitadas, neste zip.


27 10 2007

Tenho reparado que muitas das pesquisas sobre configuração da rede eduroam estão a ser encaminhadas para o blog, como disse já tenho a rede a funcionar bem no meu portátil. Na altura não expliquei como se configurava em linux porque o Sr. Amândio Gomes disse que ele o iria fazer e meter no campusvirtual, fui lá hoje ver e pelos vistos já lá está desde 22 de Outubro.

Aqui a configuração para Fedora Core (5/6/7) e Ubuntu (7.04), creio que também funciona em 7.10, se experimentarem digam se funcionou. Tenho o 7.04 e foi essa a configuração que na altura fiz.