Capture and dissect network traffic

1 04 2009

Currently I am doing research at the University of Minho in the group of distributed systems, with duration of one year. My job is to find a way to identify specific links between a user and a distributed system. The general idea is to draw a map of services in a distributed system. This post only refers to the first milestone.

The proposal was to make such a system using Snort.


Snort is a Network intrusion detection system, that means with Snort you can detect malicious activity in your network. We can detect many types of network attacks. We can identify DoS, DDoS attacks, port scans, cracking attempts, and much more.

Snort can operate in two different ways. We can set up Snort to run in passive mode, putting it to listen in promiscuous mode. That is, because Ethernet network switches send traffic to all computers connected to itself, we get traffic addressed to other machines on the network. To do this we only need to connect to the network and turn Snort on in our machine, no one knows that we are recording every traffic (including traffic destined for other computers).

Snort may also run in active mode. This “active” is not able to modify the data channel, but to be able to be installed in a network, a router for example and reap more information than in passive mode. Thus it makes sense to use the capacity of rules that Snort supports, to filter the traffic that it read.

To do this, Snort capture all packets that pass the network and interprets each. As the rules we have defined Snort tries to find these patterns in each packet, or each set of packets and take certain actions for each of them.

For example, if a large number of TCP requests reach a particular host, to a large number of ports in a short space of time we probably are the target of a port scan. NIDS like Snort know how to find these patterns and alerting the network administrator.


Our aim was to use Snort to capture all traffic into passive mode.

root@pig:# snort -u snort -g snort -D -d -l /var/log/snort -c /etcsnort/snort.debian.conf -i eth0

We are saving the logs in binary (tcpdump format), for that I use the “-d -l /dir/” flags. I prefer to save all the packets into binary because is more easier to parse, than the structure of files and directories that Snort creates by default.

I started by trying to use some language that advised me to try to do the parsing of the file created by snort. Initially started to use python, but only find a tcpdump parser and could not get more than one file translated in tcpdump to hexadecimal.
After that I tried to use Haskell and I was amazed!

Haskell and packet parsing

House is a Haskell Operative System done by The Programatica Project.

This is a system than can serve as a platform for exploring various ideas relating to low-level and system-level programming in a high-level functional language.

And indeed helped me a lot in doing my job. This project have already done a lot of parsers for network packets. It implements the Ethernet, IPv4, IPv6, TCP, UDP, ICMP, ARP and I think is all.

The libpcap (tcpdump parser) is already implemented in Haskell too, so is very simple to parse a complete packet:

getPacket :: [Word8] -> InPacket
getPacket bytes =  toInPack $ listArray (0,Prelude.length bytes-1) $ bytes

-- Ethernet | IP | TCP | X
getPacketTCP :: [Word8] -> Maybe (NE.Packet (NI4.Packet (NT.Packet InPacket)))
getPacketTCP bytes = doParse $ getPacket bytes :: Maybe (NE.Packet (NI4.Packet (NT.Packet InPacket)))

As you can see is too easy to have a compete structure of a packet parsed with this libraries. The problem is that they don’t have already implemented a application packet parser. So, according to that image:

This is the level of depth we can go with this libraries. What is very good, but not perfect for me :S

My supervisor told me to start searching a new tool to do this job. I was sad because I could not do everything in Haskell. But it is already promised that I will continue this project in Haskell. You can see the git repo here.

I find tshark, a great tool to dissect and analyze data inside tcpdump files.

The power of tshark

tshark is the terminal based Wireshark, with it we can do everything we do with wireshark.

Show all communications with the IP

root@pig:# tshark -R "ip.addr ==" -r snort.log
7750 6079.816123 -> SSHv2 Client: Key Exchange Init
7751 6079.816151 -> TCP ssh > 51919 [ACK] Seq=37 Ack=825 Win=7424 Len=0 TSV=131877388 TSER=1789588
7752 6079.816528 -> SSHv2 Server: Key Exchange Init
7753 6079.817450 -> TCP 51919 > ssh [ACK] Seq=825 Ack=741 Win=7264 Len=0 TSV=1789588 TSER=131877389
7754 6079.817649 -> SSHv2 Client: Diffie-Hellman GEX Request
7755 6079.820784 -> SSHv2 Server: Diffie-Hellman Key Exchange Reply
7756 6079.829495 -> SSHv2 Client: Diffie-Hellman GEX Init
7757 6079.857490 -> SSHv2 Server: Diffie-Hellman GEX Reply
7758 6079.884000 -> SSHv2 Client: New Keys
7759 6079.922576 -> TCP ssh > 51919 [ACK] Seq=1613 Ack=1009 Win=8960 Len=0 TSV=131877415 TSER=1789605

Show with a triple: (time, code http, http content size), separated by ‘,’ and between quotation marks.

root@pig:# tshark -r snort.log -R http.response -T fields -E header=y -E separator=',' -E quote=d -e frame.time_relative -e http.response.code -e http.content_length

Show a tuple of arity 4 with: (time, source ip, destination ip, tcp packet size).

root@pig:# tshark -r snort.log -R "tcp.len>0" -T fields -e frame.time_relative -e ip.src -e ip.dst -e tcp.len
551.751252000  48
551.751377000   144
551.961545000  48
551.961715000   208
552.682260000  48
552.683955000   1448
552.683961000   1448
552.683967000   512
555.156301000  48
555.158474000   1448
555.158481000   1400
556.021205000  48
556.021405000   160
558.874202000  48
558.876027000   1448

Show with a triple: (source ip, destination ip, port of destination ip).

root@pig:# tshark -r snort.log -Tfields  -e ip.src -e ip.dst -e tcp.dstport
...   37602   37602  22   37602  22  22   37602   37602   37602  22  22  22  22   37602   37602


Hierarchy of protocols

root@pig:# tshark -r snort.log -q -z io,phs
frame                                    frames:7780 bytes:1111485
  eth                                    frames:7780 bytes:1111485
    ip                                   frames:3992 bytes:848025
      tcp                                frames:3908 bytes:830990
        ssh                              frames:2153 bytes:456686
        http                             frames:55 bytes:19029
          http                           frames:5 bytes:3559
            http                         frames:3 bytes:2781
              http                       frames:2 bytes:2234
                http                     frames:2 bytes:2234
          data-text-lines                frames:10 bytes:5356
        tcp.segments                     frames:3 bytes:1117
          http                           frames:3 bytes:1117
            media                        frames:3 bytes:1117
      udp                                frames:84 bytes:17035
        nbdgm                            frames:50 bytes:12525
          smb                            frames:50 bytes:12525
            mailslot                     frames:50 bytes:12525
              browser                    frames:50 bytes:12525
        dns                              frames:34 bytes:4510
    llc                                  frames:3142 bytes:224934
      stp                                frames:3040 bytes:182400
      cdp                                frames:102 bytes:42534
    loop                                 frames:608 bytes:36480
      data                               frames:608 bytes:36480
    arp                                  frames:38 bytes:2046


We use: -z conv,TYPE,FILTER

TYPE could be:

  • eth,
  • tr,
  • fc,
  • fddi,
  • ip,
  • ipx,
  • tcp,
  • udp

And the filters are used to restrict the statistics.

root@pig:# tshark -r snort.log -q -z conv,ip,tcp.port==80
IPv4 Conversations
                                |           | |    Total    |
                                |Frames Bytes | |Frames Bytes | |Frames Bytes | 141    13091    202   259651    343   272742     22     6858     28     4784     50    11642


We use: -z io,stat,INT,FILTER,…,FILTER

root@pig:# tshark -r snort.log -q -z io,stat,300,'not (tcp.port=22)'
IO Statistics
Interval: 300.000 secs
Column #0:
                |   Column #0
Time            |frames|  bytes
000.000-300.000    2161    543979
300.000-600.000    1671    264877
600.000-900.000     508     46224
900.000-1200.000     185     12885
1200.000-1500.000     201     14607
1500.000-1800.000     187     13386
1800.000-2100.000     189     13887
2100.000-2400.000     187     13386
2400.000-2700.000     189     13887
2700.000-3000.000     187     13386
3000.000-3300.000     185     12885
3300.000-3600.000     189     13887
3600.000-3900.000     210     15546
3900.000-4200.000     189     13887
4200.000-4500.000     187     13386
4500.000-4800.000     185     12885
4800.000-5100.000     189     13887


With tshark we could do everything we want to know what is inside a network packet. The trick is to understand the statistics that tshark generate, and know how to ask it.

Now my work will get a machine to run Snort in an active mode and begin to understand how to use Snort to do all this work of collecting information.

If you feel interested and understand Portuguese, see the presentation:


31 07 2008


Since long ago that I wanted to write about this. Finally I now have the time to write about syntactic analyzers generators.

When I was studding yacc noticed that there was much information available, not something one would expect of a tool with 40 years. Unfortunately there was no quick learning for someone who already knew the whole theory behind this kind of tools. As I said, the yacc is a very old tool, made when was needed compilers. Before there the whole theory of that area that we know today. So it is normal small errors transform themselves into disasters that may take a long time to be resolved.

This post is for those who already know what grammars are, lexical analyzers, etc., but that never touch in Lex/yacc or for those who are stuck with this things (I hope).

In this post I will explain to you Lex/yacc, with an example of a program that I did. I will not boring you with massive lines of code (like I usually do), because Lex/yacc is just a simple tool to use. With just 60 lines of explained code you will see how great that is.

The problem

My idea was to do a simple site with all the information about all the cities in Portugal. After reminded me to so but to all parishes (Freguesia in Portuguese), counties(Concelho in Portuguese) and districts(Distrito in Portuguese) of Portugal and their relationship. A district has many counties and one county has many parishes.

Simple! What I needed now was all the names of all parishes, counties and districts of Portugal, I eventually found three files with all this information and more in the Portuguese company of post office website. Better yet, these files had the relationship between each division. I have to do a small program in C that take those three files and create a new file, which I use on the Lex / yacc.

In order to have all available information on each local od Portugal I decided to use Wikipedia. This is the Achilles heel of the program, because unfortunately the entries of wikipedia on the Portuguese parishes are not uniform and many of the parishes or have entry. But still able to have good results.

The structure of the new file generated is:



It have about 43000 lines, the number of parishes in Portugal.


An Yacc file is just like an lex file, it is divided in three parts:


I Will explain later on, what to put in all those three parts.

The solution

I start to write the following grammar to describe my new file:

OrGeo     -> Districts

Districts -> District '|'
          | Districts District '|'

Distrito  -> IdD Link Counties

Counties -> Countie
          | Counties ';' Countie

Countie  -> Parishes '!' IdC Link

Parishes    -> Parish
          | Parishes ',' Parish

Link      -> '>' l

IdD       -> id

IdC       -> id

Parish     -> IdL Link

IdL       -> id

As id is a name (of the District, Countie or Parish), we will declare it in yacc as a pointer to chars (vals). To do that we create a union like in the first part of yacc file (DECLARATIONS), and add a association with vals with id and l:

%union {
        char *vals;

%token  id l

Yacc use that union because you can declare as many associations as you want. We must refer to that union in the lex as yylval.

This is the same as use:

%token  id
%token  l

We now go to lex file and add the rule that the lexical analyzer will meet when find text.

[ a-zA-ZÀ-ú-'.()/'`0-9]+  { yylval.vals = strdup(yytext);
                                    return id; }
[a-zA-ZÀ-ú-'.()/'`0-9_:]+ { yylval.vals = strdup(yytext);
                                    return l; }

Here we are saying that when lex find some text that fills in that regular expression it will return to yacc as an id, so that way we find the names of our cities, or links.

As you can see we have special symbols in our grammar, (!>,;|), so we need to say to lex to return them to yacc, where we need them:

[!>,;|]                           { return yytext[0]; }

I also will say to lex ignore all n and t characters:

[tn]                            { ; }

Making our grammar powerfull

Now our grammar will suffer some adjustments; we will say yacc what to do when it was in some of the derivations of some rule:

OrGeo     : Districts { d = $1; }
Districts : District '|' { $$ = $1; }
          | Districts District '|' { $$ = catDistricts($1,$2); }
District  : IdD Link Counties { $$ = addDistrict($1, $2, $3); }
Counties : Countie { $$ = $1; }
          | Counties ';' Countie { $$ = catCounties($1, $3); }
Countie  : Parishes '!' IdC Link { $$ = addCountie($1, $3, $4); }
Parishes    : Parish { $$ = $1; }
          | Parishes ',' Parish { $$ = catParishes($1, $3); }
Link      : '>' l { $$ = $2; }
IdD       : id { $$ = $1; }
IdC       : id { $$ = $1; }
Parish     : IdL Link { $$ = addParish($1, $2); }
IdL       : id { $$ = $1; }

Here we tell yacc how to behave when pass certain derivation of a rule.
We can tell yacc that some rule can return a data type, for example:

#include "district.h"

District *d;

%union {
        char *vals;
        Parish *valf;
        Countie *valc;
        District *vald;

%type   Link IdL IdD IdC
%type   Parish Parishes
%type   Countie Counties
%type   District Districts

To return something in a rule we refer to that rule as $$, that mean IdL in

IdL       : id

and $1 to refer to that id, and so on. So, that way we can say that $$ = $1, that means, IdL = id.

Funtions catDistricts, addDistrict, addParish, addCountie, catCounties and catParishes, are just functions to create linked list’s and append new elemt’s in one already existent linked list.

The result is a html page with nested liked lists, here is the result.


All the code, including the code to generate the new file, used to yacc is available here, fell free to use it.