Unsolicited advice about stackframes
Stacks are everywhere
In my daily life I do a lot of debugging, reverse engineering, and software development. For some of these tasks, understanding how computers work in unnecessary detail is, well, sometimes useful. In particular, stacks and stackframes are very central to how most modern computing works. And, annoyingly, the implementation is slightly different depending on the CPU architecture, compiler and operating system, but also on the CPU generation and compiler options.
Every time I need to dig into it, I have to refresh my memory or re-understand how it works on a specific setup. I may save a lot of time, for myself and hopefully for others, if I start writing it down. So, here you go, my currently very incomplete guide about how stacks and stackframes work!
This article does not attempt to retrace the history of how all features of a computer call stack work. For that I can recommend Sten Henriksson's A Brief History of the Stack, which will take you all the way to Babylonian folklore, with a brief detour through tax avoidance schemes. Here I will introduce concepts in a way that seems convenient for explaining them.
Note: I start this article from the very beginning. Maybe the relevant bits for you will be near the end. But this allows me to introduce the notations and concepts progressively, make the article somewhat standalone, and hopefully answer more questions than it opens. Skip some sections if you feel bored, come back to them later if you are lost, let me know if more details about some parts are needed!
What's a stack?
A stack is a data structure where you can only access the top element. More precisely, you can:
- push a new element to the top of the stack. It is stored "above" the previous one.
- pop the top element from the stack. The previous "top" becomes visible again.
- peek at the top element to get its value. This does not change the stack.
Of course, popping and peeking is not allowed if the stack is empty.
In its simplest form, a stack is just a pointer to the top element. From now on we will not consider linked stacks, as these are rarely used in modern computers. Instead we will consider that the elements are stored next to each other, in consecutive memory addresses. This results in the following definition for the 3 operations:
- peek is just dereferencing the stack pointer,
- pop is decrementing the stack pointer,
- push is incrementing the stack pointer and writing the new value there.
From now on, the stack pointer will be notated as SP.
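To make these definitions concrete, here is a minimal sketch in C of such an array-backed stack. The fixed capacity and the function names are just illustrative, and I use an index rather than a raw pointer as SP so that the "empty stack" case stays simple:

#include <assert.h>

#define STACK_CAPACITY 64

static int stack[STACK_CAPACITY];
static int sp = -1;                 /* SP: index of the top element, -1 means "empty" */

void push(int value)
{
    assert(sp + 1 < STACK_CAPACITY);
    stack[++sp] = value;            /* increment SP, then store the new element "above" the old top */
}

int pop(void)
{
    assert(sp >= 0);                /* popping an empty stack is not allowed */
    return stack[sp--];             /* read the top element, then decrement SP */
}

int peek(void)
{
    assert(sp >= 0);                /* peeking an empty stack is not allowed either */
    return stack[sp];               /* just look at what SP points to; the stack is unchanged */
}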
Call stacks
The concept of a call stack may seem obvious, as it is deeply engraved into how modern programming languages and CPUs work. It was not always this way. Going without a stack, or at least allowing yourself other ways to manage control flow in your code, would allow for other ways to think about it, the most popular of these probably being coroutines.
Anyway, let's look at how the functional/imperative programming paradigm that dominates nowadays typically uses a stack for control flow.
Usually, a CPU executes instructions one after another, in the order they are stored in memory. But if only that were allowed, computers would be a lot less smart. Some instructions (usually called "control flow instructions") allow the CPU to "jump" around in the code. In particular, let's have a look at the CALL and RETURN instructions (on z80 and x86 these are called CALL and RET; other CPU architectures may use different names such as BSR (branch to subroutine) or implement things slightly differently. More details about specific architectures are in later sections of the article).
These instructions make it possible to implement the notion of a procedure or subroutine call. The idea is simple: there are some tasks in a program that need to be performed several times in different contexts. Repeating the same code over and over in many places would make the software difficult to maintain, as all these copies of the code would have to be kept in sync, and, moreover, it would be a waste of memory space. So, let's have only one copy of this code. Whenever we need to run that code, we can call into it by using the CALL instruction. When the code is finished, it uses the RETURN instruction to tell the CPU, and execution resumes in the initial code flow, just after the CALL.
One important observation here is that the code in the subroutine does not make any assumptions about where it was called from. The RETURN instruction is just that, RETURN. It does not have any parameters. So how does this work? Well, the CALL instruction stores the return address, and the RETURN instruction retrieves it. Of course, we are not limited to only one level of CALL and RETURN. The subroutine may itself need to call more subroutines, and so on. So, these addresses are stored on a stack, with the CALL instruction doing a PUSH, and RETURN doing a POP.
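As a rough sketch of what happens behind the scenes, here is the CALL/RETURN mechanism written out in C. The names (do_call, do_return, jump_to) and the explicit return-address stack are hypothetical: a real CPU does all of this in hardware, as part of a single instruction.

typedef unsigned long address_t;        /* hypothetical: a code address */

static address_t return_stack[64];      /* one return address per nested call */
static int rsp = -1;

/* Hypothetical: on a real CPU this is the hardware transferring execution. */
static void jump_to(address_t target) { (void)target; }

/* CALL: remember where to come back to, then jump to the subroutine. */
void do_call(address_t subroutine, address_t next_instruction)
{
    return_stack[++rsp] = next_instruction;     /* PUSH the return address */
    jump_to(subroutine);
}

/* RETURN: no parameters needed, just go back to wherever the last CALL came from. */
void do_return(void)
{
    jump_to(return_stack[rsp--]);               /* POP the return address and jump there */
}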
Parameters and recursion
This is all nice and fine, but it only works if the subroutine needs to perform exactly the same work each time it is called. However, this is unlikely to be the case. Typically it would need to perform the same operations, but work on different data. For this, we need to be able to pass some information to the subroutine so it knows on which data to operate. These are called parameters.
There are many ways to achieve this. For example, we could put the parameters in CPU registers or fixed memory locations (agreed in advance between the subroutine and the caller). CPU registers have an obvious limitation: there are only so many of them; what if a function needs more parameters, or parameters too large to fit in registers? So let's look closely at the other option and see what it would look like.
For each subroutine, we will dedicate a small amount of memory for its arguments. Before calling, the caller code will set these to the parameter values, and the subroutine can access them as needed. Problem solved, right?
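As a sketch, in C it could look like the following. The draw_line subroutine and its parameter names are made up; what matters is that each parameter has a fixed memory location reserved for it:

/* Fixed memory locations reserved for the parameters of draw_line(). */
static int draw_line_x0, draw_line_y0, draw_line_x1, draw_line_y1;

void draw_line(void)
{
    /* ...works directly on draw_line_x0 .. draw_line_y1... */
}

void caller(void)
{
    /* "Pass" the parameters by writing to the agreed-upon locations... */
    draw_line_x0 = 0;  draw_line_y0 = 0;
    draw_line_x1 = 10; draw_line_y1 = 20;
    /* ...then call the subroutine. */
    draw_line();
}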
Unfortunately, there are two limitations with this approach. One is just an inefficiency problem, but the other is more structural and makes some things impossible. Let's start with the first one.
The inefficiency here is that there are usually not that many subroutines running at the same time. However, the memory allocated to each subroutine parameter is static: it is reserved for the use of that subroutine's parameters, and can't be used for anything else, even when the subroutine isn't running. Usually, subroutines don't take a lot of parameters, so each of them only uses up a small amount of memory, but there will be a lot of subroutines, and this memory usage adds up.
But let's not forget the other, more structural problem: what if we need to run the same subroutine multiple times at once? At first this may seem impossible, since the code execution is generally linear, but there are two cases where this can happen. The first one is recursion: a subroutine may call itself to perform some work. The second one is multitasking. This may happen because you have a system with multiple execution threads, or more simply while handling hardware interrupts, which stop the currently running code to do something else.
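Recursion makes the problem easy to see. In the sketch below (a classic factorial, just for illustration), each nested call needs its own private copy of n. If n lived at a single fixed address shared by all invocations, as in the scheme sketched above, the inner call would overwrite the outer call's value before the multiplication gets a chance to use it:

unsigned factorial(unsigned n)
{
    if (n <= 1)
        return 1;
    /* The recursive call must not clobber our n: we still need it
       for the multiplication after the call returns. */
    return n * factorial(n - 1);
}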
Both of these problems can be pretty annoying: one to the hackers who hate wasting even a single byte of memory, and the other to computer theorists who like to express their algorithms in recursive form for elegance, or really want to play with concurrent threads of execution on the same machine. As a result, everyone was annoyed and a solution was found.
Static, dynamic, and automatic variables
The solution is automatic variables, but for completeness I will also introduce the other two types of storage, if only to explain the naming used here.
We have already seen in the previous paragraph what a static variable is, but let's define it clearly. A static variable is one whose address in memory is fixed (hence the name static: it doesn't ever move). The address for these variables is determined at compile time (I am using a very liberal definition of "compile time" here, let's say, "before the program starts executing"). If the programmer wishes to reuse the same address for two different variables, it is possible, but it is up to the programmer to make sure the two variables will never be in use at the same time. This is way too much information to track for a human brain, and so this approach was quickly abandoned.
The second option is dynamic allocation. In this case, the memory address for a variable is determined while the program is executing. This is typically done with the malloc() function (at least in systems using the C programming language). Here, the programmer asks for a piece of memory large enough to hold their variable. When the variable is not in use anymore, they use another function (free()) to give that chunk of memory back to the system. The memory allocator keeps track of which chunks of memory are available, and, as long as the programmer uses the malloc and free functions correctly, the memory is managed cleanly, and there is no risk to reuse the same chunk of memory for two things.
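In C it looks something like this (the record structure and its fields are made up for the example):

#include <stdlib.h>

struct record {
    char name[32];
    int  value;
};

void example(void)
{
    /* Ask the allocator for a chunk large enough to hold one record. */
    struct record *r = malloc(sizeof *r);
    if (r == NULL)
        return;                 /* the allocator may refuse if memory runs out */

    r->value = 42;
    /* ...use the record... */

    /* Give the chunk back to the system when done with it. */
    free(r);
}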
There are, once again, two problems with this approach. One is about performance, the other about stability and predictability. Let's again start with the first one. Over the years, many approaches were suggested for dynamic memory management. Besides the simple malloc and free, let's mention reference counting and garbage collection. They all suffer from the same problem: a lot of work has to be done "at runtime", while the program is executing, just to keep track of which parts of the memory are or are not in use. The more generic and flexible the memory allocation scheme is, the higher the overhead gets. The algorithms in play are also quite complex and do not lend themselves well to being efficiently optimized by hardware.
The other problem is, of course, more annoying. With the simple malloc/free scheme, there is still a lot of reliance on the programmer to correctly declare what memory they need, and to release it when done. Decades of experience show that programmers will usually not get this right. And since everything is dynamic, this results in bugs that are complex to diagnose and not always predictably reproducible. Dynamic memory is still in use today, but it is too slow and dangerous to use as a basic building block for something as simple as calling a function with parameters.
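To make "not getting it right" concrete, here are two of the classic failure modes, condensed into one made-up function:

#include <stdlib.h>

void buggy(void)
{
    int *p = malloc(sizeof *p);
    free(p);

    *p = 1;     /* use after free: this chunk may already have been handed to someone else */
    free(p);    /* double free: corrupts the allocator's bookkeeping */
}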
This is where automatic variables come in. The name "automatic" derives from the fact that there is no need to explicitly allocate and free these. Their lifetime is automatically managed by the control flow of the program: they are allocated by calling a function, and freed when returning. This is so convenient that, in C, it became the default way to allocate memory for local variables. You can still explicitly use the "auto" keyword when declaring a local variable, but no one bothers to do it, since it is the default in pretty much all C compilers.
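For example, inside a function body these two declarations mean exactly the same thing (at least in traditional C, before C23 gave "auto" a second, type-inference meaning):

void example(void)
{
    auto int spelled_out = 1;   /* the "auto" keyword written explicitly... */
    int      implicit    = 2;   /* ...and the usual form: automatic by default */

    /* Both live on the stack and disappear when example() returns. */
}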
For the implementation, the idea is simply to put these variables on the stack. This is done typically using instructions like PUSH and POP (again, exact instruction naming will vary, and we will talk about specific CPUs later, I promise!). The compiler now takes care of pushing the parameters to a function before calling it, and cleaning up after the function returns.
Let's have a look at how things are organized in memory now (here for a traditional UNIX type memory layout, there are of course other ways to set things up).
I will not get into all the details of why things are in this specific order today. The interesting point here is that both the heap (used for malloc/free) and the stack share the same RAM area (whatever is free after the static code, constants and variables are loaded). The heap gets to use this area from its start, and the stack uses it from the end. The current upper limit of the heap is called the "break" or "brk" (if the heap and the stack ever meet, you are out of memory). This organization makes it possible to efficiently share that block of memory between the two, without having a fixed size for either, and things will sort of auto-adjust depending on how much stack and heap each execution of the program needs. Anyway, the main point here is, to achieve this, we had to turn the stack upside-down. Now, pushing things to the stack will decrement SP, and popping will increment it. This is a bit disturbing at first, and sometimes in the literature you see the stack still represented with the topmost item on the top. But we will see in the next chapters that having the stack upside down allows some neat tricks, so I think it is important to show it the "right" way, however confusing it may be at first. Maybe picture a stack of helium balloons, or some antigravity powered device?
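A quick way to see this layout on a real system is to print a few addresses. On a traditional UNIX-style layout (and ignoring address space randomization and other modern complications), the heap address will be near the bottom of the free area and the stack address near the top:

#include <stdio.h>
#include <stdlib.h>

int a_static_variable;                              /* static data, placed at load time */

int main(void)
{
    int an_automatic_variable;                      /* lives on the stack */
    int *a_dynamic_variable = malloc(sizeof(int));  /* lives on the heap */

    printf("static: %p\n", (void *)&a_static_variable);
    printf("heap:   %p\n", (void *)a_dynamic_variable);
    printf("stack:  %p\n", (void *)&an_automatic_variable);

    free(a_dynamic_variable);
    return 0;
}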
To avoid further confusion, I will avoid using the notion of a stack "top" from now on. I will instead say "head".
Local variables
Whew, I think that's where we end the basics introduction and finally enter the part I wanted to write about in the first place?
I realize that I have not shown a stack diagram in a while. Let's see how it looks now with our complete setup, that is:
- A main function calling a subroutine,
- The subroutine has three parameters,
- The stack grows downwards (towards low memory addresses).
Let's look at it just after the CALL instruction has put the return address on the stack.
We immediately notice a problem: until now, we were only able to access the head of the stack. In this case, that would be the return address, put there by the CALL instruction. But our subroutine will not need the return address until the very end, and surely will use the parameters a lot before reaching that point. Without any other primitives, we would need to store the return address somewhere else in order to access the parameters, which seems pretty annoying.
Fortunately, the stack is stored in random access memory (at least in current computer architectures that we care about here). So, we can also do random access to it. Let's introduce a new notation for that: SP-relative addressing. The idea is very simple: let's use our stack pointer SP as a base pointer and process the stack as if it was an array. This leads to: SP[0] being the return address, SP[1] being the third argument, SP[2] being the second, and... wait, this argument order does not seem very helpful?
Indeed, if the caller function pushes the arguments in that order (first to last), on a stack growing downwards, and we then access the stack as an array, we get the arguments in the reverse order of the array indices. The C language solves this simply by having the caller push the arguments last-to-first instead.
So let's have another look at a bit of pseudocode and the resulting stackframe with fixed argument order:
PUSH arg3
PUSH arg2
PUSH arg1
CALL subroutine
Now let's talk about local variables. Unlike arguments, these are not set by the caller function, but are handled by the subroutine for its own use. Looking at our stack, we see that there is a lot of space just below SP. So let's just put our local variables there. We could access them as SP[-1], SP[-2], etc. But we have to keep in mind that the stack is still... well, a stack. Any function call will use that same space we just set aside for local variables to store the parameters and return address for the called function. And, even more annoyingly, interrupts may happen at any time and also make use of the stack to store their own temporary data. Of course, we don't want either of these to alter our local variables.
The solution is to move the stack pointer downwards and "reserve" some space for the local variables. Typically, upon entering a function, you will see an instruction like SUB SP,nn to do so (if your processor has such an instruction, not all of them do, and achieving the same thing without a dedicated instruction can require some tricky code. We are getting into the details of specific implementations soon, I promise!).
Ok, this looks good. We lost the direct mapping between SP[n] and argument n, but the compiler can compensate for that, as the offset is fixed and depends just on the number of local variables. Or is it?
More fun with local variables
Unfortunately, it is not quite so. There are several cases where the stack pointer has to be moved to different addresses.
Note: it's possible that some of these cases were introduced after the solution presented later in the article was set in place, as they then became trivial to implement. Again, this is not a history of how things developed, just a convenient way to build things up progressively in this article.
Scoped allocations
In the C language it is possible to do something like this:
void subroutine(void)
{
    int aVariable;
    // Some code here...
    {
        int bVariable;
        // Some more code here...
    }
    // Even more code here...
}
Basically, a variable can have a scope (and lifetime) shorter than an entire function. In this situation, it would make sense that the compiler temporarily allocates space for the new variable at the opening of the corresponding block, and releases it at the closing of the block. If implemented strictly in this way, it means the offset to access aVariable would be SP[0] while outside the block, but SP[1] in the block where bVariable exists.
This problem can be avoided, of course, by keeping bVariable allocated from the start to the end of the function, which is what most compilers will do. So let's look at the real problems below instead.
Call frame setup
The call frame setup is what happens when a function needs to call another one. Because of the function arguments, this is a multi-step process: first push the arguments one by one, and then call the function. Each of these PUSH and CALL instructions moves the stack pointer. What if we want to pass the value of a local variable or of one of our own arguments to the called function? The compiler then has to keep track of how many things it already pushed, and make sure to adjust all accesses to local variables and arguments accordingly. Relatively easy job for the compiler, but not much fun for the developer, as they also have to track this when reading the generated code (for example in a debugger). What is SP[1] at one place in the code may well become SP[6] a bit further down, and then change again and again. Wouldn't it be much easier if things stayed where they are?
Variable length arrays
This one is possibly the most annoying and difficult to solve for the compiler. Sometimes, the size of a local variable is not known at compile time. Let's say your function needs a temporary array to perform some work. The size of the array depends on a parameter. Something like this:
void function(int count)
{
    int array[count];
    // Perform some work on the array...
}
Or, if you use an older version of C where variable length arrays are not a thing, a similar approach uses the alloca() function, which just allocates memory on the stack:
#include <alloca.h> // or <malloc.h>, depending on the system

void function(int count)
{
    int* array = alloca(count * sizeof(int));
    // Perform some work on the array...
}
Here, the compiler does not know at compile time how to adjust the references to local variables, since it depends on a runtime parameter. So, we'll need some additional data at least in this case.
Other uses of the stack: register spills
Finally, the compiler may generate PUSH, POP and other stack access instructions for many other reasons. One of them is "register spills": temporarily storing registers to the stack in the middle of a calculation, and getting them back later. This is how the stack is typically used on, for example, the z80 and 8080, where it was not designed for SP-relative addressing.
Introducing frame pointers (finally!)
Looking carefully at how the stack pointer is used, we see that it in fact serves two different roles.
- As a stack pointer: with the typical PUSH and POP operations, as well as CALL and RETURN.
- As an array base pointer: for access to function arguments and local variables.
These two tend to work together (for example: the caller uses PUSH to send arguments, but the callee rather accesses them as an array). But they also work against each other when SP needs to be moved and it offsets all accesses that use it as an array.
If you look at it this way, the solution becomes obvious: let's split these two roles into two separate registers! We will keep SP as the stack pointer, and introduce BP to use as the array base pointer (using the x86 name for the register this time, it is sometimes also called FP for frame pointer).
This requires a bit of extra work at the start and end of each function, to point BP at the right place, and restore it to its previous value when returning to the caller. In general, this will involve saving the previous value of BP to the stack itself. This opens an interesting new way to look at the stack memory: in addition to being a stack and an array, the successive values of BP saved to the stack can be used to traverse it as a linked list! This is used in debuggers in order to print a call stack, but it can also find other uses such as unwinding the call stack in languages that support exceptions.
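As a very rough sketch of that linked-list view, here is what a naive backtrace could look like in C. It assumes the traditional layout where each frame starts with the saved caller BP immediately followed by the return address; the struct and function names are made up, and getting the initial frame pointer would need something non-portable such as GCC's __builtin_frame_address(0). Treat it as an illustration rather than portable code:

#include <stdio.h>

/* One link of the list: the saved caller BP, followed by the return
   address that the CALL instruction pushed. */
struct frame {
    struct frame *saved_bp;
    void         *return_address;
};

void print_backtrace(struct frame *bp)
{
    /* Walk the chain of saved frame pointers; by convention the
       outermost frame usually stores a NULL (0) saved BP. */
    while (bp != NULL) {
        printf("return address: %p\n", bp->return_address);
        bp = bp->saved_bp;
    }
}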
This is where the exact details depend on the exact CPU used, and it will be time to look at it in more detail in the next sections of the article. I will complete it progressively as I come across various schemes, and try to build similar quick-reference cards for each of them. So, please have a look at the next sections for the details.
8086 - Unknown compiler used for Philips PM3580 custom disassembler tool
Function prologue:

PUSH BP             ; Save the caller BP
MOV BP, SP          ; Initialize BP to the current frame
MOV AX, frameSize   ; Allocate space for local variables
CALL stackAdjust
PUSH r              ; As needed if some registers need to be saved
PUSH s              ; As needed if some registers need to be saved

Function epilogue:

POP s               ; Matching what was saved in the prologue
POP r               ; Matching what was saved in the prologue
MOV SP, BP
POP BP
RETN

Call template:

MOV AX, argN
PUSH AX
MOV AX, argM
PUSH AX
...
MOV AX, arg0
PUSH AX
CALL subroutine
ADD SP, nArg        ; Caller is responsible for removing arguments from the stack

Stack adjust helper:

stackAdjust:
POP CX              ; Remove address for stackAdjust from stack
MOV BX, SP
SUB BX, AX          ; Reserve stack space
JB .stackOverflow
CMP BX, stackLimit
JB .stackOverflow
MOV SP, BX          ; Set SP to the adjusted address
JMP CX              ; Return to the caller function
.stackOverflow:
XOR AX, AX
JMP runtimeError    ; Trigger a runtime error 0 and abort the program
The stack adjust routine is a separate function called from the prologue to add bounds checks for the stack. This is optional. In fact, functions from the standard library in this disassembly project use a simpler prologue:
PUSH BP
MOV BP, SP
SUB SP, frameSize   ; Direct change of SP without bounds checks
PUSH r              ; As needed ...
PUSH s              ; As needed ...
Analysis tips
When reading the code it can be difficult to tell at first glance if a particular PUSH instruction is setting up a function argument for a call, or just saving a register for restoring it later. Sometimes, you can look at the instruction after the CALL to get a hint: it will be an ADD SP to remove the arguments from the stack, and its parameter gives you a hint about the total size of the arguments. However, the compiler may also optimize this and make it harder to follow. For example, if the code calls several subroutines in succession, the arguments for several of them may be kept on the stack, until a final ADD SP,xx removes them all at once.
68000 - Unknown compiler used for Philips PM3580 custom disassemblers
Function prologue:

LINK.W A6, #frameSize
MOVE.L r, -(A7)     ; As needed if some registers need to be saved

Function epilogue:

MOVE.L -$4(A6), r
UNLK A6
RTS

Call template:

MOVE.L A0, -(A7)
MOVE.L A1, -(A7)
JSR subroutine
ADDQ.W #$8, A7
The 68000 uses A7 as a stack pointer. It does not enforce a specific register to use as a frame pointer, but A6 is a common choice. The LINK and UNLK instructions do in a single instruction what the 8086 does in three: LINK saves the old frame pointer, sets the new one, and subtracts a constant value from the stack pointer to make space for the stack frame.
The UNLK instruction does the opposite in the function epilogue. Note that, since it resets the stack pointer, restoring the registers is done using A6-relative addresses and doesn't bother to increment the stack pointer when getting the values.
When multiple registers need to be saved, MOVEM instructions can be used. For a big function something like MOVEM.L D2-D7/A2-A5, -(A7) will be typical, with the matching MOVEM.L -$128(A6), D2-D7/A2-A5 in the epilogue. Note that in this case, D0, D1, A0 and A1 are considered "scratch" registers: the called function is free to modify them and the caller does not assume their value will be preserved. So simple functions that only need these four registers may end up with a 1-instruction prologue and 2-instruction epilogue!
The way to call functions is very similar to the 8086: almost the same instructions, just with different names. The 68000 wins because it has an instruction to push a constant directly onto the stack: MOVE.W #$6, -(A7) for example. On the 8086, you have to first load the constant into a register, and then push that register.
Analysis tips
The epilogue does not have any parameters, it is always the same two instructions. This makes the resulting 4-byte sequence (4E 5E 4E 75) easy to identify when looking for function boundaries.
