Stackpocalypse

The definitive brain dump relating overflows, worms, recursion, tail call optimizations, high-level code and assembly.

Last updated on June 9, 2020. Created on June 8, 2020.

Unloading the Dishwasher

Imagine you’re unloading plates from the dishwasher. Brimming with envy and generalized vitriol, you stack plate after plate in the space where the plates go. “That’s where they go, I told you,” you recall telling the bipedal parasites who dwell in your suburban household. But they don’t understand, do they? They don’t understand that the plates go in the plate space, thus you will take it upon yourself to do that which is needed.

You’re reminded that the plate space is a finite space, because there are more plates than normal on this overcast and dreary morning. Apparently your suggestions to “reuse” plates throughout a given day have gone unheard or, more likely, ignored. You observe that your cabinet may not support the sheer number of those plates you diligently cleaned, yet you press on.

If the others were to learn there are times when all the plates cannot be simultaneously stacked, then you've reason to worry: They may decide that few plates should ever be stacked, if any whatsoever.

It’s akin to opening Pandora’s Box. Why stack plates at all? The amalgamation of clean and dirty dishes scattered about the sink and dishwasher would inevitably lead to a holy war of wasted water and detergent. Morning coffee? Forget about it. How would you fill the carafe with water, when the sink and surrounding counter space are occupied by utensils laden with mystery goop?

“I can’t let this happen,” you think, “I must save my family.”

Limited Stack Space

Compilers, interpreters, and even the pitiful souls who connect to Oracle databases are affected by this class of problem too: it strikes whenever functions are called, and especially when those functions are recursive. After all, stack space is limited, just like your cabinet space. By the way, in the context of computing, when I say stack, I implicitly mean either a) the last-in-first-out (LIFO) stack data structure, or b) a call stack, which is a region of memory managed according to said data structure.

The stack data structure can be built atop a linked list, dynamic array, or vector. Its defining operations are push and pop. Push means to add a node or element on top of the stack; pop means to remove the top node from the stack. Thus, adding a plate to a stack of plates is, in effect, a push operation, and removing one is analogous to a pop operation. This all assumes you don’t lift up plates, adding and removing at arbitrary locations other than the top. The stack data structure is lazy like you.
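
If you want to poke at those two operations in Java, java.util.ArrayDeque works perfectly well as a stack; the plate-themed names below are mine, chosen purely for illustration:

import java.util.ArrayDeque;
import java.util.Deque;

public class PlateStack {
  public static void main(String[] args) {
    Deque<String> cabinet = new ArrayDeque<>();

    cabinet.push("dinner plate");       // push: the plate goes on top
    cabinet.push("salad plate");
    cabinet.push("saucer");

    System.out.println(cabinet.pop());  // pop: "saucer" comes off first (LIFO)
    System.out.println(cabinet.pop());  // then "salad plate"
    System.out.println(cabinet.peek()); // "dinner plate" is still on top, untouched
  }
}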

Pushing and popping a stack. Illustration by Maxtremus.

Anybody can understand a stack of plates. Indeed, you know you can only stack plates to a limited height within your cabinet. When a computer program runs out of space to stack its plates, or stack frames, the result is a stack overflow. The call stack lives in a buffer, which is like the space in a cabinet: a limited and temporary region of storage. Here, the buffer stores stack frames, and overrunning it is the overflow.

Consider what would happen if you stored more plates in your cabinet than you should. Something would break, either the plates, cabinet, or both. The same sort of breakage occurs in software, but compilers and interpreters can optimize code to prevent it.

Function Calls, Stack Frames

Most code amounts to a function A calling a function B, which calls a function C, and intermediate state may need to be stored for each call.

(... A(...) {
  return (... B(...) {
    return (... C(...) {
      return ...
    })()
  })()
})()

For the above pseudocode, the best-case scenario is that a compiler does not need to allocate memory for any stack frames whatsoever, as determined by several clever algorithms. An average-case scenario might be that only one stack frame is needed thanks to register allocation, or that frame sizes are dynamically shrunk to match what each call actually needs. But what’s the worst-case scenario for a C-like call stack?

  1. A static standard-size frame is pushed to the stack for A, which includes local variables, a stack of operands, constants, and references or pointers.
  2. A frame is pushed to the stack for B.
  3. A frame is pushed to the stack for C.
  4. C’s frame is popped off the stack.
  5. B’s frame is popped off the stack.
  6. A’s frame is popped off the stack.

The worst-case in this typical scenario is just three stack frames, but with recursion it can be far more.
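
If you want to see those frames with your own eyes, a stack trace will happily enumerate them. A quick sketch (the class and method names are mine):

public class CallChain {
  static void a() { b(); }
  static void b() { c(); }

  static void c() {
    // At this moment the live frames are, from the top: c, b, a, main,
    // plus a frame or two of JVM plumbing.
    for (StackTraceElement frame : Thread.currentThread().getStackTrace()) {
      System.out.println(frame);
    }
  }

  public static void main(String[] args) {
    a();
  }
}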

Stack Versus Heap

Obviously the allocation of three stack frames is not a big deal. That’s just as trivial as stacking some plates in a cabinet. But if you have too many plates, why, you may have to stack them on the kitchen floor. That encroaches on the space you usually use for an entirely different purpose. The floor is supposed to contain cats and random puddles of water for you to step on with your socks — not plates.

“Three Worlds” by M. C. Escher. Lithograph, 1955.

The computing analogue for the place where the call stack should not go is the heap, which is where dynamically allocated objects (global as far as a process or thread is concerned) are stored, irrespective of whatever functions are running at any given time. Confusingly, it shares a name with the heap data structure, an offshoot of an abstract data type (ADT) called the tree, which may be pictured like a tree in nature, except upside-down because computer scientists dig M.C. Escher’s art, such as his iconic Three Worlds print. The memory heap, however, is just a big unstructured pool of storage; the two heaps are otherwise unrelated.
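
In Java terms, and ignoring JIT tricks like escape analysis, local variables and references live in the current stack frame, while objects created with new live on the heap and can outlive the frame that created them. A tiny sketch, with names of my own choosing:

public class StackVersusHeapDemo {
  static int[] makePlates() {
    int count = 12;          // primitive local: stored in makePlates' stack frame
    return new int[count];   // the array object itself is allocated on the heap
  }                          // the frame dies here, but the array survives

  public static void main(String[] args) {
    int[] plates = makePlates(); // only the reference sits in main's frame
    System.out.println(plates.length);
  }
}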

Overflowing & You

Sometimes the stack doesn’t want to share anymore, and it’s generally your fault.

In C, blowing the stack means the program tries to use memory beyond the region reserved for the call stack, and the illegal access oftentimes results in a segmentation fault (segfault). Java’s different on this matter. It’s doubtful you’ll encounter a segfault in Java unless you really try, because the Java Virtual Machine (JVM) manages pre-determined stack sizes for safety; although, the official specification permits stack dynamism, opening up the possibility for an OutOfMemoryError from an out-of-control call stack. All things considered, bugs in Java itself, or weirdness on a system running the JVM, could result in a segfault.

Let me stress that overflows, whether of the whole stack or of a single buffer sitting on it, are a major problem. They’re more than a minor annoyance to end-users. Some of the most infamous security exploits took advantage of the buffer overflow. See the following worms: Morris, Code Red, Blaster, and SQL Slammer. In May of 2015, there was the NetUSB stack buffer overflow vulnerability that left millions of routers at risk. While this is predominantly an issue for native code, especially old code written in C, consider a modern, public API, even one running on the memory-managed Java platform. Without rate limiting, a disgruntled citizen of the world could easily target an endpoint suspected of executing stack-eating methods, and pull off a textbook DoS attack.

It’s been my opinion for a long time that recursion is the best way to cause a stack overflow; see in Java:

public class InfiniteRecursion {
  public static void recurse() {
    recurse();
  }
}

Obviously there is no programmatic condition defined in recurse for it to terminate itself, so assuming recurse is called, the JVM inescapably steps in after the thread’s stack space is exhausted to announce there’s been a StackOverflowError.
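
How quickly that happens depends on the thread’s stack size, which you can shrink or grow with the JVM’s standard -Xss flag when launching a program; for example (the class name below is just a placeholder):

java -Xss256k SomeClassWhoseMainRecurses

Smaller stacks overflow sooner; bigger ones merely postpone the inevitable.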

Now, I’m being unfair. Some, citing "elegance," claim all kinds of traversals should be recursive. Still, while recursion is a wondrous construction of mathematical beauty, it can be noticeably detrimental to end-users in software. Wanna see some terrible Java illustrating the physical limits of recursion? Of course you do.

public class AtrociousStackProfiler {
  private static int recurse(int n) {
    if (n < 1) return n;
    return recurse(n - 1);
  }

  public static void main(String... args) {
    int i = 0;

    for (;;) {
      try {
        recurse(++i);
      } catch (StackOverflowError e) {
        break;
      }
    }

    System.out.println("Performed " + --i + " recursive calls prior to a StackOverflowError.");
  }
}

This code recurses until it causes a stack overflow, and it prints the number of calls it took to do so. The beauty of the code is anyone can run and abhor it. There are a couple variables in there without semantically meaningful names, just because. Note the try...catch in the loop. Oh, and did you see the use of System.out.println with non-performant string concatenation, rather than using StringBuilder or a proper logging framework? Please, feel free to plagiarize this code and make it even worse.

Upon running this digital detritus, my desktop computer’s allocated thread-per-run averaged ~62,750 calls prior to death. Naturally, I wouldn’t jump to the conclusion that there were necessarily ~62,750 active stack frames. The number of active stack frames ultimately depends on the Java bytecode compiler, and JVM optimization algorithms. But it is plausible to me, especially since we’re talking about Java and the JVM here (I mean, you know), that nothing of particular interest was optimized: the JVM performs no tail call elimination, so a recursion that reaches depth n really does hold n frames at once. Note I used the default stack configuration during my “experiment”; the JVM has no stand-alone “frame size” setting, only a per-thread stack size (-Xss), which defaults to somewhere around 512KiB to 1MiB on desktop-class JVMs.

~1MiB / ~62,750 frames ≈ ~16 bytes per frame

Sixteen-ish bytes is about the least a frame for recurse could occupy (a return address and not much else), so the arithmetic is believable. My remaining ~8GiB of memory, out of 16GiB total, was never in danger, since a single thread’s stack cannot grow past -Xss no matter how hard I abuse it; apparently other production apps take what they can get regardless. Maybe my horrendous excuse for a stack frame profiler isn’t so bad after all: I have sufficient reason to conclude I did in fact use ~62,750 stack frames, one-to-one with recursive calls. I feel sinful.

Tail Calls & Recursion

The clever (and troubled) programmer will have noticed the recurse method is tail-recursive, meaning its last operation is a recursive method call, which implies it was ripe for optimization at compile- or run-time. Putting things as plainly as I can, “last operation” does not mean last on the right-hand side; it literally means the last operation executed. Example: return 12 + recurse(--n); does not qualify, because the addition still has to happen after the call returns.
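
To make the distinction concrete, here’s a pair of toy methods (mine, not from the earlier example) sitting side by side:

public class TailOrNot {
  // Tail-recursive: the recursive call is the very last thing this method does,
  // so its frame could, in principle, be reused.
  static int countDown(int n) {
    if (n < 1) return n;
    return countDown(n - 1);
  }

  // Not tail-recursive: after the recursive call returns, there is still an
  // addition to perform, so the caller's frame has to stick around.
  static int sumDown(int n) {
    if (n < 1) return n;
    return n + sumDown(n - 1);
  }
}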

If recurse were optimized in all its tail-recursive glory, then I would not expect a stack overflow complaint. In fact, I’d expect the program to run smoothly until I forcibly shut down my computer during an update like I normally do, or the power goes out in a thunderstorm like it usually does. In case you’re wondering why the program should run indefinitely: without an overflow, the catch never fires, so the for (;;) loop in main never breaks.

Whenever a function’s last operation is a function call (including the possibility of a call to itself), that’s a tail call. Probably not the tail call you wanted, but beggars can’t be choosers. Tail calls can be optimized in various ways, collectively called tail call optimizations. A tail-recursive call is a type of tail call with an astounding optimization available: reducing 90 bajillion stack frames to one, as in O(N) to O(1) auxiliary space, which is a form of tail call elimination (TCE). Sounds helpful, doesn’t it? Well, it is. And guess what? Java and the JVM do not support TCE, but that’s okay, because you can perform the optimization yourself.

Compile-Time Optimization

Many popular languages do not optimize tail-recursive calls at either compile- or run-time to save auxiliary space, but you have the power to do it in your code. JVM-based languages such as Kotlin and Scala support this form of TCE at compile-time; the programmer can request or enforce it via Kotlin’s tailrec modifier and Scala’s @tailrec annotation.

Scheme goes further and requires proper tail calls by design, and a proposal has been in the works for WebAssembly. Recursive functions or not, there are currently two major strategies for TCE at our immediate disposal:

  1. Trampolines, where a function returns a continuation object along with some kind of indicator of completion, and a small driver loop keeps bouncing on it until it finishes (it’s called trampolining because the call stack repeatedly goes up and comes back down, rather than climbing indefinitely); a Java sketch follows below the photo.
  2. Imperative (or impure) functional objects (IFOs), where a function is converted into a mutable object.

My sister and I on our trampoline in Arlington, Washington. December 27th, 1999.
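
Before comparing the two strategies, here’s a minimal trampoline sketch in Java; the Trampoline interface and its done/more/run helpers are names I made up for illustration, not a standard API:

import java.util.function.Supplier;

public class TrampolineExample {
  // Either a finished computation (isDone) or a deferred next step (next).
  interface Trampoline<T> {
    T result();
    default boolean isDone() { return true; }
    default Trampoline<T> next() { throw new UnsupportedOperationException(); }

    static <T> Trampoline<T> done(T value) {
      return () -> value; // the lambda implements result()
    }

    static <T> Trampoline<T> more(Supplier<Trampoline<T>> step) {
      return new Trampoline<T>() {
        @Override public boolean isDone() { return false; }
        @Override public T result() { throw new IllegalStateException("not done yet"); }
        @Override public Trampoline<T> next() { return step.get(); }
      };
    }
  }

  // The "bouncing" loop: constant stack depth no matter how deep the recursion goes.
  static <T> T run(Trampoline<T> trampoline) {
    while (!trampoline.isDone()) {
      trampoline = trampoline.next();
    }
    return trampoline.result();
  }

  // The recursive countdown, rephrased: each "call" is deferred inside a Supplier.
  static Trampoline<Integer> recurse(int n) {
    if (n < 1) return Trampoline.done(n);
    return Trampoline.more(() -> recurse(n - 1));
  }

  public static void main(String[] args) {
    // Far deeper than the ~62,750 frames that killed the naive version.
    System.out.println(run(recurse(10_000_000)));
  }
}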

Trampolines are fun, but IFOs are more efficient as determined by Tauber et al. in 2015. Furthermore, Madsen, Zarifi, and Lhoták arrived at the same conclusion in 2018. I suppose because the best solution involves an impure, global (class-wise) object, it may not be ideologically palatable for all of us, but sometimes we have to lower our standards.

Here’s an IFO implementation:

public class IfoExample {
  private static Recurse recurse = new Recurse();

  private static class Recurse {
    // long, not int: 3,735,928,559 overflows Java's 32-bit int range.
    private long argument, result;

    public boolean apply() {
      if (argument < 1) {
        result = argument;
        return false;
      }

      result = argument - 1;
      return true;
    }
  }

  public static void main(String... args) {
    recurse.argument = 0xDEADBEEFL; // 3,735,928,559
    recurse.apply();

    long calls = 1;

    do {
      recurse.argument = recurse.result; // feed the result back in as the next argument
      calls++;
    } while (recurse.apply());

    System.out.println("Performed " + calls + " \"recursive\" calls.");
  }
}

Instead of a recurse method, there is a Recurse inner class definition. Its apply method is practically equivalent to recurse in the last code example. Recurse is instantiated as recurse (just to make this as confusing as possible) and managed by the caller, which happens to be main. The main method seeds recurse.argument with a magic number, and recurse.apply is called once to derive a result from the seeded value. Keep in mind recurse.argument is effectively the same as n from the prior example.

Finally, there is a feedback loop of setting recurse.argument with the prior recurse.result set by recurse.apply until the “base case” is reached (false is returned since there’s no reason to continue).

A StackOverflowError does not need to be caught in this demonstrational (and variant) IFO implementation. Beyond that, I initialized recurse.argument to 3,735,928,559 (DEADBEEF in hexadecimal) as an advertisement of the IFO’s capabilities: it processes vastly beyond the ~62,750-call limit from earlier, suggesting O(1) auxiliary space. It tends to be faster as well, because no cycles are spent pushing and popping stack frames. Run the code yourself. If you’re feeling adventurous, a less arbitrary and more useful application would be to explore a tree with hundreds of thousands of vertices.

Another important qualification: the IFO implementation above is not thread-safe. Recognize, also, that each thread owns its own call stack. Madsen et al. introduced a “contextual” IFO that is thread-safe, performs better than the trampoline, and in some cases even performs better than the original gangster, single-threaded IFO. The thrilling implementation details involve passing a context object to apply.
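
I won’t pretend to reproduce their implementation, but the general shape is easy to imagine: move the mutable argument/result pair out of shared static state and into a context object that each caller (and therefore each thread) owns and passes to apply. A hypothetical sketch of my own, not Madsen et al.’s code:

public class ContextualIfoSketch {
  // Per-call mutable state; each thread works with its own instance.
  static final class Context {
    long argument, result;
  }

  // The "function" is stateless now, so one instance can be shared across threads.
  static final class Recurse {
    boolean apply(Context ctx) {
      if (ctx.argument < 1) {
        ctx.result = ctx.argument;
        return false;
      }
      ctx.result = ctx.argument - 1;
      return true;
    }
  }

  public static void main(String[] args) {
    Recurse recurse = new Recurse();
    Context ctx = new Context(); // a second thread would simply build its own Context
    ctx.argument = 1_000_000L;

    recurse.apply(ctx);
    long calls = 1;
    do {
      ctx.argument = ctx.result;
      calls++;
    } while (recurse.apply(ctx));

    System.out.println("Performed " + calls + " \"recursive\" calls without shared state.");
  }
}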

Runtime Optimization

The runtime or execution optimization results from the compile-time code changes already discussed, among less widely used alternatives. Machine code will be executed either way, often proxied by human-readable assembly language. Importantly, machine code and assembly have direct access to processor registers. These are locations that may contain data values and/or memory addresses. Register reads and writes are significantly faster than reads and writes to primary storage, and absurdly faster than to secondary storage. Operating systems tend to abstract both of these with virtual storage, but, no matter what, storage is finite. Fortunately, the x86-64 architecture introduced more registers than x86 has to offer: sixteen 64-bit general-purpose registers instead of eight 32-bit ones. Today, there are a lot of registers with a lot of bits.

The refined C compiler will use those registers until it runs out of them, and then it will spill to the stack. Stack memory ultimately lives in random-access memory (RAM), but the processor keeps recently used data in its caches for celerity, falling back to RAM proper on a cache miss, with the memory-management unit (MMU) translating virtual addresses to physical ones along the way, and the operating system paging out to disk if RAM itself runs short. Basically, you don’t need to flip bits anywhere other than registers, until you do. Despite having more than enough registers to do the job, pretend like you don’t in this contrived nasm example:

;run on Linux with: nasm -felf64 contrived.asm && ld contrived.o && ./a.out

global _start

section .text

_start:
  mov rax, 0xDEADBEEF
  call recurse
  call exit

recurse:
  cmp rax, 0
  jg eatTheStack   ; keep recursing while the counter is positive
  ret

eatTheStack:
  push rax         ; each level leaves a pushed rax plus call's return address behind
  dec rax
  call recurse     ; no tail call elimination here: the stack only grows
  pop rax
  ret

exit:
  mov rax, 60      ; sys_exit
  xor rdi, rdi     ; exit status 0
  syscall

A naughty compiler would produce something like that. When does a basic recursive decrementation function compile into a stack-eating monstrosity that causes a segmentation fault? When there’s no recursive tail call elimination, of course.

With tail call elimination, the output is simpler:

;run on Linux with: nasm -felf64 not-recursion.asm && ld not-recursion.o && ./a.out

global _start

section .text

_start:
  mov rax, 0xDEADBEEF
  call recurse
  call exit

recurse:
  dec rax
  cmp rax, 0
  jg recurse       ; loop via a conditional jump: no new frames, no pushes
  ret

exit:
  mov rax, 60      ; sys_exit
  xor rdi, rdi     ; exit status 0
  syscall

Note the control flow mechanism called the jump. It means exactly what it says, jumping from one place to another, and it comes in two flavors: unconditional and conditional. Unconditional is easy to remember; there’s one instruction for it: jmp. Conditional is more complicated, because it requires a combination of instructions. First, test or cmp compares two operands and sets processor flags, and second, well, uhh, one of a deluge of conditional jump instructions (je, jne, jg, jge, jl, jle, ja, jb, and friends) acts on those flags.

Long story short, rather than eating the stack, I chose to repeatedly decrement the same register, rax, looping with jg (jump if greater) instead of haphazardly pushing and popping with recursive calls. This is faster, conserves space, and there are fewer instructions altogether.

Conclusion

You need to be aware that recursion can blow up the call stack. If your language or platform of choice does not support TCE, the bounds of your stack usage are unknown, and you’re recursing, then you can implement a trampoline or IFO in your user code.

Or you could just write iterative code instead, lol.

© Reese Schultz

My code is released under the MIT license.