Contiguous stacks in Go
Lately I’ve been messing with Go and I really like it. The 1.3 release is due in June 2014 and among other performance improvements, the runtime will implement contiguous stacks. Let’s see what this is about.
The 1.2 runtime uses segmented stacks, also known as split stacks.
In this approach, stacks are discontiguous and grow incrementally.
Each stack starts with a single segment. When the stack needs to grow, another segment is allocated and linked to the previous one, and so forth. Each stack is effectively a doubly linked list of one or more segments.
The advantage of this approach is that stacks can start small and grow or shrink as needed. For example they start at 4kb in 1.1 and 8kb in 1.2.
However a certain issue arises.
Imagine a function call happening when the stack is almost full. The call will force the stack to grow and a new segment will be allocated. When that function returns, the segment will be freed and the stack will shrink again.
Now imagine that call happening very frequently:
The call to
big() will cause a new segment to be allocated, which will be freed upon return. Only this will happen repeatedly since we are inside the loop.
In such cases where the stack boundary happens to fall in a tight loop, the overhead of creating and destroying segments repeatedly becomes significant. This is called the “hot split” problem inside the Go community. Rust folks call it “stack thrashing”.
The “hot split” will be addressed in Go 1.3 by making the stacks contiguous.
Now when a stack needs to grow, instead of allocating a new segment the runtime will:
- create a new, somewhat larger stack
- copy the contents of the old stack to the new stack
- re-adjust every copied pointer to point to the new addresses
- destroy the old stack
What’s interesting here is the re-adjusting part.
As it turns out, that part heavily depends on the compiler’s escape analysis which provides us with an important invariant: the only pointers to data in a stack, are in that stack themselves (there are some exceptions though). If any pointer escapes (eg. the pointer is returned to the caller) it means that the pointed data are allocated on the heap instead.
There is a certain challenge though. The 1.2 runtime doesn’t know if a pointer-sized word in the stack is an actual pointer or not. It is possible that there are floats and most rarely integers that if interpreted as pointers, would actually point to data.
Due to the lack of such knowledge the garbage collector has to conservatively consider all the locations in the stack frames to be GC roots. This opens the possibility for memory leaks, especially on 32-bit architectures since their address pool is much smaller.
When copying stacks however, such cases have to be avoided and only real pointers should be taken into account when re-adjusting.
Work was done though and information about live stack pointers is now embedded in the binaries and is available to the runtime. This means not only that re-adjusting all stack pointers is now possible, but also that the collector in 1.3 can precisely collect stack data.
The initial stack size will be conservatively set to 4kb in 1.3 but will probably be further reduced in 1.4. As for the shrinking strategy, when the collector runs, stacks using less than 1/4 of their size will be shrinked by half.
Even though contiguous stacks cause a trivial amount of memory fragmentation, benchmarks comparing both implementations using
html/template show some nice performance improvements.