Cache Lines Are The New Registers
Daniel Lemire’s post about data latency in modern CPUs
got me thinking.
It’s been common knowledge for a long time now that computation latency is rarely
a bottleneck. Most frequently, optimization in the modern world is about data —
getting it to the right place at the right time.
Still, one of the more complex parts of modern compilers is register allocation.
It’s a problem that’s both hard and interesting to solve for compiler architects,
including myself: How do we ensure that as much of the data as possible is available
in registers, so we store as little as possible on the stack, thereby avoiding
unnecessary loads and stores?
Yet the latency difference between registers and memory in the L1 cache (which
is mostly the case for the stack, since it's so frequently accessed) is quite
negligible compared with the latency of accessing main memory, or even the L2
cache (see Every Programmer Should Know These Latency Numbers).
So it’s clear that the most important optimization is not the micro-fiddling
with instruction order and register allocation. It’s memory access planning.
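To make the point concrete, here is a small illustrative sketch (not from any particular compiler's output): two functions that sum the same matrix, identical in arithmetic and instruction count, differing only in traversal order and therefore in cache behavior.

```cpp
#include <cstddef>
#include <vector>

// Two ways to sum the same N x N matrix. sum_rows walks memory linearly,
// touching each cache line once; sum_cols strides N elements per step, so for
// large N nearly every access lands on a different cache line.
constexpr std::size_t N = 1024;

long sum_rows(const std::vector<long>& m) {
    long s = 0;
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            s += m[i * N + j];   // consecutive addresses: cache-friendly
    return s;
}

long sum_cols(const std::vector<long>& m) {
    long s = 0;
    for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < N; ++i)
            s += m[i * N + j];   // stride of N * sizeof(long) bytes per step
    return s;
}
```

No amount of register allocation cleverness closes the gap between these two; only changing the memory access plan does.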
Compilers do a non-trivial amount of work to accommodate this reality. GCC’s Graphite
is an attempt to solve this problem, and the results are mixed so far (I was
unable to find any hard numbers showing a clear general performance benefit in
using Graphite, but I may simply fail at Google-fu). The best any of them can do,
however, is to make guesses as to what the optimal performance characteristics
of the cache structure on each CPU is going to be.
Isn’t That Kind Of Dumb?
We’re manually fiddling with registers, and the impact of said fiddling is limited,
at least for integer-like arithmetics, while the performance bottleneck is
often in suboptimal cache usage patterns. Most current CPU architectures provide
extremely limited L1 cache control (often a prefetch mechanism, and sometimes
an option to simply flush the cache). This is understandable, because current
architectures are designed with the assumption that L1 caches are filled
according to the CPU’s guesses and estimates about memory access patterns.
Would it be crazy to suggest that compilers might as well be able to provide
the CPU with a much more detailed and accurate picture of the program’s memory
access patterns?
Observations and Ideas For CPU and Compiler Architects
The prefetch instruction (as provided by x86-64 and others) is great, but
severely underused. This is partly due to the fact that current programming languages
make it extremely difficult to reason about memory access patterns without explicit
cues from the programmer (C, C++, I’m looking at you). New programming languages
could take this aspect into account, for instance by allowing their type systems
to specify usage as well as layout (“Alright, so it’s an array, but are you going
to access it randomly or linearly?”).
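As a rough illustration of what such a usage annotation could buy, here is a sketch where a wrapper type promises linear access, and the library (standing in for a smarter compiler) exploits that promise by prefetching a fixed distance ahead. The names `linear_view` and `PREFETCH_DISTANCE` are inventions for this sketch, not an existing API.

```cpp
#include <cstddef>

// Hypothetical: a view type that encodes "this array will be read linearly".
// Because the access pattern is part of the type, iteration can issue
// software prefetches ahead of the current position.
template <typename T>
struct linear_view {
    const T* data;
    std::size_t size;

    template <typename F>
    void for_each(F f) const {
        constexpr std::size_t PREFETCH_DISTANCE = 16;  // elements ahead (tunable)
        for (std::size_t i = 0; i < size; ++i) {
            if (i + PREFETCH_DISTANCE < size)
                // GCC/Clang builtin: read prefetch, high temporal locality.
                __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
            f(data[i]);
        }
    }
};
```

A compiler with this knowledge in the type system could do the same transformation automatically, and pick the prefetch distance per target CPU.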
It may simply have eluded me, but I’ve never seen a
prefetchw instruction in
disassembly of optimized code in the wild. The
prefetchw instruction is used
to prefetch memory that the program expects to write to.
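For reference, GCC and Clang expose write-prefetching portably through `__builtin_prefetch` with its second argument set to 1; on x86-64 (with the PRFCHW feature enabled) this can lower to the prefetchw instruction. A minimal sketch of the intended usage pattern:

```cpp
#include <cstddef>

// Prefetch-for-write: pull each upcoming cache line into the cache in a
// writable state shortly before the store that modifies it, so the write
// doesn't stall on ownership of the line. Assumes 64-byte cache lines.
void scale_in_place(long* v, std::size_t n, long factor) {
    constexpr std::size_t LINE = 64 / sizeof(long);  // one cache line ahead
    for (std::size_t i = 0; i < n; ++i) {
        if (i + LINE < n)
            __builtin_prefetch(&v[i + LINE], 1, 1);  // rw=1: prefetch for write
        v[i] *= factor;  // the write the prefetch anticipated
    }
}
```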
Every time the operating system decides that it’s time to run a different
process, the caches are effectively lost to the old process: the TLB is
typically flushed along with the address-space switch, and the new process’s
working set quickly evicts the old cache contents.
Future architectures could allow the compiler
to insert hints in the code about when this should take place. Consider a fictional
yield instruction, which checks a privileged flag set by the OS and, if the flag
is set, causes an immediate process switch. This reduces the chance of a
preemptive switch landing in the middle of
performance-sensitive code, and would allow OS schedulers to be smart about
their job, without forcing full-on cooperative multitasking.
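The intent of such an instruction can be approximated in user space. In the sketch below, an atomic flag stands in for the privileged flag and `std::this_thread::yield()` stands in for the hardware-assisted process switch; all names are illustrative, not a real OS interface.

```cpp
#include <atomic>
#include <thread>

// "Scheduler" side: something (another thread, a timer) sets this flag when
// it would like the hot loop to give up the CPU soon.
std::atomic<bool> switch_requested{false};

long hot_loop(long iterations) {
    long acc = 0;
    for (long i = 0; i < iterations; ++i) {
        acc += i;
        // A safe point: no live temporaries besides acc, so a switch here
        // loses little cached state. Poll only occasionally to keep the
        // check cheap.
        if ((i & 0xFFF) == 0 && switch_requested.exchange(false))
            std::this_thread::yield();  // cooperatively give up the CPU
    }
    return acc;
}
```

A hardware yield instruction would make the check essentially free and let the compiler place it at the points where the least live state would be lost.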
Finally, realizing that L1 access is very fast and registers are limited,
compilers could defer stack usage in favor of a reserved area of program memory
for temporary values, one that has a very high likelihood of always being in cache.
Current compilers use the stack to store everything that doesn’t fit in registers
(“register spill”), and that’s sensible enough, but the stack is function-local. For
temporaries that don’t need to persist past calls to other functions, some
performance could be gained by avoiding cache misses on stack memory that isn’t
going to be used after the function returns anyway. In other words, the compiler
may allocate a bunch of “virtual registers”, residing in a program section of
about the size of an average L1 cache (32-128K), that it may use as if they were
caller-preserved registers. I’m not sure whether any research has been done into this,
but I suspect it may provide at least a marginal speedup.
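A minimal sketch of what such a scratch area might look like, assuming 64-byte cache lines and a 32 KiB L1 data cache; the names and the manual use of the area are inventions for illustration, since in the real proposal the compiler would do this transparently for spilled values.

```cpp
#include <cstddef>

// A small, fixed, cache-line-aligned scratch area reused by every function
// for short-lived temporaries, instead of spilling them to fresh (possibly
// cold) stack frames. Sized to fit comfortably in a typical L1 data cache.
// thread_local because, like registers, it must not be shared across threads.
alignas(64) static thread_local long scratch[32 * 1024 / sizeof(long)];

long sum_of_squares(const long* v, std::size_t n) {
    long* vr = scratch;  // "virtual registers", hopefully always cache-hot
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        vr[0] = v[i];            // virtual register 0
        vr[1] = vr[0] * vr[0];   // virtual register 1
        total += vr[1];
    }
    return total;
}
```

Because the area is caller-preserved by convention, a function could only rely on it between calls, which matches exactly the lifetime of the temporaries described above.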