Daniel Lemire’s post about data latency in modern CPUs got me thinking.
It’s been common knowledge for a long time now that computation latency is rarely a bottleneck. Most frequently, optimization in the modern world is about data — getting it to the right place at the right time.
Still, one of the more complex parts of modern compilers is register allocation. It’s a problem that’s both hard and interesting to solve for compiler architects, including myself: How do we ensure that as much of the data as possible is available in registers, so we store as little as possible on the stack, thereby avoiding memory access?
Yet the latency between memory that’s in L1 cache (as is mostly the case with the stack, since it’s so frequently accessed) and registers is quite negligible compared with the latency of accessing main memory, or even L2 cache (see Every Programmer Should Know These Latency Numbers).
So it’s clear that the most important optimizations are not the micro-fiddling with instruction order and register allocation. It’s memory access planning.
Compilers do a non-trivial amount of work to accommodate this reality. GCC’s Graphite is an attempt to solve this problem, and the results so far are mixed (I was unable to find any hard numbers showing a clear general performance benefit from using Graphite, but I may simply fail at Google-fu). The best any of them can do, however, is guess at the optimal performance characteristics of the cache structure on each CPU.
We’re manually fiddling with registers, and the impact of said fiddling is limited, at least for integer-like arithmetic, while the performance bottleneck is often in suboptimal cache usage patterns. Most current CPU architectures provide extremely limited L1 cache control (often a prefetch mechanism, and sometimes an option to simply flush the cache). This is understandable, because current architectures are designed on the assumption that L1 caches are filled according to the CPU’s guesses and estimates about memory access patterns.
Would it be crazy to suggest that compilers might as well be able to provide the CPU with a much more detailed and accurate picture of the program’s memory access patterns?
The prefetch instruction (as provided by x86-64 and others) is great, but severely underused. This is partly because current programming languages make it extremely difficult to reason about memory access patterns without explicit cues from the programmer (C, C++, I’m looking at you). New programming languages could take this aspect into account, for instance by allowing their type systems to specify usage as well as layout (“Alright, so it’s an array, but are you going to access it randomly or linearly?”).
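To make that concrete, here’s a minimal sketch of what an explicit prefetch cue looks like today with GCC/Clang’s `__builtin_prefetch`. The 16-element lookahead distance is an arbitrary assumption for illustration, not a tuned value:

```c
#include <stddef.h>

/* Sum an array linearly, hinting the next chunk into cache as we go.
   __builtin_prefetch(addr, 0, 3) requests a read prefetch with high
   temporal locality; on x86-64 it typically compiles to PREFETCHT0. */
long sum_linear(const long *data, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)                        /* assumed lookahead distance */
            __builtin_prefetch(&data[i + 16], 0, 3);
        total += data[i];
    }
    return total;
}
```

A compiler that could infer “this array is accessed linearly” from the type system could insert exactly this kind of hint without the programmer spelling it out.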
It may simply have eluded me, but I’ve never seen a prefetchw instruction in the disassembly of optimized code in the wild. The prefetchw instruction is used to prefetch memory that the program expects to write to.
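For comparison, a write-intent prefetch can be requested through the same builtin by passing 1 as the read/write argument. Whether this actually emits prefetchw depends on compiler flags and the target CPU (GCC needs e.g. -mprfchw), so treat this as a sketch:

```c
#include <stddef.h>

/* Increment every element, hinting that the upcoming lines will be
   written. With the rw argument set to 1, GCC/Clang may emit PREFETCHW
   on targets that support it, fetching the line ready for modification
   instead of fetching it for reading and upgrading it later. */
void increment_all(long *data, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)                        /* assumed lookahead distance */
            __builtin_prefetch(&data[i + 16], 1, 3);
        data[i] += 1;
    }
}
```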
Every time the operating system decides that it’s time to run a different process, the caches are effectively invalidated (on account of the virtual memory system). Future architectures could allow the compiler to insert hints in the code about when this should take place. Consider a fictional yield instruction that checks a privileged flag set by the OS and, if it is set, triggers an immediate process switch at that convenient point, reducing the chance of a preemptive switch landing in the middle of performance-sensitive code. This would allow OS schedulers to be smart about their job, without forcing full-on cooperative multitasking.
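Since no such instruction exists, the best I can do is model the idea in plain C. Here `os_wants_switch` stands in for the privileged flag and `yield_now()` for the fictional instruction; both names are made up, and a real implementation would trap into the scheduler rather than flip a variable:

```c
#include <stdbool.h>

static volatile bool os_wants_switch = false;  /* stand-in for the OS-set flag */
static int switches_taken = 0;

static void yield_now(void) {   /* stand-in for the fictional instruction */
    switches_taken++;           /* really: hand control to the scheduler  */
    os_wants_switch = false;
}

/* Check the flag only between chunks, where little state is live and
   being preempted is cheap; inside a chunk we run uninterrupted. */
double sum_in_chunks(const double *data, int n, int chunk) {
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        if (i % chunk == 0 && os_wants_switch)
            yield_now();
        total += data[i];
    }
    return total;
}
```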
Finally, realizing that L1 access is very fast and registers are limited, compilers could defer stack usage in favor of a reserved area of program memory for temporary values, one that has a very high likelihood of always being in cache. Current compilers use the stack to store everything that doesn’t fit in registers (“register spill”), and that’s sensible enough, but the stack is function-local. For temporaries that don’t need to persist past calls to other functions, some performance could be gained by avoiding cache misses on stack memory that isn’t going to be used after the function returns anyway. In other words, the compiler could allocate a bunch of “virtual registers”, residing in a program section about the size of an average L1 cache (32–128 KB), and use them as if they were caller-preserved registers. I’m not sure whether any research has been done on this, but I suspect it could provide at least a marginal speedup.
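A rough user-space approximation of what such a “virtual register” area might look like (the 32 KiB size is an assumption matching a typical L1 data cache, and the allocator is made up purely for illustration):

```c
#include <stddef.h>
#include <stdint.h>

#define SCRATCH_BYTES (32 * 1024)   /* assumed: roughly one L1 data cache */
static uint8_t scratch[SCRATCH_BYTES];
static size_t scratch_top = 0;

/* Hand out a slice of the hot scratch area for a short-lived temporary.
   There is no free: like caller-preserved registers, the contents are
   considered dead once the caller is done with them, and the space is
   simply reused, wrapping around (and clobbering old temporaries) when
   the area fills up. Because the whole program funnels its temporaries
   through these few lines, they should stay resident in L1. */
void *scratch_alloc(size_t bytes) {
    if (scratch_top + bytes > SCRATCH_BYTES)
        scratch_top = 0;
    void *p = &scratch[scratch_top];
    scratch_top += bytes;
    return p;
}
```

A real compiler would of course assign slots statically rather than bump a pointer at runtime; the point is only that the temporaries all live in one small, perpetually hot region instead of scattered stack frames.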