04 6 / 2013

In 1998, AMD introduced the 3DNow! instruction set. It is an extension to the already existing MMX instruction set, which only supports SIMD integer operations. With 3DNow!, programmers could now operate on two single-precision IEEE-754 floating-point numbers at once, enhancing the performance of many common calculations in 3D graphics over the old and clunky x87 FPU found in most CPUs at the time. Intel answered a year later with its own SSE instruction set, which performs packed floating-point math in separate 128-bit registers.

Fast-forward 15 years. I’m working on a 2.5D game engine that performs many calculations in 2D space. My platform of choice happens to be OS X, but I was having a very peculiar problem: the event loop would keep getting stuck unless I was manually sending mouse movement or keyboard events. Using Xcode with LLDB, I discovered that my program was indeed hanging somewhere inside Cocoa, in the method that receives UI and I/O events from the operating system ([NSApp nextEventMatchingMask:untilDate:inMode:dequeue:]). That seemed odd for two reasons. The call was being instructed to return immediately if no events were found, and the code resides in Apple’s AppKit.framework, which is used by essentially every UI application on OS X, so a bug in AppKit seemed unlikely.

Inspecting the hanging program in the debugger revealed nothing suspicious. No locks being held, no intervening threads, no nothing. So there was only one way forward: Good ol’ binary search debugging. In the end, I discovered that a call to my floating point math library made the difference between a working and a non-working program. Specifically, this function:

!prettify lang-cpp
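#include <mmintrin.h>  // __m64, _mm_andnot_si64

// SIGNMASK_VEC2 is defined elsewhere; it has the sign bit set in both 32-bit lanes.
inline __m64 abs(__m64 v) {
    // (~SIGNMASK_VEC2) & v clears the sign bit of each float.
    return _mm_andnot_si64(SIGNMASK_VEC2, v);
}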

__m64 is the 64-bit vector type defined by MMX. It holds packed integers (for example, two 32-bit or four 16-bit values) or, with 3DNow!, two 32-bit floats. _mm_andnot_si64 is an MMX intrinsic that inverts every bit of its first argument and then performs a bitwise AND of the result with the second argument. In this case, I’m using it to mask out the sign bits of the argument, resulting in a 2-vector in which no element is negative.

Looks innocuous, right? Yet, if I removed the contents of this function, my program started “working” (i.e., a lot of math would be wrong, but the event loop would hum along without incident). Surely, performing a bit of floating-point math should have absolutely no bearing on something as high-level as the application’s event loop?

It turns out that this little 15-year-old intrinsic, along with its friends, causes all kinds of havoc within Apple’s Foundation/Cocoa classes when compiled for x86-64. Unfortunately, these frameworks are black boxes when it comes to debugging, so it is difficult to say exactly why these instructions are problematic, but it seems that executing any instruction that operates on the MMX/3DNow! type __m64 causes weird things to happen. In my case, I’m using SDL, and calling SDL_Init on OS X initializes the relevant Foundation classes. If an __m64 instruction is executed before SDL_Init, a crash occurs deep inside Apple’s frameworks, seemingly due to a corrupt x87 FPU stack, although faulty error reporting is a strong possibility. Replacing the __m64 calculations with the equivalent operations on __m128 vectors seems to fix it.
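For illustration, here is a minimal sketch of what such a replacement could look like for the abs function shown earlier. The name abs_ps and the choice of -0.0f as the sign mask are assumptions made for this sketch, not code taken from the engine. It relies only on SSE intrinsics from xmmintrin.h and never touches __m64:

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_set1_ps, _mm_andnot_ps

// Hypothetical 128-bit counterpart of the __m64 abs() function.
// -0.0f has only the sign bit set, so broadcasting it yields a sign mask.
inline __m128 abs_ps(__m128 v) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    // (~sign_mask) & v clears the sign bit in all four lanes.
    return _mm_andnot_ps(sign_mask, v);
}
```

A 2D vector would simply occupy the two low lanes and leave the upper two unused.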

Now, MMX/3DNow! is officially deprecated on the x86-64 architecture, so it is understandable that problems occur. But there is currently no way to instruct the compiler to avoid these instructions, and they cannot be selectively included because many basic SSE instructions (such as _mm_add_ps, which adds two 4-component float vectors) are defined alongside the problematic intrinsics.
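One documented hazard is worth keeping in mind when __m64 code cannot be avoided: the MMX registers are aliased onto the x87 register stack, and Intel's documentation calls for executing EMMS (exposed as the intrinsic _mm_empty()) after MMX code and before any x87 floating-point code runs. Whether a missing EMMS is what actually upset Apple's frameworks in my case is an assumption; I have not verified it. The helper name abs2_mmx and its array-based interface are hypothetical, chosen only to keep the sketch self-contained:

```cpp
#include <mmintrin.h>  // MMX: __m64, _mm_set1_pi32, _mm_andnot_si64, _mm_empty
#include <cstring>     // std::memcpy

// Hypothetical helper: clears the sign bits of two packed floats with MMX,
// then executes EMMS so that later x87 code finds an empty register stack.
inline void abs2_mmx(const float in[2], float out[2]) {
    // Sign bit set in both 32-bit lanes.
    const __m64 sign_mask = _mm_set1_pi32(static_cast<int>(0x80000000u));

    __m64 v;
    std::memcpy(&v, in, sizeof v);           // load the two floats bit-for-bit

    const __m64 r = _mm_andnot_si64(sign_mask, v);  // (~sign_mask) & v
    std::memcpy(out, &r, sizeof r);          // store the result bit-for-bit

    _mm_empty();  // EMMS: mark the aliased x87 registers as free again
}
```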

The correct solution would be for the compiler to transparently replace all __m64 operations with __m128 operations at compile time, or at least to disallow implicit conversions from 2×float vector types to __m64, which is defined as 2×int. The current version of Clang (clang version 3.2 (tags/RELEASE_32/final)) does neither as of this writing. Even better would be to allow usage of 128-bit SSE while disallowing the use of 64-bit MMX when compiling for x86-64.
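Until the compiler helps, a library can approximate this by hand. The following is a rough sketch, assuming a hypothetical Vec2 type that keeps its two components in the low lanes of an __m128, so that no __m64 value exists for anything to convert to. The type name, its accessors, and the operators are all made up for this example:

```cpp
#include <xmmintrin.h>  // SSE: __m128 and the _mm_*_ps intrinsics

// Hypothetical 2D vector: x and y live in the two low lanes of an __m128,
// and the two upper lanes are kept at zero. No __m64 appears anywhere.
struct Vec2 {
    __m128 v;

    Vec2(float x, float y) : v(_mm_set_ps(0.0f, 0.0f, y, x)) {}
    explicit Vec2(__m128 raw) : v(raw) {}

    float x() const { return _mm_cvtss_f32(v); }
    float y() const {
        // Broadcast lane 1 so it can be read from lane 0.
        return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)));
    }
};

// Component-wise addition; the zeroed upper lanes stay zero.
inline Vec2 operator+(Vec2 a, Vec2 b) {
    return Vec2(_mm_add_ps(a.v, b.v));
}

// Component-wise absolute value via the same sign-mask trick.
inline Vec2 abs(Vec2 a) {
    return Vec2(_mm_andnot_ps(_mm_set1_ps(-0.0f), a.v));
}
```

Because the raw __m128 constructor is explicit and there is no conversion operator, a Vec2 cannot silently turn into any other vector type. The cost is that half of each register goes unused.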

It took me a good few hours to hunt down this bug, and I couldn’t find a single piece of documentation about it. Hopefully, this post will save others from having to go through the same process.