Atomic Stack (Original)

[This page was originally written on June 8, 2010. I've since made improvements to the "stack" and added a "queue". See Wait Free.]

Atomic operations are the basis for dealing with contention between asynchronously executing code. In theory atomic operations are fairly easy; in practice their application is very difficult. Traditionally locks, semaphores, and mutexes are implemented with some sort of underlying atomic operation, such as test-and-set-lock.
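
As a quick illustration of that layering, here's a minimal spin lock built on test-and-set. I'm assuming C++11's std::atomic_flag here, which is the portable spelling of the operation; this is a sketch of the shape, not how any particular lock implementation works:

#include <atomic>

static std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void lock()
{
    // test-and-set returns the previous value; spin while someone holds it
    while (lock_flag.test_and_set(std::memory_order_acquire))
        ;
}

void unlock()
{
    lock_flag.clear(std::memory_order_release);
}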

The most common operations are atomic_increment (and decrement) and atomic_compare_and_swap. Many algorithms are written using these. Unfortunately the better form of atomics, which uses lock line reservations and conditional updates, is rarely available. I'll go into why they are better in a bit.

So why is it so hard to get things right? For starters, hardware varies widely, with cache lines of differing sizes and levels, and instruction sets that expose more or fewer bytes that can be handled atomically. It gets much worse once you start to consider optimizing C/C++ compilers, which will often reorder instructions and elide copies in ways that will surprise you and break your code; and adding volatile everywhere will cause them to produce horribly inefficient code. Ultimately, though, we humans are just not that good at thinking through all the interactions that are happening, and how they can go wrong.

So let's say you have a machine where sizeof(void*) is 4 (i.e. a 32-bit machine), and you have bool atomic_compare_and_swap_ptr(uintptr_t volatile *address, uintptr_t const& compare_to, uintptr_t const& swap_with) that works. What can you do with it? Let's try to build a stack. (We'll fail, but how we fail is important.)

struct stack_entry {
  stack_entry volatile* next;
  char data;
};
typedef stack_entry volatile** stack;
void stack_push(stack s, stack_entry * n)
// s is the address of a pointer to the top of the stack
// n is the entry to place on top of the stack
{
    for(;;) // keep trying until the CAS succeeds
    {
        stack_entry volatile* top = *s; // read the current top once
        n->next = top; // update n's next to point to the current top
        if (atomic_compare_and_swap_ptr((uintptr_t volatile*)s, (uintptr_t)top, (uintptr_t)n)) // try to swap n with the current top
            return;
    }
}
stack_entry * stack_pop(stack s)
{
    for(;;) // keep trying until the CAS succeeds
    {
        stack_entry *r = (stack_entry *)*s; // top
        // [1]
        if (!r) return r; // underflow, no next pointer since stack is null
        if (atomic_compare_and_swap_ptr((uintptr_t volatile*)s, (uintptr_t)r, (uintptr_t)r->next)) // swap top with next
            return r; // return original top
    }
}
// the above may not be exactly correct as written, but more importantly it's wrong and should never be used.

This results in the A-B-A problem. Let's say our stack looks like this: "abc" (i.e. "a" is on top, followed by "b" then "c"), and we've got two threads accessing the stack. Now thread 0 comes along and starts a pop operation but gets suspended at [1] in the code above; r points to "a" with a next of "b". Now thread 1 comes along, pops off "a" and "b", and pushes "a" back on, now with a next of "c". Eventually thread 0 wakes up, the atomic_compare_and_swap_ptr call is made, and it succeeds in swapping the top to "b", which is completely wrong: "b" is no longer on the stack. Because thread 1 restored the top pointer to the same value it had when thread 0 was suspended, thread 0 didn't see the change to the next pointer.
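
To see the failure concretely, here's a single-threaded simulation of that exact schedule; the cas below is an ordinary function standing in for the atomic, which is fine since we're interleaving by hand:

#include <cstdio>

struct entry { entry* next; char data; };

// plain, non-atomic stand-in for the CAS; fine here because this demo is
// single-threaded and simulates the two-thread interleaving manually
static bool cas(entry** addr, entry* compare_to, entry* swap_with) {
    if (*addr != compare_to) return false;
    *addr = swap_with;
    return true;
}

int main() {
    entry c = { 0, 'c' }, b = { &c, 'b' }, a = { &b, 'a' };
    entry* top = &a;            // the stack reads "abc"

    // thread 0 starts a pop and is "suspended" at [1]:
    entry* r = top;             // r == &a
    entry* r_next = r->next;    // r_next == &b

    // thread 1 runs: pops "a", pops "b", pushes "a" back (next is now "c"):
    top = top->next;            // pop "a"
    top = top->next;            // pop "b"
    a.next = top;               // "a"->next = "c"
    top = &a;                   // push "a"

    // thread 0 resumes; the compare sees "a" on top and succeeds:
    if (cas(&top, r, r_next))
        std::printf("CAS succeeded; top is now '%c', but thread 1 popped it\n",
                    top->data);
    return 0;
}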

(Note: writing atomic_compare_and_swap_ptr((uintptr_t volatile*)s, (uintptr_t)r, (uintptr_t)(*s)->next) doesn't fix the problem. It just moves the window: thread 0 now has to be suspended while the call is being set up, after the third argument is evaluated and before the dispatch occurs.)

(Also note: the loops don't yield, and as far as I know there is no portable way to yield. You'll likely have to yield at some point though (perhaps in the caller), since if you don't, a consumer thread spinning on stack_pop waiting for a producer thread to stack_push could completely block out the producer thread if your priorities are set wrong and your operating system doesn't eventually give time to lower priority threads; say, like on a PS3.)
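
If you do want YIELD to do something, here's a sketch of plausible platform-specific definitions, using calls that do exist on each platform; the test build later in this page just leaves it empty:

#if defined(_WIN32)
    #include <windows.h>
    #define YIELD SwitchToThread(); // give up the rest of this timeslice
#elif defined(__unix__) || defined(__APPLE__)
    #include <sched.h>
    #define YIELD sched_yield();
#endif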

So can we avoid the a-b-a problem? Unfortunately not with atomic_compare_and_swap_ptr (at least if we want to keep using pointers directly). We need either larger atomics, or lock line reservation with conditional update.

With larger atomics we can version the pointer and increment the version on every operation. That way the suspended pop's compare-and-swap will see the same pointer ("a") but a different version, and so will fail. The version is often called a tag. The major drawback of this approach is that many systems don't support atomics larger than pointer size. There is also a small chance of wraparound, but that's very unlikely in real world usage unless you specifically contrive a case for it. (Then again, some of the things hackers have done to break systems are very similar.)

Ideally we'd have lock line reservation with conditional update (GetLLAR/PutLLC if you're familiar with the PS3 SPU) everywhere rather than just on a small number of platforms. Atomics are generally implemented using something like them under the hood anyway. Basically each cache line has a reservation marker indicating which core last asked for a reservation. Any write to the cache line clears it, even if the data written is identical to what was already there. The conditional update checks that the reservation marker still matches the core performing the update, and only writes to the cache line if the reservation is still held. During a thread switch the core has to clear outstanding reservations, so only a limited number of cache lines can be reserved at a time. In practice you should only ever rely on getting one reservation.

The major reason I prefer cache line reservations is that they better reflect what really happens, and they don't limit use to 32, 64, or 128 bit atomics when the cache lines are 512 or 1024 bits in size. Since the smallest amount of real memory that can be updated is a cache line, and inter-CPU synchronization happens at the cache line level, all atomic updates actually happen at cache line granularity anyway; so why not expose it?
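
As an aside: C++11's compare_exchange_weak ended up being the closest portable surface over this hardware. On LL/SC machines it compiles to exactly such a reservation loop, and the weak form is allowed to fail spuriously when the reservation is lost even though the value compares equal. A sketch of the usual retry idiom, assuming C++11 atomics are available:

#include <atomic>
#include <stdint.h>

void atomic_increment(std::atomic<uint32_t>& counter)
{
    uint32_t expected = counter.load(std::memory_order_relaxed);
    // on failure, expected is reloaded with the current value, so just retry;
    // on LL/SC hardware each iteration is a load-linked/store-conditional pair
    while (!counter.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed))
        ;
}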

For now we'll have to live with versioning pointers on systems that allow larger than pointer sized atomics, and avoid using pointers directly on systems with only pointer sized atomics. (Basically artificially reducing pointer space by using an index into an array and adding a version into the bits freed up.)
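
Here's a sketch of that index-plus-version packing; the 20/12 bit split is an arbitrary choice for illustration, driven by your maximum node count and your tolerance for version wraparound:

#include <stdint.h>

typedef uint32_t packed_ref; // fits a pointer-sized atomic on a 32-bit machine
#define INDEX_BITS 20u
#define INDEX_MASK ((1u << INDEX_BITS) - 1u)

// 12-bit version in the high bits, 20-bit array index in the low bits
static inline packed_ref pack_ref(uint32_t version, uint32_t index) {
    return (version << INDEX_BITS) | (index & INDEX_MASK);
}
static inline uint32_t ref_index(packed_ref r)   { return r & INDEX_MASK; }
static inline uint32_t ref_version(packed_ref r) { return r >> INDEX_BITS; }
// nodes[ref_index(top)] takes the place of following a raw pointer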

Now, at first you'll be tempted to use a union to overlay an integer type over the pointer and version; do not do this. All access to atomics in the heap must go through a type marked as volatile, and making a union volatile is a nightmare. The compiler dependent treatment of the volatile keyword is going to cause us enough grief before we're done.

To start with we'll need a pointer + version pair that most compilers will do-the-right-thing with. That basically means it has to be an integer type, not an aggregate, since most compilers will place integer types in a register whenever possible. This presents a problem on 64-bit machines since there is no standard 128-bit integer type. That means we'll have to use macros for manipulating the type, so we can use compiler and SDK specific code for manipulating the MSB and LSB parts of the integer and clearing it to zero.

// 32 bit pointers on a machine with 64 bit atomics
#if HAVE_POINTER32_ATOMIC64
    #define atomic_type uint64_t // note: most platforms also require an alignment here
    #define atomic_clear(x) (x=0)
    #define atomic_ver(x) ((uint32_t)((x) >> 32))
    #define atomic_ptr(x) ((void*)(uint32_t)((x) & 0xFFFFFFFFull))
    #define atomic_set(x,ver,ptr) (x=(((uint64_t)ptr & 0xFFFFFFFFull) | (((uint64_t)ver & 0xFFFFFFFFull) << 32)))
    #define atomic_compare_and_swap_verptr(addr,compare_to,swap_with) // fill in with SDK call
    #define atomic_put_ptr(addr,value) // fill in with SDK call
    #define atomic_get_ptr(addr) // fill in with SDK call
    #define atomic_get_verptr(addr) // fill in with SDK call
#elif HAVE_POINTER64_ATOMIC128
    // one possible implementation of many compiler dependent ones
    struct atomic_int128_t { __int64 i64[2]; }; // note: most platforms also require an alignment here
    #define atomic_type atomic_int128_t
    #define atomic_clear(x) (x.i64[0]=0,x.i64[1]=0)
    #define atomic_ver(x) ((uint64_t)x.i64[1])
    #define atomic_ptr(x) ((void*)x.i64[0])
    #define atomic_set(x,ver,ptr) (x.i64[0]=(int64_t)ptr,x.i64[1]=(int64_t)ver)
    #define atomic_compare_and_swap_verptr(addr,compare_to,swap_with) // fill in with SDK call
    #define atomic_put_ptr(addr,value) // fill in with SDK call
    #define atomic_get_ptr(addr) // fill in with SDK call
    #define atomic_get_verptr(addr) // fill in with SDK call
#elif ... // other cases
#endif

We can now build an atomic stack class (or free functions if you're a C fan):

class atomic_stack
{
private:
    atomic_type volatile top_;
public:
     // some compilers generate empty ctor and dtor rather than eliding them.
     FORCE_INLINE atomic_stack() {}
     FORCE_INLINE ~atomic_stack() {}
     FORCE_INLINE void clear() {
         atomic_clear(top_);
     }
     FORCE_INLINE void push(void * item, size_t offset_to_next_pointer) {
        uintptr_t volatile * const next =
            (uintptr_t volatile *)(((char*)item) + offset_to_next_pointer);
        for(;;) {
            atomic_type original_value = atomic_get_verptr(&top_);
            // item's next pointer to top of stack
            atomic_put_ptr(next, atomic_ptr(original_value));
            atomic_type new_value;
            // next head is same version, points to item
            atomic_set(new_value, atomic_ver(original_value), item);
            // note: we don't bump version in push, see notes.
            COMPILER_READ_WRITE_BARRIER
            if (atomic_compare_and_swap_verptr(&top_, original_value, new_value))
                return;
            YIELD
        }
     }
     FORCE_INLINE void * pop(size_t offset_to_next_pointer) {
        for(;;) {
            atomic_type original_value = atomic_get_verptr(&top_);
            void * item = atomic_ptr(original_value);
            if (!item) return 0;
            uintptr_t volatile * const next =
                (uintptr_t volatile *)(((char*)item) + offset_to_next_pointer);
            atomic_type new_value;
            atomic_set(new_value, atomic_ver(original_value)+1, atomic_get_ptr(next));
            COMPILER_READ_WRITE_BARRIER
            if (atomic_compare_and_swap_verptr(&top_, original_value, new_value))
                return item;
            YIELD
        }
     }
    private:
        // don't allow copies
        atomic_stack(atomic_stack const&);
        atomic_stack& operator=(atomic_stack const&);
};

I use FORCE_INLINE on push and pop. It is likely that the atomic_stack will be used as the basis for a more complicated data access pattern that exposes a few functions using push and pop, so the code bloat won't be significant. It is unfortunate that C++ doesn't allow inlining to be specified at the call site, so we have to inline by default and rely on the user to wrap the calls in a non-inlined function or method if they intend to call push and pop from many places and want to reduce code bloat.

We only bump the version in pop, since an item already in the stack should never be re-added to the stack. You might also notice that we've added a compiler read-write barrier before the compare and swap call. That ensures the store to the next pointer of the item doesn't get reordered to after the compare and swap call. (Some compilers don't provide this, or don't honor it properly; for example the Xbox 360 compiler requires additional volatiles to get the order correct, which also generate unnecessary load-hit-stores and cost performance.)
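
For reference, the test build below defines the barrier with MSVC's _ReadWriteBarrier intrinsic; the GCC-family equivalent is the well-known empty asm with a memory clobber. Both are compile-time fences only, constraining the compiler and not the hardware:

// stops the compiler from moving memory accesses across this point,
// but emits no fence instruction
#if defined(_MSC_VER)
    #include <intrin.h>
    #define COMPILER_READ_WRITE_BARRIER _ReadWriteBarrier();
#elif defined(__GNUC__)
    #define COMPILER_READ_WRITE_BARRIER __asm__ __volatile__("" ::: "memory");
#endif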

I use a clear method rather than having the constructor (ctor) do the initialization. I aim to have my classes be valid when initialized to all zeros, so my allocator can efficiently zero out the memory; often the clear call goes unused.
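
Sketched with plain calloc standing in for such an allocator (and assuming you're comfortable treating the zeroed memory as a live object, which is exactly what the zero-valid design is for):

#include <stdlib.h>

atomic_stack * s = (atomic_stack*)calloc(1, sizeof(atomic_stack));
// top_ is all zero bits, which already means "empty"; no s->clear() needed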

At this point some of you are probably wondering how we're going to test this. One major problem with developing any program is that what we write and what the compiler generates are different things. I've already alluded to this by adding a macro that we need to replace with a compiler specific directive to prevent reordering of stores. Different compilers and architectures present different optimization techniques, and many of them assume a single-threaded program. Reordering of loads and stores is just one problem.

Take this code for instance:

void f(int32_t * addr)
{
    int32_t const temp = *addr;
    int32_t const temp2 = temp + 1;
    *addr = temp2;
}
// somewhere where f is used:
    ...
    int32_t i;
    ...
    f(&i);
    ...

The call to f(&i) will likely be assembled into something like this:

    ;; risc-like r0 contains the address of i
    ld   r1, r0
    ai   r1, r1, 1
    st   r1, r0

That is to say no call will be made (it'll be inlined), and all of the temporary variables will be elided. If the machine is capable of acting on memory directly it may even boil down to a single instruction that increments the value in memory rather than a function call.

Now, if we consider the impact this has on our atomics code above, you can see why barriers alone are not enough:

void f(int32_t * addr)
{
    for(;;)
    {
        int32_t const original_value = *addr;
        int32_t const new_value = original_value + 1;
        if(atomic_compare_and_swap(addr, original_value, new_value))
            return;
    }
}

can easily turn into:

L0:
    ld    r1, r0
    ai    r2, [r0], 1
    brsl  atomic_compare_and_swap
    bnz   r3, L0

instead of:

L0:
    ld    r1, r0
    ai    r2, r1, 1
    brsl  atomic_compare_and_swap
    bnz   r3, L0

That is, the compiler might assume it can read the value from memory twice and get the same result. (Hence we mark the memory as volatile, to indicate to the compiler that it can be changed by operations not executing in the current thread.) I've actually had a compiler generate code like this.
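
Here's a sketch of the volatile-qualified version: with the pointer marked volatile the compiler must emit exactly one load per source-level read, so the addition and the CAS arguments all see the same snapshot. atomic_compare_and_swap is the same hypothetical function as above.

void f(int32_t volatile * addr)
{
    for(;;)
    {
        int32_t const original_value = *addr;         // exactly one load
        int32_t const new_value = original_value + 1; // uses the snapshot
        if (atomic_compare_and_swap(addr, original_value, new_value))
            return;
    }
}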

Now, given that we can't completely control the assembly code generated by the compiler, we've got some issues with testing our atomic_stack class. Any tests we write will have to be executed by a real machine, and it is very possible the compiler for that machine will produce code that works, even if we've made an error, while a compiler for another machine would produce code that breaks because of our error. Also, the window for a race can be as small as a single instruction interleaving between two threads, one that occurs so rarely (one in billions) it won't show up regularly in our tests. There is really no way to test this code to a one hundred percent level of confidence using standard testing practices. Each machine we're going to use this code on is going to require us to disassemble and analyze the generated code so we can be sure it works. Worse yet, if we want to stay one hundred percent sure the code is working, we'll have to do this every time the compiler changes its output as new optimizations are added. (Yes, we could build an emulator and simulate every conceivable instruction interleaving between several threads, but that's not exactly a standard testing practice, and it still leaves us having to test each machine and compiler combination anyway.) This is a really good argument for having this kind of algorithm be a built-in feature of a language rather than something each user should develop.

To go further at this point we need to pick a machine and see what we get. I'm going to use Visual Studio Express 2008 on a 32-bit Intel machine.

To start with we'll need the defines and platform includes:

#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>
typedef unsigned __int64 uint64_t;
typedef unsigned __int32 uint32_t;
#define atomic_type uint64_t // note: most platforms also require an alignment here
#define atomic_clear(x) (x=0)
#define atomic_ver(x) ((uint32_t)((x) >> 32))
#define atomic_ptr(x) ((void*)((x) & 0xFFFFFFFFull))
#define atomic_set(x,ver,ptr) (x=(((uint64_t)ptr & 0xFFFFFFFFull) | (((uint64_t)ver & 0xFFFFFFFFull) << 32)))
#define atomic_compare_and_swap_verptr(addr, compare_to, swap_with) (_InterlockedCompareExchange64((volatile __int64 *)addr, swap_with, compare_to)==compare_to)
#define atomic_put_ptr(addr,value) (*(uint32_t volatile*)addr=(uint32_t)value)
#define atomic_get_ptr(addr) (*(uint32_t volatile*)addr)
#define atomic_get_verptr(addr) (*(uint64_t volatile*)addr)
#define FORCE_INLINE __forceinline
#define YIELD
#define COMPILER_READ_WRITE_BARRIER _ReadWriteBarrier();

That gets our atomic_stack code compiling.

A small test:

#include <tchar.h>
#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>
#include <process.h>
#include <windows.h>
#include <assert.h>
typedef unsigned __int64 uint64_t;
typedef unsigned __int32 uint32_t;
#define atomic_type uint64_t // note: most platforms also require an alignment here
#define atomic_clear(x) (x=0)
#define atomic_ver(x) ((uint32_t)((x) >> 32))
#define atomic_ptr(x) ((void*)((x) & 0xFFFFFFFFull))
#define atomic_set(x,ver,ptr) (x=(((uint64_t)ptr & 0xFFFFFFFFull) | (((uint64_t)ver & 0xFFFFFFFFull) << 32)))
#define atomic_compare_and_swap_verptr(addr, compare_to, swap_with) (_InterlockedCompareExchange64((volatile __int64 *)addr, swap_with, compare_to)==compare_to)
#define atomic_put_ptr(addr,value) (*(uint32_t volatile*)addr=(uint32_t)value)
#define atomic_get_ptr(addr) (*(uint32_t volatile*)addr)
#define atomic_get_verptr(addr) (*(uint64_t volatile*)addr)
#define FORCE_INLINE __forceinline
#define YIELD
#define COMPILER_READ_WRITE_BARRIER _ReadWriteBarrier();
class atomic_stack {
private:
    atomic_type volatile top_;
public:
     // some compilers generate empty ctor and dtor rather than eliding them.
     FORCE_INLINE atomic_stack() {}
     FORCE_INLINE ~atomic_stack() {}
     FORCE_INLINE void clear() {
         atomic_clear(top_);
     }
     FORCE_INLINE void push(void * item, size_t offset_to_next_pointer) {
        uintptr_t volatile * const next =
            (uintptr_t volatile *)(((char*)item) + offset_to_next_pointer);
        for(;;) {
            atomic_type original_value = atomic_get_verptr(&top_);
            // item's next pointer to top of stack
            atomic_put_ptr(next, atomic_ptr(original_value));
            atomic_type new_value;
            // next head is same version, points to item
            atomic_set(new_value, atomic_ver(original_value), item);
            // note: we don't bump version in push, see notes.
            COMPILER_READ_WRITE_BARRIER
            if (atomic_compare_and_swap_verptr(&top_, original_value, new_value))
                return;
            YIELD
        }
     }
     FORCE_INLINE void * pop(size_t offset_to_next_pointer) {
        for(;;) {
            atomic_type original_value = atomic_get_verptr(&top_);
            void * item = atomic_ptr(original_value);
            if (!item) return 0;
            uintptr_t volatile * const next =
                (uintptr_t volatile *)(((char*)item) + offset_to_next_pointer);
            atomic_type new_value;
            atomic_set(new_value, atomic_ver(original_value)+1, atomic_get_ptr(next));
            COMPILER_READ_WRITE_BARRIER
            if (atomic_compare_and_swap_verptr(&top_, original_value, new_value))
                return item;
            YIELD
        }
     }
    private:
        // don't allow copies
        atomic_stack(atomic_stack const&);
        atomic_stack& operator=(atomic_stack const&);
};
class user_data {
public:
    user_data * next_; // offset 0 makes it easy
    char data_;
    int last_thread_;
    bool crumb_;
    FORCE_INLINE user_data() {}
    FORCE_INLINE ~user_data() {}
private:
    user_data(user_data const&);
    user_data& operator=(user_data const&);
};
static void fail(char const * msg, char const * file, int line) {
    printf("\nFAIL %s(%d):\n\t%s\n", file, line, msg);
    exit(-1);
}
#define FAIL(x) fail( x , __FILE__, __LINE__)
class thread_args {
public:
    atomic_stack *atomic_stack_;
    int thread_;
};
unsigned int __stdcall thread_proc(void *args) {
    thread_args *thread_args_ = reinterpret_cast<thread_args*>(args);
    atomic_stack * atomic_stack_ = thread_args_->atomic_stack_;
    int thread_ = thread_args_->thread_;
    //printf("thread %d\n", thread_);
    static size_t const swap_count = 1024*1024;
    for(size_t i = 0; i < swap_count; ++i) {
        user_data * a;
        user_data * b;
        a = reinterpret_cast<user_data*>(atomic_stack_->pop(0));
        assert(a!=0);
        b = reinterpret_cast<user_data*>(atomic_stack_->pop(0));
        assert(b!=0);
        a->last_thread_ = thread_;
        b->last_thread_ = thread_;
        //printf("%d swap %c %c\n", thread_, a->data_, b->data_);
        atomic_stack_->push(b, 0);
        atomic_stack_->push(a, 0);
    }
    _endthreadex(0);
    return 0;
}
int _tmain(int argc, _TCHAR* argv[]) {
    static size_t const thread_count = 8;
    static size_t const data_count = 2 * thread_count; // just enough that no pop in thread_proc can ever see an empty stack
    user_data user_data_[data_count];
    for(size_t i=0; i< data_count; ++i) {
        user_data_[i].last_thread_ = -1;
        user_data_[i].next_ = 0;
        user_data_[i].data_ = 'a' + i;
        user_data_[i].crumb_ = false;
    }
    atomic_stack atomic_stack_;
    atomic_stack_.clear();
    if(atomic_stack_.pop(0)) FAIL("atomic_stack_.pop(0) returned non-null pointer.");
    atomic_stack_.push(user_data_ + 0, 0);
    atomic_stack_.push(user_data_ + 1, 0);
    user_data *d;
    d = reinterpret_cast<user_data*>(atomic_stack_.pop(0));
    if (d->data_ != 'b') FAIL("first pop should return last push.");
    printf("%c\n", d->data_);
    d = reinterpret_cast<user_data*>(atomic_stack_.pop(0));
    if (d->data_ != 'a') FAIL("last pop should return first push.");
    printf("%c\n", d->data_);
    for(size_t i = 0; i < data_count; ++i)
        atomic_stack_.push(user_data_ + i, 0);
    HANDLE thread_handle_[thread_count];
    thread_args thread_args_[thread_count];
    for(size_t i = 0; i < thread_count; ++i) {
        thread_args_[i].thread_ = i;
        thread_args_[i].atomic_stack_ = &atomic_stack_;
        uintptr_t r = _beginthreadex( 0, 0, &thread_proc, thread_args_ + i, 0, 0);
        if (r == 0) // _beginthreadex returns 0 on failure, not -1
            FAIL("failed to create thread\n");
        thread_handle_[i] = (HANDLE)r;
    }
    WaitForMultipleObjects(thread_count, thread_handle_, true, INFINITE);
    for(;;) {
        d = reinterpret_cast<user_data*>(atomic_stack_.pop(0));
        if (!d) break;
        d->crumb_ = true;
        printf("%c %d\n", d->data_, d->last_thread_);
    }
    for(size_t i = 0; i < data_count; ++i)
        if(!user_data_[i].crumb_) FAIL("user_data_ escaped.\n");
    return 0;
}

This confirms basic functionality and tests multiple threads pushing and popping from a stack.

Now let's look under the hood and check up on the compiler. First make sure you're building the above test in Release. Now place a breakpoint somewhere in main and run it in the debugger. Right-click and pick "Go To Disassembly".

The first two pushes turn into the following (the ;; annotations to the right are mine):

    atomic_stack_.push(user_data_ + 0, 0);
00401288  lea         eax,[esp+50h]           ;; eax = address of user_data[0] ;
0040128C  cdq                                 ;; sign extend eax into edx:eax ;
0040128D  mov         dword ptr [esp+20h],eax ;; save the address of user_data[0] on the stack, if we loop we'll need it again ;
00401291  mov         dword ptr [esp+24h],0   ;; ?? ;
                                              ;; retry loop:
00401299  mov         esi,dword ptr [esp+10h] ;;  load the pointer half for top_ from the atomic stack ;
0040129D  mov         edi,dword ptr [esp+14h] ;;  load the version half for top_ from the atomic stack ;
004012A1  xor         ebx,ebx                 ;;  clear ebx to zero ;
004012A3  or          ebx,dword ptr [esp+20h] ;;  read the address of user_data[0] from the stack in ebx. ;
004012A7  mov         ecx,edi                 ;;  ecx is top_ version, ecx:ebx is new top_ value ;
004012A9  or          ecx,dword ptr [esp+24h] ;;  oring with 0, not sure why ;
004012AD  mov         dword ptr [esp+50h],esi ;;  update next pointer of user_data[0] to top_ ;
004012B1  mov         eax,esi                 ;;  atomic stack top_ pointer in eax ;
004012B3  mov         edx,edi                 ;;  version, edx:eax is original top_ value ;
004012B5  lea         ebp,[esp+10h]           ;;  address of top_ from the atomic stack ;
004012B9  lock cmpxchg8b qword ptr [ebp]      ;;  compare edx:eax to top_ and exchange with ecx:ebx on a match, return result in edx:eax ;
004012BE  cmp         eax,esi                 ;;  compare returned pointer to new pointer ;
004012C0  jne         wmain+0B9h (401299h)    ;;  if different try again ;
004012C2  cmp         edx,edi                 ;;  compare returned version to new version ;
004012C4  jne         wmain+0B9h (401299h)    ;;  if different try again ;
    atomic_stack_.push(user_data_ + 1, 0);
004012C6  lea         eax,[esp+60h]           ;; eax = address of user_data[1] ;
004012CA  cdq                                 ;; sign extend eax into edx:eax ;
004012CB  mov         dword ptr [esp+28h],eax ;; save address of user_data[1] on the stack, if we loop we'll need it again ;
                                              ;; retry loop: ;
004012CF  mov         esi,dword ptr [esp+10h] ;;  load the pointer half for top_ from the atomic stack ;
004012D3  mov         edi,dword ptr [esp+14h] ;;  load the version half for top_ from the atomic stack ;
004012D7  xor         ebx,ebx                 ;;  zero ebx ;
004012D9  or          ebx,dword ptr [esp+28h] ;;  read the address of user_data[1] from the stack ;
004012DD  mov         ecx,edi                 ;;  ecx is the version from top_ ;
004012DF  xor         eax,eax                 ;;  zero eax ;
004012E1  mov         dword ptr [esp+60h],esi ;;  update next pointer of user_data[1] to top_ ;
004012E5  or          ecx,eax                 ;;  version from top_, ecx:ebx is new top_ ;
004012E7  mov         eax,esi                 ;;  pointer from top_ ;
004012E9  mov         edx,edi                 ;;  version from top_, edx:eax is original top_ ;
004012EB  lea         ebp,[esp+10h]           ;;  address for atomic stack top_ ;
004012EF  lock cmpxchg8b qword ptr [ebp]      ;;  compare edx:eax to top_ and exchange with ecx:ebx on a match, return result in edx:eax ;
004012F4  cmp         eax,esi                 ;;  compare new pointer ;
004012F6  jne         wmain+0EFh (4012CFh)    ;;  if different try again ;
004012F8  cmp         edx,edi                 ;;  compare new version ;
004012FA  jne         wmain+0EFh (4012CFh)    ;;  if different try again ;

Oddly, it generates different code for the two calls, but both are correct: we can see the next pointer is updated before the cmpxchg happens, and top_ is read only once to get both the original and new values.

Now let's look at the first pop:

    d = reinterpret_cast<user_data*>(atomic_stack_.pop(0));
004012FC  mov         esi,dword ptr [esp+10h] ;; pointer of top_ atomic stack ;
00401300  mov         edi,dword ptr [esp+14h] ;; version of top_ atomic stack ;
00401304  test        esi,esi                 ;; this is a surprise, the compiler is doing something we didn't ask it to ;
00401306  je          wmain+16Bh (40134Bh)    ;;  and checking for a null pointer and jumping down to was_null ;
00401308  xor         edx,edx                 ;; zero edx ;
0040130A  push        1                       ;; ?? ;   
0040130C  mov         ecx,edi                 ;; ecx gets a copy of the version of top_ ;
0040130E  add         ecx,1                   ;; increment the version ;
00401311  push        0                       ;; ?? ;
00401313  adc         edx,edx                 ;; ;
00401315  push        edx                     ;; ;
00401316  push        ecx                     ;; ;
00401317  call        _allmul (401EA0h)       ;; what the? ;
0040131C  mov         ebx,eax                 ;; ;
0040131E  mov         eax,dword ptr [esi]     ;; ;
00401320  mov         ecx,edx                 ;; ;
00401322  xor         ebx,ebx                 ;; ;
00401324  xor         edx,edx                 ;; ;
00401326  or          ebx,eax                 ;; ;
00401328  or          ecx,edx                 ;; ;
0040132A  mov         eax,esi                 ;; ;
0040132C  mov         edx,edi                 ;; ;
0040132E  lea         ebp,[esp+10h]           ;; ;
00401332  lock cmpxchg8b qword ptr [ebp]      ;;  compare edx:eax to top_ and exchange with ecx:ebx on a match, return result in edx:eax ;
00401337  cmp         eax,esi                 ;; ;
00401339  jne         wmain+15Fh (40133Fh)    ;; ;
0040133B  cmp         edx,edi                 ;; ;
0040133D  je          wmain+16Dh (40134Dh)    ;; ;
0040133F  mov         esi,dword ptr [esp+10h] ;; ;
00401343  mov         edi,dword ptr [esp+14h] ;; ;
00401347  test        esi,esi                 ;; ;
00401349  jne         wmain+128h (401308h)    ;; ;
0040134B  xor         esi,esi                 ;; was_null: zero esi ;
    if (d->data_ != 'b') FAIL("first pop should return last push.");
0040134D  cmp         byte ptr [esi+4],62h    ;; compare data returned to 'b' ;
00401351  je          $LN115 (401371h)        ;; we're good, jump past the error handler ;
00401353  mov         ecx,0A1h                ;; ;
00401358  mov         eax,offset string "first pop should return last pus"... (402150h)
0040135D  call        fail (401020h)          ;; report failure ;
... _allmul is way down below
 
--- f:\dd\vctools\crt_bld\SELF_X86\crt\src\INTEL\llmul.asm ---------------------
00401EA0  mov         eax,dword ptr [esp+8]
00401EA4  mov         ecx,dword ptr [esp+10h]
00401EA8  or          ecx,eax
00401EAA  mov         ecx,dword ptr [esp+0Ch]
00401EAE  jne         hard (401EB9h)
00401EB0  mov         eax,dword ptr [esp+4]
00401EB4  mul         eax,ecx
00401EB6  ret         10h 
00401EB9  push        ebx 
00401EBA  mul         eax,ecx
00401EBC  mov         ebx,eax
00401EBE  mov         eax,dword ptr [esp+8]
00401EC2  mul         eax,dword ptr [esp+14h]
00401EC6  add         ebx,eax
00401EC8  mov         eax,dword ptr [esp+8]
00401ECC  mul         eax,ecx
00401ECE  add         edx,ebx
00401ED0  pop         ebx 
00401ED1  ret         10h 

Holy crap. The compiler sure spit out a lot of code for pop, and the _allmul call looks really fishy.

It turns out this particular compiler is turning the atomic_set math into a 64-bit multiply, and fiddling with the optimization settings doesn't help. We can write atomic_set differently in this case to avoid the problem:

#define atomic_set(x,ver,ptr) (x=ver,x<<=32,x|=(uint64_t)ptr)

This improves the code generation; push becomes:

    atomic_stack_.push(user_data_ + 0, 0);
004011E4  lea         eax,[esp+50h]            ;; eax is the address of user_data[0] ;
004011E8  cdq                                  ;; sign extend into edx:eax, see notes;
004011E9  mov         dword ptr [esp+18h],eax  ;; save addr on stack ;
004011ED  mov         dword ptr [esp+1Ch],edx  ;; save addr on stack, see notes ;
                                               ;; retry_loop: ;
004011F1  mov         esi,dword ptr [esp+10h]  ;;  load pointer from top_ ;
004011F5  mov         edi,dword ptr [esp+14h]  ;;  load version from top_ ;
004011F9  xor         ebx,ebx                  ;;  zero ebx ;
004011FB  or          ebx,dword ptr [esp+18h]  ;;  address of user_data[0] from stack ;
004011FF  mov         ecx,edi                  ;;  copy version from top_ ;
00401201  or          ecx,dword ptr [esp+1Ch]  ;;  sign extended part of address of user_data[0] from stack ! ;
00401205  mov         dword ptr [esp+50h],esi  ;;  update next_ in user_data[0] to pointer from top_ ;
00401209  mov         eax,esi                  ;;  copy pointer from top_ into eax ;
0040120B  mov         edx,edi                  ;;  copy version from top_ into edx ;
0040120D  lea         ebp,[esp+10h]            ;;  ebp is address of top_ ;
00401211  lock cmpxchg8b qword ptr [ebp]       ;;  compare edx:eax to top_ and exchange with ecx:ebx on a match, return result in edx:eax ;
00401216  cmp         eax,esi                  ;;  compare result to ecx:ebx and retry if exchange failed ;
00401218  jne         wmain+71h (4011F1h)      ;;   '' ;
0040121A  cmp         edx,edi                  ;;   '' ;
0040121C  jne         wmain+71h (4011F1h)      ;;   '' ;

It is still doing a needless, and potentially harmful, cdq on the pointer to the data to be pushed. If the memory management system returns a pointer with the high bit set, our versioning will fail: the cdq will fill edx with all ones, and the subsequent or of the version with the saved edx value from the stack will result in the version being all ones. This could lead to an A-B-A failure that would be very hard to track down.

Using a union we can avoid the sign extension:

#define atomic_set(x,ver,ptr) do { \
  union { struct { uint32_t l; uint32_t h; }; uint64_t v; } t; \
  t.h = ver; t.l = (uint32_t)ptr; \
  x = t.v; } while (0)

Now push becomes:

    atomic_stack_.push(user_data_ + 0, 0);
004011E0  mov         esi,dword ptr [esp+10h] ;; pointer from top_ ;
004011E4  mov         ecx,dword ptr [esp+14h] ;; version from top_ ;
004011E8  mov         dword ptr [esp+40h],esi ;; update next_ in user_data[0] ;
004011EC  mov         eax,esi                 ;; pointer from top_ ;
004011EE  mov         edx,ecx                 ;; version from top_, edx:eax is original value for top_ ;
004011F0  lea         edi,[esp+10h]           ;; address of atomic (top_) ;
004011F4  lea         ebx,[esp+40h]           ;; user_data[0] ecx:ebx is new value for top_ ;
004011F8  lock cmpxchg8b qword ptr [edi]      ;; compare edx:eax to top_ (edi), exchange with ecx:ebx on a match, result in edx:eax ;
004011FC  cmp         eax,esi                 ;; retry on failure ;
004011FE  jne         wmain+90h (4011E0h)     ;;  '' ;
00401200  cmp         edx,ecx                 ;;  '' ;
00401202  jne         wmain+90h (4011E0h)     ;;  '' ;

Which is exactly what we want.

And pop becomes:

    d = reinterpret_cast<user_data*>(atomic_stack_.pop(0));
00401228  mov         esi,dword ptr [esp+10h] ;; pointer from top_ ;
0040122C  mov         edi,dword ptr [esp+14h] ;; version from top_ ;
00401230  test        esi,esi                 ;; check for null pointer ;
00401232  je          wmain+10Ah (40125Ah)    ;;  jump to error if null ;
00401234  mov         ebx,dword ptr [esi]     ;; load next_ from top_ entry (will be new top_);
00401236  mov         ecx,edi                 ;; copy version ;
00401238  inc         ecx                     ;; increment version, ecx:ebx is now value for top_ (version+1, top_ entry's next_) ;
00401239  mov         eax,esi                 ;; pointer from top_ ;
0040123B  mov         edx,edi                 ;; version from top_ edx:eax is original value for top_ ;
0040123D  lea         ebp,[esp+10h]           ;; address of atomic (top_) ;
00401241  lock cmpxchg8b qword ptr [ebp]      ;; compare edx:eax to top_ (ebp), exchange with ecx:ebx on a match, result in edx:eax;
00401246  cmp         eax,esi                 ;; retry on failure ;
00401248  jne         wmain+0FEh (40124Eh)    ;;  '' ;
0040124A  cmp         edx,edi                 ;;  '' ;
0040124C  je          wmain+10Ch (40125Ch)    ;;  '' ;

Which is also exactly what we want. Note that I said earlier we'd avoid using a union; the only reason we get away with one here is that atomic_set is only applied to the temporary, never to the volatile value in the heap.

If we hadn't checked up on the compiler we'd have a slower, more bloated program, and potentially an incorrect version of the code with a difficult to track down bug that our unit test failed to find. By analyzing the assembly code produced we can be confident the code will work for the platform we're targeting.