Wait Free (Atomic MRMW Queues)

July 1, 2010

I'm going to present a pair of "Wait Free" containers. The first is based on the Atomic Stack I started with back in June. The second is an Atomic Queue. Before I do, I'm going to explain a bit about what "Wait Free" means and how it differs from the standard approaches to building containers for multi-threaded applications.

Wait Free

Wait Free operations on a container do not block or spin, nor do they cause other threads to block or spin. Spinning in this case means retrying the operation even though the state of the container being operated on hasn't changed. Retrying an operation when another thread has changed the container is acceptable, because that means progress has been made somewhere else in the system. This may sound simple, but it isn't. The traditional approach of using locks to protect accesses to the container does not qualify as Wait Free, since taking a lock blocks out all other threads, and if the thread holding the lock is suspended or crashes, the other threads in the system remain blocked.

Wait Free algorithms can, generally, be composed without worry. Since no blocking occurs, typical problems like deadlocks can't happen. The major issue with Wait Free algorithms is proving that they are in fact Wait Free. That analysis requires examining the assembly code, the hardware used to execute it, and the nature of the multi-threading approach used by the system. In general one must assume that a thread can be swapped out between any two assembly instructions. That leads to issues when working with expressions that look atomic when written in C/C++, but are in fact not atomic once compiled.

For example:

uint64_t x = *(uint64_t volatile*)addr;

On a 32-bit architecture that expression will compile to two load instructions. That can result in one thread loading the top half of the 64-bit value, another thread then modifying the 64-bit value, and the original thread then loading the bottom half of the new value, mixing the old and new values into an incorrect third value. This is actually a mistake I made in the original Atomic Stack I posted in June. In that case it didn't break the algorithm, because only one half of the 64-bit value was being used to make decisions.
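
If a full 64-bit load really is needed on a 32-bit target, one way to get it atomically is to issue a compare-and-swap with a dummy expected value. The sketch below is my illustration (it assumes GCC's __sync builtins and an x86 target with cmpxchg8b), not code from the attached implementation:

#include <stdint.h>

// Minimal sketch: read a 64-bit value atomically on a 32-bit x86 target.
// The compare-and-swap compiles to a single locked compare-exchange
// (cmpxchg8b); if *addr happens to equal the dummy expected value it stores
// the same value back, and in every case the value it returns was loaded in
// one atomic operation, so it cannot mix the halves of two different writes.
uint64_t atomic_read64(uint64_t volatile * addr)
{
    return __sync_val_compare_and_swap(addr, (uint64_t)0, (uint64_t)0);
}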

Stack

Most of my original Atomic Stack is still valid. I've updated it so it no longer requires a clear method; instead the constructor preserves the version field. It also maintains the three low bits of the version field in a fashion that is compatible with the Queue implementation, and it now requires items to be descendants of atomicial rather than using an offset-to-next-pointer approach.

A stack in the single threaded world has last-in-first-out (LIFO) ordering. When we enter the world of multi-producer-multi-consumer (MPMC) multi-threading, the order in which elements are dequeued becomes somewhat more nebulous. If we think of the stack as a means of transmitting messages from several producers to several consumers, where each consumer is equal, then it would be more accurate to say that messages produced later in time have a higher probability of being processed next. We can't guarantee processing order because a consumer thread can be suspended between removing an element from the stack and processing it, allowing another thread to remove the next element and process it before the first one is processed.

The general approach to building a Wait Free stack is much simpler than for a Wait Free queue. Since a stack adds and removes elements at the same location (the top of the stack), a single compare and swap against that location handles both push and pop. The value must also contain a version to avoid A-B-A issues. (See the original Atomic Stack post.)
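
As an illustration of that shape, here is a minimal sketch of the push and pop loops. This is not the posted Atomic Stack; it assumes a 32-bit target so a pointer and a 32-bit version can be packed into one 64-bit word, and it assumes GCC's __sync builtins:

#include <stdint.h>

struct node { node * next_; };

// Sketch of a stack whose head is a pointer packed with a version counter so
// that a node which is popped and later pushed again can't be mistaken for an
// unchanged head (the A-B-A problem).  Assumes nodes stay allocated while the
// stack is in use, and 32-bit pointers so both halves fit in one uint64_t.
struct versioned_stack
{
    uint64_t volatile head_;   // low 32 bits: node pointer, high 32 bits: version

    versioned_stack() : head_(0) {}

    static uint64_t pack(node * n, uint32_t v) { return ((uint64_t)v << 32) | (uint32_t)(uintptr_t)n; }
    static node *   ptr(uint64_t h)            { return (node *)(uintptr_t)(uint32_t)h; }
    static uint32_t ver(uint64_t h)            { return (uint32_t)(h >> 32); }

    void push(node * e)
    {
        for (;;)
        {
            uint64_t h = head_;   // a torn read is harmless here: the CAS below
            e->next_ = ptr(h);    // only succeeds if this snapshot still matches head_
            if (__sync_bool_compare_and_swap(&head_, h, pack(e, ver(h) + 1)))
                return;
            // CAS failed: another thread changed the stack, so retry.
        }
    }

    node * pop()
    {
        for (;;)
        {
            uint64_t h = head_;
            node * top = ptr(h);
            if (!top) return 0;   // empty
            if (__sync_bool_compare_and_swap(&head_, h, pack(top->next_, ver(h) + 1)))
                return top;
        }
    }
};

Retrying after a failed compare and swap fits the definition above: the CAS only fails because some other thread changed the stack, which means progress was made somewhere in the system.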

Queue

In a single threaded world a queue has first-in-first-out (FIFO) ordering. In the MPMC world this translates more closely to: elements added earlier in time have a greater probability of being dequeued next. The challenge in building a Wait Free queue comes from having two places, a head and a tail, where the queue can be modified.

For a single threaded queue we have:

#include <assert.h>

struct item { item * next_; };

struct queue
{
  item * head_;
  item * tail_;

  queue() : head_(0), tail_(0) {}

  void enq(item * e) {
    e->next_ = 0;
    if (tail_) {
      tail_->next_ = e;   // append after the current tail
      tail_ = e;
    } else {
      tail_ = e;          // empty queue: e becomes both head and tail
      head_ = e;
    }
  }

  item * deq() {
    if (!head_) return 0;           // empty
    item * temp = head_;
    if (head_ == tail_) {
      assert(head_->next_ == 0);
      head_ = 0;                    // removing the last element empties the queue
      tail_ = 0;
    } else {
      head_ = head_->next_;
    }
    return temp;
  }
};

The first issue with turning this into a Wait Free queue comes from having a queue with only one element in it. In that case the head and tail both point to the same element. Let's say we dequeue using a compare and swap to move head_ to head_->next_. At the same time another thread is doing an enqueue, using a compare and swap on tail_->next_ from 0 to e. The value the dequeue swaps in for the new head can be 0, at which point we clear out the tail even though the enqueue's compare and swap succeeded in adding e, leaving e linked behind an element that is no longer in the queue. This happens because the two addresses being used are not the same (one thread is updating head_, the other is updating tail_->next_) but values read from those addresses are being used to make decisions in both threads. The easiest way to avoid this is to create a sentinel element that is treated specially by enqueue and dequeue and prevents the head and tail from ever pointing to the same (non-sentinel) element. The special treatment the sentinel is given involves having both enqueue and dequeue take part in moving it from the head_ position to the tail_ position whenever it arrives at the head and a dequeue is requested. I refer to this as "change over".

Change over involves moving the sentinel from the head to the tail. We cannot dequeue it and re-enqueue it like a normal element, because if we did we'd run into the same race outlined above. Instead the sentinel carries two bits of state in the low bits of the version assigned to its next pointer. While we're on that subject, another bit is also going to be used to indicate whether an element is a member of a queue; so we'll be reserving three bits in total. The states the sentinel can be in are 'clear', 'remove', and 'insert'. While in the queue the sentinel is in the 'clear' state, which means operations on the queue can ignore it. When the sentinel reaches the head of the queue and a dequeue is requested we compare-and-swap the sentinel to the 'remove' state. Every operation on the queue checks the sentinel's state before proceeding (but after chase_tail, which I'll describe next), and if it is in the 'remove' state and the head still points to the sentinel, the operation attempts to move the head forward to the sentinel's next pointer. If the head isn't pointing to the sentinel (some thread has succeeded in moving the head past it) then the operation tries to change the sentinel's state to 'insert' with a pointer of null. Operations that see the 'insert' state check the tail pointer and, if it is not the sentinel, attempt to set the tail's next to the sentinel and move the tail to the sentinel. Once a thread succeeds in enqueuing the sentinel at the tail, operations seeing the 'insert' state attempt to swap the sentinel's state to 'clear' (keeping the pointer null to terminate the queue). Because all operations on the queue take part in the change over we avoid the single element race condition.
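
To make that bookkeeping a little more concrete, here is one possible layout for the three reserved low bits of the versioned next pointer. The exact bit assignments are my illustration and may not match the attached code:

#include <stdint.h>

// Illustrative layout of the three low bits reserved in each element's
// next-pointer version (the attached code may assign them differently):
//
//   bit 0    : element is currently linked into a queue (and so a valid tail)
//   bits 1-2 : the sentinel's change-over state
//
enum sentinel_state
{
    state_clear  = 0,  // sentinel is an ordinary member of the queue; ignore it
    state_remove = 1,  // sentinel is at the head; help move head_ past it
    state_insert = 2   // sentinel is unlinked; help re-enqueue it at the tail
};

uint64_t const in_queue_bit = 1;        // bit 0
uint64_t const state_shift  = 1;
uint64_t const state_mask   = 3 << state_shift;
uint64_t const version_step = 1 << 3;   // the real version counts upward in steps of 8

sentinel_state state_of(uint64_t version)
{
    return (sentinel_state)((version & state_mask) >> state_shift);
}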

The last issue is managing the tail. When we go to enqueue an element we compare and swap the tail's next pointer to the element (after nulling out the element's next pointer to keep the queue terminated). Following that we compare and swap the tail pointer to the element. Since a race can occur between these two compare and swaps, the second one can fail. We can't simply retry it, since other elements might have been enqueued in the meantime. Instead, every operation on the queue has to check the tail's next pointer before doing anything and, if it is non-null, compare and swap the tail with its next pointer until the tail has reached the actual end of the queue. Another issue that can arise when enqueuing comes from the compare and swap on the next pointer racing against a dequeue of the element that was at the tail. Because we null out the next pointer when we dequeue an element, we could end up enqueueing the new element onto the next pointer of an element that has already been dequeued. To prevent this we require all elements to have a versioned next pointer. We also maintain the low bit of the version to indicate whether the element is in a queue (and thus a valid tail) or out of the queue. This mainly helps in debugging, since the version alone would be sufficient to prevent the race.
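
Here is a rough sketch of that tail-chasing step in isolation. The names and the plain, unversioned pointers are mine; the attached code pairs each pointer with a version as described above:

struct item { item * volatile next_; };

struct queue_sketch
{
    item * volatile tail_;   // never null: the sentinel keeps the queue non-empty

    // Called by every operation before it does anything else.  If an enqueuer
    // was suspended between linking its element (the CAS on tail_->next_) and
    // swinging tail_ forward, any other thread can finish the move, so no
    // thread ever has to wait for the suspended one.
    void chase_tail()
    {
        for (;;)
        {
            item * t    = tail_;
            item * next = t->next_;
            if (!next)
                return;   // tail_ already points at the last linked element
            // A failed CAS just means another thread advanced tail_ first.
            __sync_bool_compare_and_swap(&tail_, t, next);
        }
    }
};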

I've included the code I've been using to test out this approach (it's attached below). I only have a dual-core machine and have not been able to test it on systems with physically separate processors or higher processor counts. If you do try it, please let me know whether it works for you (as in, doesn't crash or assert). The code is under the MIT license, so basically it's free and you can use it for anything as long as you take responsibility for it.