March 25, 2010
The classical approach to dealing with contention is to add locks to your program. You may have heard that locks are bad. That's not entirely true, but they can lead to issues, especially when it comes to code reuse.
First off, locks are slow and heavyweight, or at least the classical versions are. (A futex, http://en.wikipedia.org/wiki/Futex, can be faster.) Generally, to obtain a lock one must talk to the kernel, which can then ensure uncontended access to the data behind the lock. Atomics do not involve the kernel and tend to be faster, but they suffer from similar issues when used like locks.
Secondly, locks do not compose well. Take for example:
void some_component::some_method()
{
    lock::holder lh(lock_);   // holds lock_ for the rest of the method
    some_virtual_method();    // virtual call made while lock_ is held
}
or in C land:
void func(cont * c)
{
    lock(c->lock_);
    c->function_pointer_call();   // user callback runs while locked
    unlock(c->lock_);
}
If the descendant class (or the function-pointer override) ends up calling code that tries to obtain the same lock that has already been locked, we'll deadlock. This may appear contrived, but it occurs a lot in practice because many libraries are built with an API based on the user inheriting from a class and overriding virtual methods to implement behavior, or on passing in a callback function that is called in response to an event of some kind.
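To make it concrete, here is a minimal sketch of the inheritance case (derived_component is a hypothetical subclass, and lock_ is assumed to be non-recursive):

class derived_component : public some_component
{
protected:
    virtual void some_virtual_method()
    {
        // Calls back into the base's public API, which tries to take
        // lock_ while some_method() above already holds it: deadlock.
        some_method();
    }
};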
Deadlocks can also happen when two resources are used in the opposite order by two different components:
void component_one::method()
{
    resource_one::auto_lock al_one(resource_one_);   // locks one, then two
    resource_two::auto_lock al_two(resource_two_);
    // use resources
}

void component_two::method()
{
    resource_two::auto_lock al_two(resource_two_);   // locks two, then one
    resource_one::auto_lock al_one(resource_one_);
    // use resources
}
If a thread swap occurs just after component_one acquires the lock on resource_one_ and component_two acquires the lock on resource_two_, we will deadlock: each thread now waits forever for the lock the other holds. If we own the code we can fix it, but if the code is off in a library we can't control, we're in trouble. In practice it is becoming more and more likely that you won't have control over a lot of the code being executed in your programs because it will come from other libraries.
In general, when acquiring locks on multiple resources you should use a multiple-locking handler, i.e. lock::handler lh(lock1_, lock2_), that can ensure a consistent locking order.
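A minimal sketch of how such a handler could enforce consistency, assuming a lock type with acquire/release methods (those names, and ordering by address, are assumptions, not the only choice):

class handler   // i.e. what lock::handler above might look like
{
public:
    handler(lock & a, lock & b)
        : first_(&a < &b ? a : b),    // always take the lower-addressed
          second_(&a < &b ? b : a)    // lock first, whatever order the
    {                                 // caller passed them in
        first_.acquire();
        second_.acquire();
    }
    ~handler()
    {
        second_.release();            // release in reverse order
        first_.release();
    }
private:
    lock & first_;
    lock & second_;
};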
Lock-free algorithms (http://en.wikipedia.org/wiki/Lock_free) can also help out, but are often very difficult to come up with. Proving that an algorithm is truly lock-free can require considerable work. I suspect the atomic_stack I present on the Atomics page is lock-free, but I do not have a formal proof for it. In the past I've been fairly sure that an algorithm was lock-free only to discover an exception case later. Many people mistake using atomics for being lock-free and end up with spin locks in their program rather than code that is always guaranteed to be making progress.
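To see why, here is a minimal sketch (using GCC's __sync builtins) of an atomic compare-and-swap used as a lock. The CAS doesn't make it lock-free: a losing thread makes no progress until the winner releases.

void spin_lock(int * flag)
{
    // Try to flip *flag from 0 to 1; spin until we win.
    while (!__sync_bool_compare_and_swap(flag, 0, 1))
        ;   // if the holder is descheduled, every spinner is stuck
}

void spin_unlock(int * flag)
{
    __sync_lock_release(flag);   // writes 0 back with release semantics
}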
Another issue is over-contention. If data can be accessed by all threads (not just updated), then it can be contended by all threads, and the lock status for that data must be visible to them all. So far this hasn't been a big issue since the number of executing threads has been bounded by a low number of cores. As the number of cores we see on the desktop increases, that level of contention is going to be a serious barrier to linearly scaling performance.
Let's say, for instance, you have lots of cores accessing data in a program, guarded by an atomic that arbitrates which part of the data each core is updating. When we start, each core attempts to acquire part of the data; one of them wins and the other cores have to try again. If trying again takes 40 cycles and processing the data takes 120 cycles, how many cores will actually be able to work on the data at any given time? Not a lot: roughly speaking, the atomic hands out at most one win per 40-cycle retry window, so while a winner spends 120 cycles working only about three more cores can acquire work behind it, no matter how many cores the machine has. Worse yet, on some machines it is possible for cores to never win their attempt to modify the atomic because of cacheline shadowing caused by work being done on another core.
The solution to the contention problem, and many of the locking problems, is to avoid them by keeping data local to the thread processing it and acting on thread-local data as much as possible. That does mean taking up a lot more memory, as data won't be shared between threads as much as it would have been. Unfortunately our current programming languages have very poor support for this model. C and C++ don't support decorating a pointer in a manner that allows us to determine whether the memory is globally accessible or accessible only by the current thread. (Let alone a means of specifying a hierarchy of accessibility between threads.)
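Compilers do give us a small piece of this today in the form of thread-local storage. A minimal sketch using the GCC/Clang __thread extension (the counter and function are made-up examples):

__thread unsigned long events_seen_ = 0;   // one independent copy per thread

void on_event()
{
    // A plain increment suffices: no lock, no atomic, no cache-line
    // ping-pong, because no other thread can reach this copy.
    ++events_seen_;
}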
Making the common patterns we've been using for decades thread-aware imposes a considerable burden on our code. For instance, many developers use some sort of interface for allocating memory in non-garbage-collected languages. Initially it starts out innocently enough:
class allocator {
public:
    void * alloc(size_t);
    void free(void *);
};

extern allocator allocator_;

// off in a .cpp file somewhere:
allocator allocator_;
It'll eventually grow some warts, like alignment specification, which is needed on almost all hardware these days.
So now code goes from this:
some_class * c_ = new some_class();
...
delete c_;
to:
#include <new>
...
// allocate raw memory, then construct the object in place
some_class * c_ = reinterpret_cast<some_class*>(allocator_.alloc(sizeof(some_class)));
new (c_) some_class();
...
c_->~some_class();     // run the destructor explicitly,
allocator_.free(c_);   // then hand the raw memory back
Okay, okay, we'll define global operator new and delete instead because we're smarter than that. Um, but wait. There is that alignment thing we mentioned earlier, so it's actually:
some_class * c_ = reinterpret_cast<some_class*>(allocator_.alloc(sizeof(some_class), alignment));
And that's just the start of the parameter creep for alloc. So we really can't define a global operator new and handle all the allocation parameters we want. The C++ standard doesn't provide a way to indicate the alignment of structures or classes, or to obtain it, and many times we'll be allocating memory without an associated class or structure to base the alignment on.
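In practice we end up leaning on compiler extensions. A minimal sketch using GCC's __alignof__ (a non-standard extension; standard C++ has no equivalent here):

some_class * c_ = reinterpret_cast<some_class*>(
    allocator_.alloc(sizeof(some_class), __alignof__(some_class)));

And for raw buffers with no associated type, we're reduced to passing a guessed worst-case alignment explicitly.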
The other reason we can't use a global operator new is that someone will eventually come along and say "globals are bad" and insist on making multiple allocators. Taken to the extreme, this will lead to every object in your system having an allocator * allocator_ in it just so you can free it.
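A minimal sketch of where that extreme ends up (destroy is a hypothetical convention, not part of any real API):

class some_class
{
public:
    explicit some_class(allocator * a) : allocator_(a) {}
    void destroy()
    {
        allocator * a = allocator_;   // grab the owner before destruction
        this->~some_class();
        a->free(this);                // free through the stored allocator
    }
private:
    allocator * allocator_;           // a pointer of overhead per object
};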
Let's step back for a second. Globals are only bad when the thing they are responsible for can exist multiple times. You only have a limited number of memory pools, so it makes sense to have an allocator per pool. Most general programs address a single pool of memory: the heap (or free store in C++ land). So let's assume we resist the temptation and stick with a global allocator.
Now let's move to a multithreaded environment. The traditional approach is to add a lock:
class allocator_mt
{
public:
    void * alloc(size_t);
    void free(void *);
protected:
    allocator allocator_;   // the real, single-threaded allocator
    lock lock_;             // guards every call into it
};
The alloc and free methods grab the lock and proxy through to the underlying allocator.
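A minimal sketch of those proxies, reusing the lock::holder scope-guard style from the first example (an assumption about the lock API):

void * allocator_mt::alloc(size_t size)
{
    lock::holder lh(lock_);         // serialize every allocation
    return allocator_.alloc(size);
}

void allocator_mt::free(void * p)
{
    lock::holder lh(lock_);         // and every free
    allocator_.free(p);
}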
Now every thread in the system contends for the same lock whenever it does an alloc or a free; but we just finished saying that we want to avoid this on systems with a lot of threads. Let's pretend there is a way to make each thread have its own allocator, say something like:
allocator_mt allocator_mt_;                    // the shared, thread-aware allocator
__thread allocator allocator_(allocator_mt_);  // a per-thread, uncontended front end
Now if a function needs memory for something that can't escape from the thread, it can use allocator_ rather than allocator_mt_. For memory that can escape from the thread (i.e. data that is sent to other threads) we use the thread-aware allocator. This is still somewhat unsafe because we can't mark the pointers to indicate which allocator owns the memory.
If we're willing to give up some space per allocation, we can stuff a marker into the front of each allocation that points back to the allocator the allocation should be freed into. This would let us detect cases where an attempt is made to free an allocation to the wrong allocator. We could even take it a step further and have the allocators detect when a free comes from a different thread, adding a void foreign_free(void*) method that frees the allocation safely onto a contended free list.
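A minimal sketch of that marker scheme (tracked_alloc and tracked_free are hypothetical names, and real code would also have to keep the header from breaking the alignment of what follows it):

struct alloc_header
{
    allocator * owner;   // the allocator this memory must be freed into
};

void * tracked_alloc(allocator * a, size_t size)
{
    alloc_header * h = reinterpret_cast<alloc_header*>(
        a->alloc(size + sizeof(alloc_header)));
    h->owner = a;
    return h + 1;        // hand the caller the memory past the header
}

void tracked_free(allocator * current, void * p)
{
    alloc_header * h = reinterpret_cast<alloc_header*>(p) - 1;
    if (h->owner == current)
        current->free(h);           // fast, uncontended same-thread path
    else
        h->owner->foreign_free(h);  // cross-thread free into the contended list
}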
For this to work your application binary interface (ABI) has to support thread-local storage (TLS) that is fairly quick (at least no slower than using locks would be). Most sane ABIs reserve a register or allocate a fixed area at the bottom of the stack for storing information about TLS. Be sure to check what the platform you're using does, as some do really silly things when TLS is used, like taking locks, which completely defeats the point of using TLS in the first place.
At this point we're relying on the system providing a specific ABI and on every allocation carrying tracking data, so we're already halfway to implementing garbage collection. Using a garbage-collected language can significantly improve the performance of concurrent code while also reducing its complexity. Generational garbage collection can batch up the deallocation of memory, reducing the number of times the system enters a contended state. You also don't have to worry about data lifetime as much, so threads executing asynchronously are less likely to refer to data that was freed at the wrong time. (Reference-counting garbage collection, like boost::shared_ptr, should be avoided because updating the reference count across threads introduces lots of contention.)
The more we can push onto the system to manage, the better job it can do at lowering contention, and the more we can focus on solving problems instead of fighting contention issues.