March 25, 2010
The classical approach to dealing with contention is to add locks to your program. You may have heard that locks are bad. That's not entirely true, but they can lead to issues, especially when it comes to code reuse.
First off, locks are slow and heavyweight, or at least the classical versions are. (A http://en.wikipedia.org/wiki/Futex can be faster.) Generally, to obtain a lock one must talk to the kernel, which ensures uncontended access to the data behind the lock. Atomics do not involve the kernel and tend to be faster, but they suffer from some of the same issues when used like locks.
Secondly locks do not compose well. Take for example:
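Something like this (a sketch; the class and method names are made up, and an atomic flag stands in for a real mutex so the stuck acquisition can be detected instead of hanging):

```cpp
#include <atomic>
#include <cassert>

// A non-recursive lock modeled with an atomic<bool>. A real std::mutex
// would simply block forever in the re-entrant case below.
class NonRecursiveLock {
    std::atomic<bool> held_{false};
public:
    bool tryAcquire() { return !held_.exchange(true); }
    void release() { held_.store(false); }
};

class Container {
public:
    virtual ~Container() {}
    void update() {
        while (!lock_.tryAcquire()) {}   // spin until we own the lock
        onUpdate();                      // virtual call made WITH the lock held
        lock_.release();
    }
    bool clear() {                       // another public, locking entry point
        if (!lock_.tryAcquire()) return false;  // a real mutex blocks forever here
        lock_.release();
        return true;
    }
protected:
    virtual void onUpdate() {}
private:
    NonRecursiveLock lock_;
};

// A user subclass whose override innocently calls back into the library.
class Clearing : public Container {
public:
    bool reentrantSucceeded = true;
protected:
    void onUpdate() override { reentrantSucceeded = clear(); }
};
```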
or in C land:
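A C-flavored sketch of the same trap (the library names are invented; an int flag stands in for the pthread mutex so the would-be hang is observable):

```cpp
#include <cassert>

/* The library holds its lock while running a user callback. A flag
 * stands in for the mutex so the would-be deadlock is visible. */
static int lock_held = 0;
static int would_deadlock = 0;

static void library_flush(void) {
    /* re-entrant call: wants the lock the caller already holds */
    if (lock_held) { would_deadlock = 1; return; }  /* a real mutex blocks here */
    lock_held = 1;
    /* ... flush ... */
    lock_held = 0;
}

static void library_process(void (*on_event)(void)) {
    lock_held = 1;          /* pthread_mutex_lock(&lock)   */
    on_event();             /* user callback runs under the lock */
    lock_held = 0;          /* pthread_mutex_unlock(&lock) */
}

static void my_handler(void) {
    library_flush();        /* innocently calls back into the library */
}
```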
If the descendant class (or the function pointer call override) ends up calling code that tries to obtain the same lock that has already been locked we'll deadlock. This may appear contrived, but does occur a lot in practice because many libraries are built with an API based on the user inheriting from a class and overriding virtual methods to implement behavior, or on passing in a callback function that is called to respond to an event of some kind.
Deadlocks can also happen when two resources are used in the opposite order by two different components:
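For example (the component names are illustrative, not from any real API):

```cpp
#include <cassert>
#include <mutex>

std::mutex disk_lock, network_lock;
int completed = 0;

void uploader() {                 // component 1: disk first, then network
    std::lock_guard<std::mutex> d(disk_lock);
    std::lock_guard<std::mutex> n(network_lock);
    ++completed;                  // read from disk, send over the wire
}

void downloader() {               // component 2: network first, then disk
    std::lock_guard<std::mutex> n(network_lock);
    std::lock_guard<std::mutex> d(disk_lock);
    ++completed;                  // receive from the wire, write to disk
}
// Run these two on separate threads and a badly timed swap leaves each
// holding the lock the other needs. (Called serially, as in the test,
// they complete fine -- which is why this bug hides so well.)
```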
If a thread swap occurs just after each component acquires its first lock, both threads end up waiting forever on the lock the other one holds.
In general, when acquiring locks on multiple resources you should use multiple-locking handlers, i.e. primitives that acquire the whole set of locks as a single operation, or else always acquire them in one canonical order.
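For instance, what later standardized as C++11's std::lock acquires any number of mutexes with a built-in deadlock-avoidance algorithm (a sketch; this API postdates this post):

```cpp
#include <cassert>
#include <mutex>

std::mutex res_a, res_b;
int transfers = 0;

// std::lock may acquire in any internal order but never deadlocks, so
// two components can name the locks in opposite orders safely.
void transfer() {
    std::lock(res_a, res_b);
    std::lock_guard<std::mutex> ga(res_a, std::adopt_lock);  // adopt ownership
    std::lock_guard<std::mutex> gb(res_b, std::adopt_lock);  // for RAII release
    ++transfers;                   // ... use both resources ...
}
```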
Lock-free algorithms (http://en.wikipedia.org/wiki/Lock_free) can also help out, but are often very difficult to come up with, and proving that an algorithm is truly lock-free can require considerable work.
Another issue is over-contention. If data can be accessed by all threads (not just updated), then it can be contended by all threads, and the lock status for that data must be visible to them all. So far this hasn't been a big issue since the number of executing threads has been bounded by a low number of cores. As the number of cores we see on the desktop increases, this level of contention is going to be a serious barrier to linearly scaling performance.
Let's say for instance you have lots of cores accessing data in a program, guarded by an atomic that arbitrates which part of the data each core is updating. When we start, each core attempts to acquire part of the data; one of them wins and the other cores have to try again. If trying again takes 40 cycles and processing the data takes 120 cycles, how many cores will actually be able to work on the data at any given time? Not a lot. Worse yet, on some machines it is possible for a core to never win its attempt to modify the atomic because of cacheline shadowing caused by work being done in another core.
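The acquire-and-retry pattern being described looks roughly like this (a sketch, not the post's code): cores claim chunks of the data by advancing a shared atomic index, and every failed compare-exchange is one of those ~40-cycle retries.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> next_chunk{0};

int claim_chunk(int total_chunks) {
    int cur = next_chunk.load();
    while (cur < total_chunks) {
        // On failure another core won; 'cur' is reloaded and we go again.
        // Under heavy contention a core can lose this race indefinitely.
        if (next_chunk.compare_exchange_weak(cur, cur + 1)) return cur;
    }
    return -1;  // nothing left to claim
}
```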
The solution to the contention problem, and many of the locking problems, is to avoid them: make the data local to the thread processing it and act on thread-local data as much as possible. That does mean taking up more memory, as data won't be shared between threads as much as it would have been. Unfortunately our current programming languages have very poor support for this model. C and C++ don't support decorating a pointer in a manner that lets us determine whether the memory is globally accessible or accessible only by the current thread. (Let alone a means of specifying a hierarchy of accessibility between threads.)
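As a sketch of the idea (the function and names are mine, not the post's): each worker below accumulates into purely local state and touches shared state exactly once, at the end, instead of contending on every element.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

long long parallel_sum(const std::vector<int>& data, int nthreads) {
    std::atomic<long long> total{0};
    std::vector<std::thread> workers;
    std::size_t chunk = data.size() / nthreads + 1;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(begin + chunk, data.size());
            long long local = 0;              // thread-local: never contended
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            total.fetch_add(local);           // one contended touch per thread
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}
```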
Making common patterns that we've been using for decades thread aware imposes a considerable burden on our code. For instance, many developers use some sort of interface for allocating memory in non-garbage-collected languages. Initially it starts out innocently enough:
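Something along these lines (a hypothetical reconstruction of the innocent starting point):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

class Allocator {
public:
    virtual ~Allocator() {}
    virtual void* alloc(std::size_t size) = 0;
    virtual void  free(void* ptr) = 0;
};

// Simplest possible implementation: straight to the heap.
class HeapAllocator : public Allocator {
public:
    void* alloc(std::size_t size) override { return std::malloc(size); }
    void  free(void* ptr) override { std::free(ptr); }
};
```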
It'll eventually grow some warts, like alignment specification, which is needed on almost all hardware these days.
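After the wart grows in, the interface looks more like this (a sketch; the implementation uses C++17's std::aligned_alloc, which postdates this post):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

class Allocator {
public:
    virtual ~Allocator() {}
    virtual void* alloc(std::size_t size, std::size_t align) = 0;
    virtual void  free(void* ptr) = 0;
};

class HeapAllocator : public Allocator {
public:
    void* alloc(std::size_t size, std::size_t align) override {
        // aligned_alloc requires the size to be a multiple of the alignment
        std::size_t rounded = (size + align - 1) / align * align;
        return std::aligned_alloc(align, rounded);
    }
    void free(void* ptr) override { std::free(ptr); }
};
```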
So now code goes from this:
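Roughly, every call site goes from the first form to the second (Foo and the helper here are invented for illustration; the aligned allocation uses C++17's std::aligned_alloc):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

struct Foo { int value; explicit Foo(int v) : value(v) {} };

// Stand-in for whatever allocator interface the program grew.
void* alloc_aligned(std::size_t size, std::size_t align) {
    return std::aligned_alloc(align, (size + align - 1) / align * align);
}

Foo* make_foo_simple() {
    return new Foo(42);                       // what the code used to say
}

Foo* make_foo_verbose() {                     // what it says now
    void* mem = alloc_aligned(sizeof(Foo), alignof(Foo));
    return new (mem) Foo(42);                 // placement new into our memory
}

void destroy_foo_verbose(Foo* f) {            // and tearing down is two steps
    f->~Foo();
    std::free(f);
}
```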
Okay, okay, we'll define global operator new and delete instead because we're smarter than that. Um, but wait. There is that alignment thing we mentioned earlier, so it's actually:
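A sketch of what that ends up looking like. (C++17 eventually spelled the alignment parameter std::align_val_t; this post predates that, so treat the spelling as a stand-in for whatever convention you'd roll by hand.)

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

static int replaced_calls = 0;   // instrumentation: proves the hooks run

void* operator new(std::size_t size) {
    ++replaced_calls;
    if (void* p = std::malloc(size ? size : 1)) return p;
    throw std::bad_alloc{};
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// ...and the alignment-aware pair, the "actually" part:
void* operator new(std::size_t size, std::align_val_t align) {
    ++replaced_calls;
    std::size_t a = static_cast<std::size_t>(align);
    if (void* p = std::aligned_alloc(a, (size + a - 1) / a * a)) return p;
    throw std::bad_alloc{};
}
void operator delete(void* p, std::align_val_t) noexcept { std::free(p); }
```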
And that's just the start of the parameter creep for operator new.
The other reason we can't use global new is that someone will eventually come along and say "globals are bad" and insist on making multiple allocators. Taken to the extreme, this will lead to every object in your system having an allocator handed to it.
Let's step back for a second. Globals are only bad when the thing they are responsible for can exist multiple times. You only have a limited number of memory pools, so it makes sense to have an allocator per pool. Most general programs address a single pool of memory: the heap (or free store in C++ land). So let's assume we resist the temptation and stick with a global allocator.
Now let's move to a multithreaded environment. The traditional approach is to add a lock:
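A sketch of the traditional fix: bolt a mutex onto the allocator, so every caller serializes through it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>

class LockedAllocator {
    std::mutex lock_;
public:
    void* alloc(std::size_t size) {
        std::lock_guard<std::mutex> guard(lock_);  // every thread queues here
        return std::malloc(size);
    }
    void free(void* ptr) {
        std::lock_guard<std::mutex> guard(lock_);  // ...and here
        std::free(ptr);
    }
};
```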
Now every thread in the system contends on the same lock whenever it does an allocation or a free.
Now if a function needs memory for something that can't escape from the thread, it can use a per-thread allocator and skip the contended lock entirely.
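A sketch of per-thread allocators using C++11's thread_local (which postdates this post; the ABI discussion below is what determines whether this is actually cheap):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <thread>

// Hypothetical per-thread allocator: no lock at all, because only the
// owning thread ever touches it.
struct LocalAllocator {
    std::size_t count = 0;              // allocations made by this thread
    void* alloc(std::size_t n) { ++count; return std::malloc(n); }
    void  free(void* p) { std::free(p); }
};

thread_local LocalAllocator tl_alloc;   // one independent instance per thread

std::size_t do_work(int n) {            // returns this thread's own tally
    for (int i = 0; i < n; ++i) tl_alloc.free(tl_alloc.alloc(16));
    return tl_alloc.count;
}
```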
If we're willing to give up some space per allocation, we can stuff a marker into the front of each allocation that points back to the allocator the allocation should be freed into. This would let us detect cases where an attempt is made to free an allocation to the wrong allocator. We could even take it a step further and have the allocators detect when an allocation is being freed from a different thread and add it to a queue for the owning allocator to reclaim.
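A sketch of the owner-tagging scheme (all names invented): each block carries one extra pointer, and a mismatched free becomes detectable instead of silently corrupting the wrong allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

struct LocalAllocator { /* hypothetical per-thread allocator */ };

// Header prefixed onto every block: one pointer of overhead per allocation.
struct Header { LocalAllocator* owner; };

void* tagged_alloc(LocalAllocator* owner, std::size_t size) {
    Header* h = static_cast<Header*>(std::malloc(sizeof(Header) + size));
    h->owner = owner;
    return h + 1;                        // user memory starts after the header
}

// Returns false when the block belongs to some other allocator -- the
// case where a real system would queue it back to its owner.
bool tagged_free(LocalAllocator* caller, void* p) {
    Header* h = static_cast<Header*>(p) - 1;
    bool owned = (h->owner == caller);
    std::free(h);
    return owned;
}
```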
For this to work your application binary interface (ABI) has to support thread-local storage (TLS) that is fairly quick (at least not slower than using locks would be). Most sane ABIs reserve a register or allocate a fixed area at the bottom of the stack for storing information about TLS. Be sure to check what the platform you're using does, as some do really silly things when TLS is used, like taking locks, which completely defeats the point of using TLS in the first place.
At this point we're relying on the system providing a specific ABI and on every allocation containing tracking data, so we're already halfway to implementing garbage collection. Using a garbage-collected language can significantly improve the performance of concurrent code while also reducing its complexity. Generational garbage collection can batch up the de-allocation of memory, reducing the number of times the system enters a contended state. You also don't have to worry about data lifetime as much, so threads executing asynchronously are less likely to refer to data that was freed at the wrong time. (Reference-counting garbage collection, like shared_ptr in C++, doesn't buy you this batching: every count update is itself a small contended write.)
The more we can push onto the system to manage the better job it can do at lowering contention and the more we can focus on solving problems instead of contention issues.