Posted by mlam
on June 16, 2008 at 10:29 PM PDT
Why does the CVM GC stop the world for object allocations? The answer: for performance. Here's how it works ...
In a previous comment, Jamsheed asked:
"In CDC we have garbage collection invocation for fast lock contention case (From my understanding this is done for rolling the object allocation unsafe thread to gcsafe). My question is why should we invoke a gc call for reaching safe point while this can be achieved by simply making try heap lock a blocking lock in gc safe window(with slight modification to gc safe window). Or by polling try heaplock with safe point after every iteration."
Jamsheed, I presume that you are referring to the piece of allocation code that requests that all threads reach a GC safe state. You're probably thinking that this is a rather slow operation, and that there are cheaper alternatives. So, why do it this way?
Here's why ...
Some background ...
For those who aren't clued in to what we're talking about here, the piece of code in question can be found in a few places. One of these is in gc_common.c, in the function that allocates memory for new objects. There, you'll see a test for a microlock. If the microlock cannot be acquired (i.e. it is currently held by another thread), then the allocation code will request that all threads reach a GC safe point. In GC discussions, this is commonly referred to as "stopping the world".
In a lot of common GC algorithms (as is the case with the current CVM GC), the GC needs to scan all object pointers in the entire VM in order to determine which objects are still alive and cannot be collected. In order to do this, it needs to ensure that the running threads won't be doing any work that moves these pointers around in a way that the GC won't know about. These threads are called mutator threads because they mutate (i.e. change) the state of object pointers. This is an over-simplified description of what the mutation is, but it's enough to illustrate the point.
A GC safe state is a state in which a thread agrees NOT to mutate any object pointers. Hence, when we need to GC, we must first ask all threads to reach a GC safe point. When the threads reach their respective GC safe points, they are said to have entered a GC safe state. And by definition, they won't be mutating object pointers (at least, not in any way that gets in the way of the GC). So, effectively, the GC has "stopped the world" ... at least, stopped it from doing any more mutation until further notice. The threads can still run and do work ... just not any work that mutates object pointers. If a thread needs to do any mutation, it will block until the GC gives it permission to proceed.
So, what has this got to do with object allocation?
The fastest way to allocate from a region of contiguous free heap memory is to simply bump a top-of-heap pointer. Basically, the GC/heap keeps a top-of-heap pointer, which points just past the highest allocated address in the heap. When a new allocation is needed, the allocator simply bumps the pointer up by the size of the needed memory. The previous value of the pointer is the address of the newly allocated memory.
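To make that concrete, here's a minimal sketch of bump-pointer allocation. This is purely illustrative (the names and the fixed-size heap are mine, not CVM's; the real code in gc_common.c handles alignment, heap exhaustion, and GC triggering quite differently):

```c
#include <stddef.h>

/* Illustrative only: a fixed region of contiguous "heap" memory. */
static char  heap[1 << 20];
static char *heap_top = heap;              /* first free byte          */
static char *heap_end = heap + sizeof(heap);

/* Allocate 'size' bytes by bumping the top-of-heap pointer. */
static void *bump_alloc(size_t size)
{
    size = (size + 7) & ~(size_t)7;        /* keep 8-byte alignment    */
    if (heap_top + size > heap_end)
        return NULL;                       /* full; a real VM would GC */
    void *result = heap_top;               /* old top = new object     */
    heap_top += size;                      /* the "bump"               */
    return result;
}
```

Note that there is no locking here at all, which is exactly why it's so fast ... and exactly why it breaks down with multiple threads.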
However, that only works if there's a single thread doing all the allocation. If there can be more than one thread, then we need to make sure that those threads don't try to bump the pointer at the same time. In order to do this, CVM uses a spinlock microlock (on most target platforms). The spinlock is implemented using a single atomic swap instruction. The atomic swap is used to check a flag and, at the same time, mark the flag as being locked.
Most of the time, different threads aren't trying to allocate at the same time. Hence, a thread that wants the microlock flag will almost always succeed in acquiring it. That thread then quickly bumps the top-of-heap pointer to do its allocation and, thereafter, releases the microlock flag.
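In C11 terms, the atomic-swap try-lock looks roughly like this (again a sketch with names of my own invention; CVM predates C11 and uses platform-specific atomics, but the idea is the same):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative microlock flag.  ATOMIC_FLAG_INIT = unlocked. */
static atomic_flag alloc_microlock = ATOMIC_FLAG_INIT;

/* One atomic swap both reads the old value of the flag and marks it
   locked.  We own the lock only if the flag was previously clear. */
static bool microlock_try_acquire(void)
{
    return !atomic_flag_test_and_set(&alloc_microlock);
}

static void microlock_release(void)
{
    atomic_flag_clear(&alloc_microlock);
}
```

The key property is that the try-acquire never blocks: on failure it simply reports false, and the caller has to decide what to do next. That decision is the slow path discussed below.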
OK, that sounds nice ... but what happens in the case when the threads do try to allocate at the same time (infrequent as it may be)?
The Slow Path
When more than one thread is contending to do memory allocation at the same time, the second thread, T2, that tried to acquire the microlock flag will need to block until the first thread, T1, releases the flag. Note: the microlock flag is just a flag field. It is not a mutex. Hence, there is no blocking functionality associated with it. And we need T2 to block until T1 is done with its allocation, at which point we wish T2 to be woken up and to take control of the heap to do its allocation.
One way to do this is to have T2 request a "stop the world" on all threads. After initiating this request, T2 will block waiting for all other threads to reach their GC safe points. Meanwhile, T1 is doing its memory allocation in a GC unsafe state. After its allocation is done, T1 sees the "stop the world" request and responds by entering a GC safe state. Eventually, all threads would have entered their respective GC safe states, and this wakes T2 up. At this point, T2 can safely proceed with its memory allocation without having to worry about contention from other threads because ... they are all "stopped".
To recap: the fast case for memory allocation can only occur while a thread is in a GC unsafe state. If T2 requested a "stop the world" that puts all threads into a GC safe state, then it is guaranteed that no one else will be trying to do a fast allocation at the same time. And since T2 is the one who successfully requested a "stop the world", no other threads can request a "stop the world" at the same time. This guarantees that T2 will be the only one who can do the slow path of the memory allocation.
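Putting the pieces together, the overall control flow is roughly this. Everything here is a self-contained sketch under my own (hypothetical) names; in particular, stop_the_world()/resume_the_world() are stubbed out for a single-threaded demonstration, whereas CVM's real safe-point machinery is far more involved:

```c
#include <stdatomic.h>
#include <stddef.h>

static char  heap[4096];
static char *heap_top = heap;
static atomic_flag microlock = ATOMIC_FLAG_INIT;

/* Stubs: the real versions block until every thread has reached its
   GC safe point, and release them again, respectively. */
static void stop_the_world(void)   { }
static void resume_the_world(void) { }

static void *bump_alloc(size_t size)
{
    size = (size + 7) & ~(size_t)7;
    if (heap_top + size > heap + sizeof(heap))
        return NULL;
    void *obj = heap_top;
    heap_top += size;
    return obj;
}

void *allocate(size_t size)
{
    /* Fast path: one atomic swap.  Succeeds in the common case where
       no other thread holds the microlock. */
    if (!atomic_flag_test_and_set(&microlock)) {
        void *obj = bump_alloc(size);
        atomic_flag_clear(&microlock);
        if (obj != NULL)
            return obj;
        /* fall through: heap exhausted, we'd need a GC anyway */
    }
    /* Slow path: contention (or exhaustion).  Once the world is
       stopped, no other thread can be bumping the pointer, so the
       allocation below is uncontended by construction. */
    stop_the_world();
    void *obj = bump_alloc(size);
    resume_the_world();
    return obj;
}
```

Notice that the slow path doesn't need the microlock at all: stopping the world is itself the mutual exclusion.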
Hence, the "stop the world" request is used in this case as a mechanism to synchronize threads to handle contention for heap resources during memory allocation.
So, back to Jamsheed's question ...
Why not just use a mutex?
Because a mutex is much more heavyweight than testing the microlock spinlock flag. This may not be evident if you are just looking at the allocation C code in gc_common.c. However, there is an assembly version of the fast path allocation code that is used by JIT compiled code. Having this fast path allows JIT compiled code to keep executing in compiled code. Transitioning out to C code to do the allocation would be expensive, as it forces a lot of overhead code to be executed. JIT compiled code can test the spinlock flag easily for the fast case, but it cannot lock a system mutex (which is target platform implementation specific) without transitioning out to C code.
Hence, we use the spinlock flag instead of a real mutex in order to get better allocation performance from JIT compiled code. And as I've pointed out above, most of the time there will be no contention, and we can continue to use the fast path. In the rarer case when contention occurs, the JIT code will transition out to C code to run the slow path, which uses a "stop the world" request to synchronize all threads.
In Summary ...
The "stop the world" mechanism may be heavier-weight than a mutex, but it is only used in the allocation slow path, which happens infrequently. Meanwhile, the "stop the world" mechanism allows us to use the spinlock flag as a simple contention check, which in turn allows us to do fast allocation most of the time. In the overall scheme of things, this approach yields better performance than using a mutex directly, which would incur slower execution (compared to the spinlock flag check) in the majority of allocation cases.
I hope that helps. =)