Posted by mlam
on August 9, 2007 at 6:49 PM PDT
A comment in a previous blog asks why CVM keeps some data structures in the C heap instead of the Java heap. Here's the answer.
I'd love to hear the explanations on why specific things are on the Java heap vs. the malloc heap. In particular it seems like there are a lot of things outside the Java heap that need to refer to things inside the Java heap (e.g. jitted code) resulting in a potentially large number of roots when collecting garbage.
For those of you who haven't been following my blogs before, Erik is asking a specific question regarding the memory layout of data structures in the CVM Java virtual machine (aka the phoneME Advanced VM).
Erik, do you mean why specific things are in the C heap instead of the Java heap? Or do you mean why are specific things in the Java heap instead of the C heap? Well, let me answer both ...
Why are some things in the Java heap?
One of the features of the Java platform is automatic garbage collection (GC). One reason for having this is so that we can compact the memory usage and avoid fragmentation. The GC controls the Java heap and its layout. It works in conjunction with allocators to allocate objects in different regions of the heap as appropriate. Periodically, or as needed, the GC will free up memory that is no longer needed, and compact the rest to reduce the use of physical memory pages.
In general, objects that are instantiated in the Java programming language all reside in the Java heap.
Why are some things in the C heap?
... or the malloc heap as you call it (more accurately so). The virtual machine itself (in this case, CVM) is a piece of native code and is primarily written in C. It follows that some of its data structures must necessarily reside in the C heap.
But you're probably asking about VM data structures like the CVMClassBlock (cb), CVMMethodBlock (mb), CVMFieldBlock (fb), and CVMExecEnv (ee). These are just a few of the more prominent examples of VM data structures that have logical equivalents in the Java world, i.e. Class, Method, Field, and Thread. These data structures are called meta-data.
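To make the discussion concrete, here is a heavily simplified sketch of what this kind of meta-data might look like in C. These are illustrative declarations only; the real CVMClassBlock, CVMMethodBlock, and CVMFieldBlock definitions in CVM's sources are much richer than this:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified sketches of CVM-style meta-data.
   Only the general shape is illustrated; the real CVM layouts
   contain many more fields. */

typedef struct {
    const char *name;            /* method name, e.g. "toString" */
    const unsigned char *code;   /* the method's bytecodes */
} CVMMethodBlock;

typedef struct {
    const char *name;            /* field name */
    unsigned offset;             /* byte offset within an instance */
} CVMFieldBlock;

typedef struct {
    const char *className;
    int methodCount;
    int fieldCount;
    CVMMethodBlock *methods;     /* in the C heap, so they never move */
    CVMFieldBlock *fields;
} CVMClassBlock;

/* Look up a method by name in a class block. */
CVMMethodBlock *findMethod(CVMClassBlock *cb, const char *name) {
    for (int i = 0; i < cb->methodCount; i++) {
        if (strcmp(cb->methods[i].name, name) == 0)
            return &cb->methods[i];
    }
    return NULL;
}
```

Because these blocks live in the C heap, a pointer like the one `findMethod` returns remains valid across GC cycles.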
Technically, you can choose to implement a VM that keeps all of this meta-data in the Java heap as well. For example, the SE HotSpot VM allocates its class meta-data (class, methods, fields) from the Java heap. I don't know what they do with threads, though I am quite sure that at least some part of a thread (the native stack) resides in the malloc heap (or mmap'ed memory). The CLDC VM (aka phoneME Feature VM) also allocates its meta-data in the Java heap. For both these VMs, the reason for doing so is to be able to get memory compaction when the meta-data is no longer needed.
CVM chose to allocate these from the C heap because these data structures tend to be accessed a lot during Java code execution. For example, an invocation bytecode specifies a constant pool entry that refers to the method to be invoked by the String that names it. The interpreter quickens this into a direct pointer to the method itself. Similarly, JIT-generated code that needs to access class meta-data is given direct pointers to the data itself. A direct pointer yields higher performance because no levels of indirection are involved. Of course, direct pointers are only possible for objects that don't move. And that's why CVM allocates them from the C heap.
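The quickening idea above can be sketched in a few lines of C. This is a hypothetical illustration, not CVM's actual interpreter code: the first time an invoke executes, the symbolic constant-pool entry is resolved and overwritten with a direct pointer to the method's meta-data. Because that meta-data lives in the non-moving C heap, the pointer stays valid forever:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of quickening a constant-pool entry. */

typedef struct {
    const char *name;
} Method;

typedef struct {
    int resolved;                    /* 0 = still symbolic, 1 = quickened */
    union {
        const char *methodName;      /* symbolic reference (before) */
        Method *method;              /* direct pointer (after) */
    } u;
} ConstantPoolEntry;

/* Pretend resolution: find a method by name in a method table. */
static Method *lookupMethod(Method *table, int count, const char *name) {
    for (int i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

/* Resolve the entry on first use; every later use takes the fast path. */
Method *quicken(ConstantPoolEntry *cp, Method *table, int count) {
    if (!cp->resolved) {                 /* slow path: resolve once */
        cp->u.method = lookupMethod(table, count, cp->u.methodName);
        cp->resolved = 1;                /* entry is now a direct pointer */
    }
    return cp->u.method;                 /* fast path thereafter */
}
```

After quickening, invoking the method is a single pointer load, with no string comparison or table search on the hot path.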
But, but, but ...
... but the SE HotSpot and CLDC VMs both allocate their class meta-data in the Java heap. How do they avoid the cost of the indirection, and get good performance anyway?
They do this by keeping pointer relocation tables recording where all such pointers exist in the meta-data objects. When GC runs and these objects get moved, the pointers that point to them are updated. That includes pointers that reside in the constant pool (due to quickening) and in JIT compiled code. Hence, they get to use direct pointers as well. The only difference is that they incur the footprint cost of the pointer relocation tables, and the GC-time cost of relocating these pointers.
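A pointer relocation table can be sketched roughly as follows. This is an assumption-laden illustration, not HotSpot's or the CLDC VM's actual code: the table records the address of every cell holding a direct pointer into movable memory, and when GC moves a region, every registered pointer into that region is adjusted by the same delta:

```c
#include <stddef.h>

/* Hypothetical pointer relocation table: each entry is the address of
   a pointer-sized cell that may point into movable memory. */
typedef struct {
    char ***slots;   /* addresses of cells holding direct pointers */
    int count;
} RelocationTable;

/* After GC moves the region [from, from+size) to 'to', fix up every
   registered pointer that pointed into the old region. */
void relocate(RelocationTable *t, char *from, char *to, size_t size) {
    ptrdiff_t delta = to - from;
    for (int i = 0; i < t->count; i++) {
        char **slot = t->slots[i];       /* a cell holding a pointer */
        if (*slot >= from && *slot < from + size)
            *slot += delta;              /* redirect to the new copy */
    }
}
```

The footprint cost mentioned above is the table itself (one entry per embedded direct pointer), and the GC-time cost is this fixup loop running on every collection that moves the meta-data.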
Now, Wait a Minute!
Did I just pull a fast one? Didn't I say that CVM allocates the meta-data in the C heap because it wanted the benefit of direct pointers without having to relocate these meta-data objects? Why use the C heap when you can get the same benefit by allocating the meta-data from the Java heap? You would also get the benefit of memory compaction.
The difference is this: for CLDC, we're dealing with extremely small libraries and applications, and therefore an extremely small heap. The number of classes (and therefore the number of pointer relocation tables) for CLDC is far smaller than for CDC, which is what CVM primarily targets. Hence, relocating the meta-data isn't as expensive for the CLDC VM. CLDC is also extremely tight on space; hence, it needs to compact as much as it can.
As for the SE HotSpot VM, Java SE has a lot more classes to deal with than CDC. However, Java SE traditionally targets relatively more capable machines with a lot more memory and computing power, i.e. desktops and servers. Hence, the cost of relocating the meta-data is more tolerable there too.
CVM services the space in between, where the number of classes is much larger than CLDC's, but the target machines are not as capable as Java SE's. Hence, the tradeoff decision was made early on to allocate these data structures in the C heap.
What about Fragmentation?
If it's in the C heap, it can't be compacted. Wouldn't that cause a lot of fragmentation? Yes, it will cause some fragmentation. CVM deals with this by allocating (for the most part) only one large contiguous block per class for the class meta-data. All the method and field meta-data are contained in the same block allocation. This reduces fragmentation due to classes in general.
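The one-big-block-per-class idea can be sketched like this. Names and layout here are illustrative, not CVM's real ones: the class block, its method blocks, and its field blocks are all carved out of a single malloc'ed region, so loading (and later unloading) a class touches the C heap only once:

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical minimal meta-data records for this sketch. */
typedef struct { const char *name; } MethodBlock;
typedef struct { const char *name; } FieldBlock;

typedef struct {
    int methodCount, fieldCount;
    MethodBlock *methods;    /* points into this same allocation */
    FieldBlock  *fields;     /* ditto */
} ClassBlock;

/* Allocate the class block plus all its method and field blocks as
   one contiguous C-heap block; free(cb) releases everything at once. */
ClassBlock *allocClassBlock(int methodCount, int fieldCount) {
    size_t size = sizeof(ClassBlock)
                + methodCount * sizeof(MethodBlock)
                + fieldCount  * sizeof(FieldBlock);
    char *block = malloc(size);          /* one malloc per class */
    if (!block) return NULL;

    ClassBlock *cb = (ClassBlock *)block;
    cb->methodCount = methodCount;
    cb->fieldCount  = fieldCount;
    cb->methods = (MethodBlock *)(block + sizeof(ClassBlock));
    cb->fields  = (FieldBlock *)(block + sizeof(ClassBlock)
                                 + methodCount * sizeof(MethodBlock));
    return cb;
}
```

One allocation per class means one hole per unloaded class, rather than dozens of small holes scattered through the malloc heap.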
What about the JIT and roots?
Erik's comment mentioned the JIT compiled code having references to objects in the Java heap, and that this creates a large number of roots (as in GC roots that need to be scanned) during GC. This is not true.
Erik, when you referred to the JIT compiled code, I presume you meant the generated code and not the references on the Java stack that it operates on. CVM's JIT compiled code doesn't have any such references to the Java heap. Instead, it holds references to the meta-data in the C heap, e.g. the cb's, mb's, and fb's.
On the contrary, allocating this meta-data from the Java heap would require a lot of additional root traversals and reference fixups due to the pointers to the meta-data. Hence, CVM's approach actually results in fewer GC roots to scan.
Or were you asking about allocating the compiled code buffer itself from the Java heap? The CLDC VM does that. As a result, the compiled code can be relocated during GC.
Note that allocating the compiled code buffer from the Java heap doesn't actually reduce the number of roots the GC has to scan. You might be thinking of roots as the roots of a reference tree, and in that sense, it appears that the JIT compiled code can hold a bunch of them. If there are references to objects in compiled code (which there aren't in CVM), then yes, you will increase the number of "roots" pointing into the Java heap from the outside.
However, roots are just object references that the GC knows about. They are no different from any other object reference, e.g. fields inside an object. It doesn't matter much whether they are outside or inside the Java heap; the GC still has to scan them. So, that aspect of it doesn't really make any difference. And again, I remind you that CVM's compiled code doesn't contain any GC roots.
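The point that a root is just another reference cell can be sketched like this. This is a hypothetical illustration (not CVM's GC code): the collector scans an external root list and the reference fields inside a heap object with the very same visitor, so a reference costs the same to scan wherever it lives:

```c
#include <stddef.h>

/* Hypothetical heap object with two reference fields. */
typedef struct Object Object;
struct Object {
    Object *refs[2];
};

/* A visitor applied to every reference cell the GC knows about. */
typedef void (*RefVisitor)(Object **cell, void *data);

/* Scan roots living outside the Java heap (stacks, globals, ...). */
void scanRoots(Object **roots[], int count, RefVisitor visit, void *data) {
    for (int i = 0; i < count; i++)
        visit(roots[i], data);
}

/* Scan the reference fields inside a heap object -- same visitor. */
void scanObject(Object *obj, RefVisitor visit, void *data) {
    for (int i = 0; i < 2; i++)
        visit(&obj->refs[i], data);
}

/* Example visitor: count the non-NULL references encountered. */
void countRefs(Object **cell, void *data) {
    if (*cell)
        (*(int *)data)++;
}
```

Whether a reference cell sits in a root list or in an object's field, the GC does the same unit of work on it, which is why moving references in or out of the Java heap doesn't by itself change the scanning cost.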
Nowadays, CVM is being used in a very diverse range of devices, some of which might even match or outperform the desktops of the past. Hence, in the future, it is possible that we (or members of the phoneME Advanced project on java.net) may choose to modify CVM to allocate class meta-data or compiled code from the Java heap instead of the C heap. The benefits that may motivate this include the ability to compact more of the memory in use, as well as added performance from being able to embed direct object pointers in the compiled code.
Whether those incentives will prove compelling enough to motivate the work, only time will tell.