HotSpot lesson

6 replies [Last post]
bhadfield
Offline
Joined: 2004-01-26
Points: 0

I am developing a ZX Spectrum Java emulator (http://www.sonic.net/~surdules/projects/jzx/) and I have recently been trying to optimize its performance. In the process, I discovered something interesting about HotSpot that I did not know before, so I thought I'd share.

A bit of background first. I started the performance optimization work because I noticed another emulator (Jasper, http://www.spectrum.lovely.net/) was faster, so I wanted to catch up (nothing like a little competition, right?). I spent some time reading the Jasper code to see why this might be, and nothing jumped out. In fact, Jasper has a monolithic switch() statement to decode the Z80 opcodes, while my code moves the switch() into a separate method which in turn calls other methods -- a cleaner design (I thought).

It turns out that the monolithic switch() surrounded by a tight while(true) loop is precisely the reason why Jasper is faster. Specifically, the main loop in Jasper is:

void decode() {
    while (true) {
        // decode the op locally
        switch (opcode) {
            ...
        }
    }
}

However, the main loop in JZX is:

void emulate() {
    while (true) {
        // call another method to decode the current op
        Z80.step();
    }
}

When I look at Jasper with -Xprof, I see that decode() is marked as compiled. However, when I look at JZX with -Xprof, I see that emulate() is marked as compiled, but (and here's the key), Z80.step() is marked as interpreted!

This doesn't make any sense. After all, Z80.step() is called a lot, so from what I know about HotSpot, that should make it compiled, right?

Wrong. It turns out that HotSpot _occasionally_ (at some time interval) samples _the top_ of the execution stack. If it sees a method there fairly often, then it compiles that method.

In Jasper's case, since decode() is a long-lived method that makes short-lived function calls, it turns out that decode() is far more likely than any other method to sit at the top of the stack, so it gets compiled and runs fast _even though_ its switch() statement(s) are inefficient.

In JZX's case, the emulate() method gets compiled (for the same reason), but the Z80.step() method returns too quickly to be sampled sufficiently often at the top of the stack, so it never gets compiled!

I made a very small change to JZX, and put the while(true) { ... } _inside_ Z80.emulate() (so around the switch(opcode) { ... }), and the performance doubled instantly (it went from 150fps to almost 300fps).
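To make the two layouts concrete, here is a minimal sketch using a made-up three-opcode machine. It is not JZX or Jasper code: the class name, the opcodes and the accumulator are all hypothetical, and the timing in main() is a crude one-shot measurement rather than a proper benchmark. It only shows that the split layout (loop calls step()) and the monolithic layout (switch inside the loop) compute the same thing, so they can be swapped and compared under -Xprof or a timer.

```java
// Toy illustration only: a hypothetical 3-opcode machine, not the Z80.
final class LoopLayoutSketch {
    static final int INC = 0;
    static final int DEC = 1;
    static final int DBL = 2;

    /** Split layout: one opcode is decoded per call. */
    static int step(int opcode, int acc) {
        switch (opcode) {
            case INC: return acc + 1;
            case DEC: return acc - 1;
            case DBL: return acc * 2;
            default: throw new IllegalArgumentException("bad opcode " + opcode);
        }
    }

    /** Split layout: the loop lives here, the decoding lives in step(). */
    static int runSplit(int[] program, int rounds) {
        int acc = 0;
        for (int r = 0; r < rounds; r++) {
            for (int opcode : program) {
                acc = step(opcode, acc);
            }
        }
        return acc;
    }

    /** Monolithic layout: the switch sits directly inside the loop. */
    static int runMonolithic(int[] program, int rounds) {
        int acc = 0;
        for (int r = 0; r < rounds; r++) {
            for (int opcode : program) {
                switch (opcode) {
                    case INC: acc++; break;
                    case DEC: acc--; break;
                    case DBL: acc *= 2; break;
                    default: throw new IllegalArgumentException("bad opcode " + opcode);
                }
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] program = {INC, DBL, DEC, INC};
        int rounds = 5_000_000; // int overflow wraps identically in both layouts

        long t0 = System.nanoTime();
        int a = runSplit(program, rounds);
        long t1 = System.nanoTime();
        int b = runMonolithic(program, rounds);
        long t2 = System.nanoTime();

        System.out.println("split:      " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("monolithic: " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("results match: " + (a == b));
    }
}
```

On a toy like this, step() is small enough that a JIT may well inline it, so the timings need not show the gap described above; the point is the shape of the refactoring, not the numbers.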

As much as this is interesting and good to know, it is also frustrating and counterintuitive, since it goes against what I thought I knew about HotSpot. It further means that instead of pulling code _out_ of the switch() into separate methods, I want to pull it _in_ so that it gets compiled (since the separate methods probably will not be compiled, only the method containing the switch() will be).

The idea came to me after reading an article on the Sun website describing how HotSpot is fast because (and here's the key) it does _not_ walk the stack backwards to figure out oft-called methods. It only samples the top of the stack. If it did walk the stack backwards, it would realize that emulate() is a long-lived method, and therefore that it (and all its children) should be compiled. However, the HotSpot design essentially prevents that from ever happening, hence the slowness.

This all came at a cost in terms of code readability (the previous code organization with an explicit call to Z80.step() was more readable), but I managed to refactor it to clean it up nicely.

Razvan.

tackline
Offline
Joined: 2003-06-19
Points: 0

As an armchair observer, it looks like the problem is not that the loop is slow, but that step() is not inlined. If the loop were using up cycles itself, then it would be on top of the stack for longer and hence get compiled. Also, I suspect HotSpot may see step() as too long for inlining.

Without step() being inlined into the loop, the method needs to reload (and recheck) the variables it uses. At a guess, a good proportion of the cycles are just going into repeated loads of exactly the same data for every Z80 instruction. Possibly common loads are duplicated for each opcode.

A general problem of Z80 emulation is that the Z80 has a relatively large number of registers for an eight-bit processor, whereas x86 has relatively few for a desktop CPU. I remember a CPC6128 emulator in ARM assembler using a lot of shifting to pack the Z80 registers into a few ARM registers (ARM has (had) a free shift with every arithmetic/logic instruction, and 16ish registers).

bhadfield
Offline
Joined: 2004-01-26
Points: 0

The problem is that step() is not compiled (whether it's inlined or not is orthogonal, in my opinion).

It is likely that HotSpot sees step() as too long for inlining, but certainly not too long for compiling (since when I conflate step() into emulate(), the entire emulate() method is compiled even though it's much longer, and everything reliably runs almost twice as fast).

The issue, really, is that step() is in this weird in-between state. It's a pretty long method, so running it interpreted takes a while; at the same time, it's not long enough to spend enough time at the top of the stack to be noticed by HotSpot and compiled. On the other hand, emulate() (which used to be a MUCH shorter method, but one that takes much longer to run due to the while(true)) is compiled right away.

The fact that HotSpot behaves this way means that it encourages longer, more monolithic methods, that stand a chance to be "noticed" at the top of the stack. This is not necessarily good software engineering practice, so it's a good idea to profile often (with -Xprof) to see what's going on under the covers.

linuxhippy
Offline
Joined: 2004-01-07
Points: 0

If it's not compiled, it's most likely a bug, since HotSpot compiles are triggered by two factors:
* time on top of stack
* invocation count

My program has tons of simple getters/setters which spend almost no time on the stack at all, and all of them are compiled and inlined.
So even the shortest methods are optimized. If you can prove that yours isn't, try to create a very simple test case, try it out with different JVMs (at least 1.4, 1.5 and Mustang) and file a bug report.

It would also be interesting to see how other JVMs behave here, like IBM or BEA JRockit.

Btw., the "-XX:+PrintCompilation" flag may give you hints about what gets compiled and when.
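A small test case along these lines could look like the sketch below. The class and method names are hypothetical stand-ins (step() here is a one-line function, not the real Z80 decoder), so it only reproduces the shape of the situation: a short method called very often from a long-running loop. Running it as `java -XX:+PrintCompilation StepCompileCheck` should list methods as they are compiled, and one can then look for step and emulate in that output. The CompilationMXBean call at the end is a standard java.lang.management API that reports the total time the JIT spent compiling, where the JVM supports it.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical minimal test case: short callee, long-running caller.
final class StepCompileCheck {

    /** Stand-in for Z80.step(): very short, invoked millions of times. */
    static int step(int state) {
        return state * 31 + 7;
    }

    /** Stand-in for emulate(): long-lived loop that calls step(). */
    static int emulate(int iterations) {
        int state = 1;
        for (int i = 0; i < iterations; i++) {
            state = step(state);
        }
        return state;
    }

    public static void main(String[] args) {
        // Print the result so the loop cannot be discarded as dead code.
        System.out.println("result = " + emulate(5_000_000));

        // May be null on a JVM that has no JIT compiler at all.
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit != null && jit.isCompilationTimeMonitoringSupported()) {
            System.out.println(jit.getName() + ": "
                    + jit.getTotalCompilationTime() + " ms spent compiling");
        }
    }
}
```

If step() never shows up in the compilation log even after millions of calls, that would be the kind of evidence worth attaching to a bug report; if it does show up (or is inlined into emulate()), the slowdown in the real emulator probably has another cause.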

Good luck, lg Clemens

bhadfield
Offline
Joined: 2004-01-26
Points: 0

Interesting, so invocation count does make a difference?

This document seems to suggest that _only_ time spent at the top of the stack matters:

http://www.javaperformancetuning.com/news/qotm037.shtml

I will try other JVMs and the -XX flag you suggest below, thanks!

linuxhippy
Offline
Joined: 2004-01-07
Points: 0

If this article says that, it's definitely wrong.
Counting invocations was the basic idea of HotSpot; the time-on-stack criterion was only added for special cases where a lot of work is done in loops (mostly in-main microbenchmarks).

The default compile-threshold is:
client(x86): ~1500
server(x86): ~10000

or in that area. I know there are some differences on SPARC.
You can even set the threshold with the -XX:CompileThreshold flag, but keep in mind that a value below 100 is not a good idea, since not enough profiling data can be collected.
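To see which threshold a given JVM is actually running with, something like the following sketch may help. It assumes a HotSpot-based JVM that exposes com.sun.management.HotSpotDiagnosticMXBean (a HotSpot-specific API, not part of the portable java.lang.management set); on other JVMs it just reports that the flag is unavailable. The class name ThresholdProbe is made up for this example. Note also that on newer JVMs with tiered compilation the effective thresholds may be governed by other flags, so the value printed here is not necessarily the whole story.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: reads a -XX flag's current value on HotSpot.
final class ThresholdProbe {

    /** Returns a one-line description of the flag, or a note that it is unavailable. */
    static String describe(String flag) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return flag + ": not available on this JVM";
            }
            // getVMOption throws IllegalArgumentException for unknown flags.
            return flag + " = " + bean.getVMOption(flag).getValue()
                    + " (origin: " + bean.getVMOption(flag).getOrigin() + ")";
        } catch (RuntimeException e) {
            return flag + ": not available (" + e + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("CompileThreshold"));
    }
}
```

Running it once plainly and once as `java -XX:CompileThreshold=500 ThresholdProbe` should show the value and its origin change, which is a quick way to confirm the flag was picked up before rerunning the emulator with it.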

Good luck, lg Clemens

linuxhippy
Offline
Joined: 2004-01-07
Points: 0

You're right, step() should be compiled & inlined by hotspot.

If you could provide a small test-case and open a bug-report at http://bugs.sun.com I am sure some of the over-stressed hotspot engineers will take a look.

lg Clemens