
Improving performance for the first few execution passes?

ted_graham
Joined: 2007-11-14

I have a Java 6 app that does financial trading. I'm trying to improve the performance for the first few executions. I'm using -XX:CompileThreshold to try and force early compilation. With CompileThreshold set to 1, 3, 20 and 100 it appears that the compilation speedup comes after 9-10 executions. Interestingly, PrintCompilation does not show any relevant compilation happening around the 10th execution, but times consistently drop by a factor of 3 after 10-15 executions.

Running with CompileThreshold=10000 or not setting that flag does not give the big performance improvement after 10-15 executions.

Startup time and compilation time don't matter, any suggestions for reducing the number of executions before the performance improves?
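For reference, a minimal, self-contained probe in the same spirit (the workload is a hypothetical stand-in for one trading execution, not the actual application code) that prints per-pass timings, so the drop after N passes can be observed under different -XX:CompileThreshold settings:

```java
// Run with e.g. java -XX:CompileThreshold=100 -XX:+PrintCompilation WarmupProbe
// and watch how many passes it takes before the per-pass time drops.
public class WarmupProbe {

    // Stand-in for one trading "execution"; any non-trivial work will do.
    public static double pass(int n) {
        double acc = 0.0;
        for (int i = 1; i <= n; i++) {
            acc += Math.log(i) * Math.sqrt(i);
        }
        return acc;
    }

    public static void main(String[] args) {
        for (int p = 1; p <= 20; p++) {
            long t0 = System.nanoTime();
            double r = pass(200000);
            long micros = (System.nanoTime() - t0) / 1000;
            System.out.println("pass " + p + ": " + micros
                    + " us (result " + r + ")");
        }
    }
}
```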

doubletrouble
Joined: 2007-11-15

briand wrote:

>The longer we run in interpreted mode,
>the more profiling information we can get, the better our optimization strategy can be.
>So, setting this low will limit the number of optimization techniques that can be applied.

Uhh, I must be missing something here: are you saying that you only do a single profiling and compilation phase? And that once that is over no further improvements will be made?

I would have thought that what you would do, especially for the server jvm, is continuous profiling and improvements!

In particular, I thought that hotspot does something sophisticated like the following:

1) start out by interpreting everything

2) determine what methods are hotspots and quickly compile them to native code

3) periodically re-examine which methods are hotspots; if a hot method is currently interpreted, quickly compile it, but if it is already compiled, see whether more sophisticated optimizations can be applied (e.g. branch prediction, caching, inlining of called methods, loop unrolling, array bounds check removal), especially now that even more profiling information is available.

You seem to indicate that step 3) above is simply not done; that once a method has been compiled, no further optimizations will be done on it. Is this correct? And if so, why are you avoiding step 3)? Is the complexity too great at the moment?

If you did implement step 3), then we programmers would not have to worry about the setting of CompileThreshold because even if a method is compiled early, it would still get recompiled later on based on better profiling information (assuming that it was still a hotspot).

linuxhippy
Joined: 2004-01-07

> Uhh, I must be missing something here: are you saying
> that you only do a single profiling and compilation
> phase? And that once that is over no further
> improvements will be made?
Well, as long as it runs in interpreted mode it is being profiled, and those profile results are used later for compilation, applying the most aggressive optimizations the profile allows.
If some optimization's assumptions later break, the compiled code is thrown away and the method starts over in interpreted mode, going through the profiled-interpreted and compiled stages again.

As far as I know, the tiered compilers currently also profile in client-compiled code, but once code has been compiled by the server compiler, no profiling takes place.

> I would have thought that what you would do,
> especially for the server jvm, is continuous
> profiling and improvements!
You most likely would not like the overhead that constant profiling would add to your optimized code.

> 1) start out by interpreting everything
right.

> 2) determine what methods are hotspots and quickly
> compile them to native code
right.

> 3) periodically re-examine which methods are
> hotspots; if a hot method is currently interpreted,
> quickly compile it, but if it is already compiled,
> see whether more sophisticated optimizations can be
> applied (e.g. branch prediction, caching, inlining
> of called methods, loop unrolling, array bounds
> check removal), especially now that even more
> profiling information is available.
>
> You seem to indicate that step 3) above is simply not
> done; that once a method has been compiled, no
> further optimizations will be done on it. Is this
> correct? And if so, why are you avoiding step 3)?
> Is the complexity too great at the moment?
As I said above, profiling is expensive - the generated machine code has to be extended to update the profile counters - which is something you don't want in your hot spots.
When a method is compiled, it is compiled quite aggressively, so the much more common case is over-optimization rather than under-optimization: the code is "deoptimized" (replaced with interpreted execution) and later recompiled less aggressively.
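A small sketch of that deoptimization cycle (hypothetical code; whether and exactly when the VM deoptimizes depends on the VM version and flags): the rare branch is never taken during warmup, so the compiler may speculate it away; the first time it is taken, the compiled version is discarded. Running with -XX:+PrintCompilation, a "made not entrant" line for the method is the visible sign of the deoptimization.

```java
// Run with java -XX:+PrintCompilation DeoptDemo and look for
// DeoptDemo::work being compiled and later "made not entrant".
public class DeoptDemo {

    public static int work(int x, boolean rare) {
        if (rare) {
            return x * 31 + 7;   // never taken while the profile is gathered
        }
        return x * 31;
    }

    public static void main(String[] args) {
        long acc = 0;
        // Warm up: the profile says 'rare' is always false.
        for (int i = 0; i < 20000; i++) {
            acc += work(i, false);
        }
        // First time through the untaken branch: may trigger a deopt.
        acc += work(1, true);
        System.out.println(acc);
    }
}
```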

> If you did implement step 3), then we programmers
> would not have to worry about the setting of
> CompileThreshold because even if a method is compiled
> early, it would still get recompiled later on based
> on better profiling information (assuming that it was
> still a hotspot).
Well, that's exactly what tiered compilation is for: provide good-enough optimization of most of the code quickly, and then try to squeeze out what's possible for the hot spots.

Regards, Clemens

briand
Joined: 2005-07-11

Forcing JIT compilation to occur earlier may actually significantly reduce the
performance later in your process's lifetime. The longer we run in interpreted
mode, the more profiling information we can get, the better our optimization
strategy can be. So, setting this low will limit the number of optimization techniques
that can be applied.

I suspect that you see the better performance from larger settings of CompileThreshold
because by waiting longer we have better information, allowing us to do more
aggressive optimizations.

You might want to take a look at TieredCompilation in JDK 7 and see if that helps
your application. Just run with -XX:+TieredCompilation. The tiered compilation system
has multiple levels of compile thresholds, so tuning is a bit more complicated. However,
you should just try with the defaults to see if it's a win out of the box.

Other than that, you could try to send some dummy transactions through the system
to force JIT compilations sooner; since you are not worried about startup time, performing
such warmup operations at startup might be all you need.
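A minimal sketch of such a warmup phase (all names here are hypothetical, not from the poster's application): push dummy transactions through the hot path more times than any plausible compile threshold before accepting real traffic.

```java
public class Warmup {

    // Hypothetical hot-path interface standing in for the real trading code.
    public interface TradeHandler {
        void handle(long quantity, double price);
    }

    // Run well past the default server CompileThreshold (10000) so the
    // handler's hot path is JIT-compiled before real transactions arrive.
    public static void warmUp(TradeHandler handler, int iterations) {
        for (int i = 0; i < iterations; i++) {
            handler.handle(100, 101.25);
        }
    }

    public static void main(String[] args) {
        final long[] count = {0};
        TradeHandler handler = new TradeHandler() {
            public void handle(long quantity, double price) {
                count[0]++;
            }
        };
        warmUp(handler, 20000);
        System.out.println("warmup passes: " + count[0]);
    }
}
```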

Another alternative is to try using -Xbatch. This can sometimes result in compiled
methods getting executed sooner. The default background compilation will only
use the compiled method on the next invocation of the method; if that doesn't happen
for a while, you'll either be executing in the interpreter or in an OSR'ed (On Stack
Replaced) block of the method. In -Xbatch mode, where JIT compiles happen in
the 'foreground', the thread waits for the JIT compile to complete and immediately
switches over to the compiled code.

ted_graham
Joined: 2007-11-14

Thanks for the response. I tried -Xbatch, but it still takes ~10 passes to get fast, and the first 10 are slower than with background compilation.

What I was really curious about was how the CompileThreshold flag is treated, and why I don't see differences with 1, 3, 20, 100. Are values under 100 all set to some CompileASAP flag?

I'm interested in trying out TieredCompilation. Is JDK 7 ready for use?

As for dummy transactions to warm things up, I may have to, but I've been hoping to avoid it.

How about using the Java Real Time System? I understand it allows for pre-run compilation to avoid JIT delays.

briand
Joined: 2005-07-11

> Thanks for the response. I tried -Xbatch, but it still takes ~10 passes to
> get fast, and the first 10 are slower than with background compilation.

-Xbatch isn't always a win (which is why it's not the default), but it's usually worth a try.

>
> What I was really curious about was how the CompileThreshold flag is treated,
> and why I don't see differences with 1, 3, 20, 100. Are values under 100
> all set to some CompileASAP flag?

No, not that I'm aware of. As I mentioned earlier, I think you are seeing the effect
of better profiling information driving better optimization of the code. At low compiler
threshold numbers, the statistics don't support more aggressive optimizations.

>
> I'm interested in trying out TieredCompilation. Is JDK 7 ready for use?

It's ready for you to try it in a development environment so you can measure
the effect of the various enhancements in that platform. However, it's still a work
in progress.

>
> As for dummy transactions to warm things up, I may have to, but I've been
> hoping to avoid it.

Understood - it's not the most attractive option, but it is a generic, portable
solution. You can even try tricks like loading the warmup code in a different
class loader and then when the warmup is finished, drop any references to
the classes and class loader and any subsequent full collection will collect
the warmup classes and any corresponding native code created by the JIT
for those warmup classes.
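A sketch of that classloader trick; the warmup classpath and driver class name below are hypothetical placeholders, and the driver is assumed to implement Runnable and exercise the application's hot paths.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class ThrowawayWarmup {

    public static void runWarmup(URL[] warmupClasspath, String driverClassName)
            throws Exception {
        // Load the warmup code in its own child classloader.
        URLClassLoader loader = new URLClassLoader(
                warmupClasspath, ThrowawayWarmup.class.getClassLoader());
        Class<?> driverClass = Class.forName(driverClassName, true, loader);
        Runnable driver = (Runnable) driverClass.newInstance();

        driver.run();   // push dummy transactions through the hot paths

        // Drop every reference to the warmup world; once unreachable, a later
        // full collection may unload these classes and reclaim any JIT code
        // that belongs only to them.
        driver = null;
        driverClass = null;
        loader = null;
        System.gc();
    }
}
```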

>
> How about using the Java Real Time System? I understand it allows for pre-run
> compilation to avoid JIT delays.

It supports the notion of 'training runs' that generate lists of classes to load
and methods to compile at initialization time. And yes, this shifts any JIT related
jitter to initialization time, when response times are of less concern.

However, it's important to note that 'real-time' doesn't mean 'real-fast'. The real-time
space is designed to provide for predictable response times, and not necessarily
high throughput. It's a classical throughput/response time trade-off. So, you could
certainly use the Java RTS system and get initialization time compilation and also
get better response times across the board. However, if your application is more
concerned with throughput than response times, then you will likely need to use
more systems to achieve the same throughput you are getting today.

If you are already driving an existing system with very high throughputs, you
are likely sacrificing response times anyway, as systems sustaining high throughput
rates typically end up building queues that can increase transaction response times.
Tuning that system to minimize response times would likely result in throughputs that
are similar to the throughputs you would achieve with Java RTS; but Java RTS is
still more likely to give you more predictable response time distributions.

Finally, to achieve the best response times with Java RTS, you'll probably need to
make some code modifications. For example, you'll likely want your threads to be
javax.realtime.RealtimeThread threads instead of java.lang.Thread threads. This is
actually a simple change to make if you own all of your code, but if you rely on
third party software, like J2EE servers for example, then you'd need that system to
support Java RTS in order to take advantage of all the system has to offer.