1.5.0_06 performance anomaly

16 replies
alanstange
Offline
Joined: 2003-06-12
Points: 0

This code was posted in a mustang bug report:

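import java.util.*;
import java.text.*;
import java.io.*;

public class TestEscapeAnalysis
{
    private static final int COUNT = 100000000;

    public static void main(String[] args) throws Exception {
        // Repeat the test nine times; each call prints one timing line.
        test();
        test();
        test();
        test();
        test();
        test();
        test();
        test();
        test();
    }

    private static void test() {
        int x = 0;
        Object lock = new Object();
        long ts = System.currentTimeMillis();
        // NOTE: the posted listing is cut off after "for (int i=0; i".
        // Everything from the loop condition onward is a reconstruction
        // inferred from the output quoted in the replies below; it is
        // not copied from the bug report.
        for (int i = 0; i < COUNT; i++) {
            synchronized (lock) {
                x++;
            }
        }
        System.out.println(x + ", time=" + (System.currentTimeMillis() - ts) / 1000.0);
    }
}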

briand
Offline
Joined: 2005-07-11
Points: 0

The HotSpot JVM knows when it's running on a single-processor system and applies certain locking optimizations that result in better performance compared to multi-processor systems (including hyper-threaded cores).

For Solaris users, using psradm to disable all but one processor won't fool the JVM into thinking it's running on a single-CPU system. It assumes that processors can come online at a later time and therefore doesn't apply the optimization.
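
As a rough way to see which of these two cases a given machine falls into, a program can ask the JVM how many processors it reports. The sketch below uses the standard Runtime.availableProcessors() call; the class name ProcessorCount is only illustrative, and the assumption that this number mirrors HotSpot's internal single-processor check is mine, not something stated in this thread.

```java
// Sketch: print the processor count the JVM reports to application code.
// Assumption: this is only a hint; HotSpot's own uniprocessor check is internal.
class ProcessorCount {

    /** Number of processors the JVM currently reports as available. */
    static int available() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        int n = available();
        System.out.println("availableProcessors = " + n);
        System.out.println(n > 1
                ? "multi-processor (or hyper-threaded) system"
                : "single-processor system");
    }
}
```

Running it next to TestEscapeAnalysis on each machine makes it easier to compare timings between single-processor and multi-processor boxes.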

I noticed that one of your blog entries recommends -XX:+UseBiasedLocking but says it's only useful for single-threaded code. That's not true: it's also useful for multi-threaded code that has uncontended locking, which is far more prevalent than one might think. Our SPECjbb2005 results demonstrate this.

I'm not sure what's happening with ArrayList in your experiments. I'd have to see the code or some gc logs to be sure.

Brian

linuxhippy
Offline
Joined: 2004-01-07
Points: 0

There's one thing I still don't know:

How expensive is biased locking in the uncontended case if many different threads take the lock? This is the case, for example, when creating thread-safe pools.
Is it faster or slower than the default implementation?

Thank you in advance, lg Clemens

kbr
Offline
Joined: 2003-06-16
Points: 0

Biased locking is effective when locking is not only uncontended but the locked objects are also not shared between threads. When a thread attempts to lock an object that is biased toward a different thread, a relatively expensive bias revocation operation is necessary. We have added heuristics to the Java HotSpot VM to detect situations where the optimization is not performing well and to either "re-bias" objects in bulk or disable it for the data types involved. In general we have found that these heuristics allow the optimization to be effective where it can be, and to not slow the system down in situations where it is not effective.
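
To make the two situations concrete, here is a minimal sketch of both locking patterns: a lock that only one thread ever takes, and a lock that several threads take strictly one after another (uncontended, but shared). The class and method names are made up for illustration, and the timings it prints are only indicative; it does not show how HotSpot implements revocation or re-biasing.

```java
// Sketch only: two uncontended locking patterns. Names are illustrative.
class BiasPatterns {

    /** One thread locks an object that no other thread ever touches. */
    static int lockPrivately(int iterations) {
        Object lock = new Object();
        int x = 0;
        for (int i = 0; i < iterations; i++) {
            synchronized (lock) {
                x++;
            }
        }
        return x;
    }

    /** Several threads lock the same object, but strictly one after another. */
    static int lockInTurns(int threads, final int iterationsPerThread) {
        final Object lock = new Object();
        final int[] counter = {0};
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < iterationsPerThread; i++) {
                        synchronized (lock) {
                            counter[0]++;
                        }
                    }
                }
            });
            worker.start();
            try {
                // The next thread starts only after this one has finished,
                // so the lock is shared between threads but never contended.
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        synchronized (lock) {
            return counter[0];
        }
    }

    public static void main(String[] args) {
        int n = 1000000;
        long t0 = System.currentTimeMillis();
        int a = lockPrivately(4 * n);
        long t1 = System.currentTimeMillis();
        int b = lockInTurns(4, n);
        long t2 = System.currentTimeMillis();
        System.out.println("private:  " + a + ", time=" + (t1 - t0) / 1000.0);
        System.out.println("in turns: " + b + ", time=" + (t2 - t1) / 1000.0);
    }
}
```

The second pattern is roughly the thread-safe pool case asked about above: every acquisition is uncontended, yet the lock keeps changing owner.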

You can find more detail on our biased locking implementation in a paper to be published at this year's OOPSLA 2006 conference. I'll try to make this paper available on the Sun Labs web site as well.

linuxhippy
Offline
Joined: 2004-01-07
Points: 0

Thanks a lot for the clarification :-)

olsonje
Offline
Joined: 2005-08-10
Points: 0

I thought I'd check what mine would do on that, and here it is from two different machines.

P4 (No HT nor Duel Core) 2.8GHZ 2GB WinXP(all patches)
H:\>java -version
java version "1.5.0_06"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)
Java HotSpot(TM) Client VM (build 1.5.0_06-b05, mixed mode, sharing)

H:\>java -client TestEscapeAnalysis
100000000, time=1.203
100000000, time=1.25
100000000, time=1.234
100000000, time=1.25
100000000, time=1.219
100000000, time=1.688
100000000, time=1.297
100000000, time=1.234
100000000, time=1.219

P4-D(uel Core) 2.8GHZ 2GB WinXP(all patches)
H:\>java -version
java version "1.5.0_06"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)
Java HotSpot(TM) Client VM (build 1.5.0_06-b05, mixed mode, sharing)

H:\>java -client TestEscapeAnalysis
100000000, time=7.024
100000000, time=7.051
100000000, time=7.051
100000000, time=7.051
100000000, time=7.024
100000000, time=7.038
100000000, time=7.024
100000000, time=7.078
100000000, time=7.051

alexlamsl
Offline
Joined: 2004-09-02
Points: 0

> P4-D(uel Core) 2.8GHZ 2GB WinXP(all patches)

OT: wow, that's one aggressive CPU! :-D

olsonje
Offline
Joined: 2005-08-10
Points: 0

> > P4-D(uel Core) 2.8GHZ 2GB WinXP(all patches)
>
> OT: wow, that's one aggressive CPU! :-D

You have no idea!! 8|

And yea, I noticed that after I already posted, but I had to get some stuff done asap. Eh, oh well. :)

forax
Offline
Joined: 2004-10-07
Points: 0

It's not a bug, you just see the JIT in action :)

The JIT compiles bytecode to native code at the same
time as the test runs, so it has an impact
on performance (on single-processor hardware).

Furthermore, are you sure that escape analysis was
backported from Mustang to Tiger?

Rémi Forax

alanstange
Offline
Joined: 2003-06-12
Points: 0

The compiler was all done before it got to that point in
the program, based on using -XX:+PrintCompilation.

Do you really think that 4 seconds of CPU is reasonable to compile this program?

I'm not testing escape analysis on the 1.5.0 platform. I'm
pointing out a repeatable performance anomaly.

In fact, turning on -XX:+PrintCompilation results in the
run time for each test being the same again.

olsonje
Offline
Joined: 2005-08-10
Points: 0

> I'm not testing escape analysis on the 1.5.0 platform.
> I'm pointing out a repeatable performance anomaly.

I can't get that anomaly to show at all like yours is showing. But what gets me is that my dual-core P4 has a 7s time vs the 1s time on my old P4. What would cause that?

mthornton
Offline
Joined: 2003-06-10
Points: 0

On a single-core (non-HT) processor the JVM uses simpler code for synchronization than on multi-core machines. The simpler code is a lot faster.

olsonje
Offline
Joined: 2005-08-10
Points: 0

Noted, but that much longer? I mean, the overhead from switching a thread context, and related pieces, between cores shouldn't be _that_ significant, especially when the thing is just a simple linear execution. It's not spawning separate threads trying to hit the same piece or anything; it's completely sequential. I must be missing something here to think that a 6-second difference is a bit unreasonable.

mthornton
Offline
Joined: 2003-06-10
Points: 0

On a multi-processor the 'lock' prefix will have to be used to ensure that the processor executing the code has exclusive access to the data. This no doubt involves bus operations or at least cache synchronization whereas the single core code may be able to execute entirely from the processor cache --- it doesn't have to compete with or notify any other processor.
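
For a feel of what such an atomic acquire looks like from Java, the sketch below builds a toy spin lock on AtomicBoolean.compareAndSet. On x86 multiprocessors a compare-and-set of this kind is typically compiled to a lock-prefixed instruction. This is only an illustration under that assumption: CasLock is a made-up class, not HotSpot's monitor implementation, and a spin lock like this is not a replacement for synchronized.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Toy spin lock: every acquire is one atomic compare-and-set.
// Illustrative only; HotSpot's real monitor code is far more involved.
class CasLock {
    private final AtomicBoolean held = new AtomicBoolean(false);

    void lock() {
        // Atomically flip false -> true; retry while another thread holds it.
        while (!held.compareAndSet(false, true)) {
            Thread.yield();
        }
    }

    void unlock() {
        held.set(false);
    }

    /** Same shape as the test loop in the first post, using the toy lock. */
    static int countWithCasLock(int iterations) {
        CasLock lock = new CasLock();
        int x = 0;
        for (int i = 0; i < iterations; i++) {
            lock.lock();
            try {
                x++;
            } finally {
                lock.unlock();
            }
        }
        return x;
    }

    public static void main(String[] args) {
        long ts = System.currentTimeMillis();
        int x = countWithCasLock(100000000);
        System.out.println(x + ", time=" + (System.currentTimeMillis() - ts) / 1000.0);
    }
}
```

Even with a single thread and no contention, each pass through the loop still pays for one atomic operation, which is the cost being discussed here.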

mlee888
Offline
Joined: 2006-03-31
Points: 0

A single processor does indeed seem to perform a lot faster when synchronized is involved, but not always. I recently made this observation in a real app as well as in a few simple tests. I have a short write-up here.

http://mlee888.wordpress.com

greggwon
Offline
Joined: 2003-06-14
Points: 0

This example just points out that you really need to think about the code paths that all threads are going through.

Also, it is common in applications to use individual threads to process a complete compute path for a particular operation to avoid synchronization with another thread.

The new java.util.concurrent classes in JDK 1.5 provide less interfering rendezvous in some cases, which we should all learn to take advantage of.
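
A small sketch of both ideas, assuming a simple counting workload like the one in the original test: one variant shares a java.util.concurrent.atomic.AtomicLong between threads instead of a synchronized block, and the other lets each thread work on a private local value and publish it once at the end. The class and method names are invented for the example.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: two ways to count across threads without a synchronized block.
class ConcurrentCounting {

    /** Every increment goes through one shared AtomicLong. */
    static long sharedAtomic(int threads, final int perThread) {
        final AtomicLong total = new AtomicLong();
        runAll(threads, new Runnable() {
            public void run() {
                for (int i = 0; i < perThread; i++) {
                    total.incrementAndGet();
                }
            }
        });
        return total.get();
    }

    /** Each thread counts privately and touches shared state only once. */
    static long threadConfined(int threads, final int perThread) {
        final AtomicLong total = new AtomicLong();
        runAll(threads, new Runnable() {
            public void run() {
                long local = 0;
                for (int i = 0; i < perThread; i++) {
                    local++;
                }
                total.addAndGet(local);
            }
        });
        return total.get();
    }

    /** Starts the given number of threads on the task and waits for all of them. */
    private static void runAll(int threads, Runnable task) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(task);
            workers[i].start();
        }
        for (int i = 0; i < threads; i++) {
            try {
                workers[i].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("shared atomic:   " + sharedAtomic(4, 1000000));
        System.out.println("thread confined: " + threadConfined(4, 1000000));
    }
}
```

The second variant is the "individual threads process a complete compute path" approach: the hot loop involves no shared state at all, so there is nothing to synchronize until the results are combined.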

briand
Offline
Joined: 2005-07-11
Points: 0

While I don't have an explanation for the anomaly described in the original post (I couldn't reproduce it on the machines I have available to me), I do have a suggestion for improving codes that have uncontended synchronization, as this code does.

In 1.5.0_06, we added a feature called Biased Locking,
which is disabled by default. To enable it, add the following to the command line:

-XX:+UseBiasedLocking

For applications with uncontended synchronization, this option can make a huge difference in performance, particularly on multi-processor machines. Here are some results from an oldish Linux box:

> java -client -Xbatch TestEscapeAnalysis
100000000, time=5.877
100000000, time=5.873
100000000, time=5.867
100000000, time=5.874
100000000, time=5.869
100000000, time=5.867
100000000, time=5.866
100000000, time=5.86
100000000, time=5.864
> java -client -XX:+UseBiasedLocking TestEscapeAnalysis
100000000, time=6.571
100000000, time=1.381
100000000, time=1.376
100000000, time=1.364
100000000, time=1.364
100000000, time=1.363
100000000, time=1.364
100000000, time=1.363
100000000, time=1.365

Note the initial time is roughly similar, but the JVM eventually figures out that the lock is not contended and biases it to the thread that typically accesses it. And yes, it deals with the case where the lock does become contended.

UseBiasedLocking is on by default in Mustang for both -client and -server (though there's some discussion on disabling it by default on -client due to startup impacts). I'd be really interested to hear of any client (read Swing GUI) applications that benefit from -XX:+UseBiasedLocking (particularly the Mustang implementation, which is subtly different from the 1.5.0_06 implementation).

HTH
Brian