AMD64 server anomaly

5 replies
tackline
Joined: 2003-06-19

While running someone else's microbenchmark, I found a case where the AMD64 server VM runs unexpectedly slowly. The code below exercises a singleton with a volatile variable. In 64-bit mode, the server VM runs as slowly as the client VM does in either 32- or 64-bit mode.

$ ~/mustang/jdk1.6.0/bin/java -showversion -server -d64 SingletonPerformance
java version "1.6.0-rc"
Java(TM) SE Runtime Environment (build 1.6.0-rc-b103)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0-rc-b103, mixed mode)

100000000 times: 1048608344 ns
100000000 times: 1025415307 ns
100000000 times: 1022701124 ns
100000000 times: 1022228510 ns
100000000 times: 1029915038 ns

$ ~/mustang/jdk1.6.0/bin/java -showversion -server -d32 SingletonPerformance
java version "1.6.0-rc"
Java(TM) SE Runtime Environment (build 1.6.0-rc-b103)
Java HotSpot(TM) Server VM (build 1.6.0-rc-b103, mixed mode)

100000000 times: 365877552 ns
100000000 times: 308058603 ns
100000000 times: 302266772 ns
100000000 times: 304585020 ns
100000000 times: 302302874 ns

The machine is an Ultra-20 with an Opteron 148 running "SunOS unknown 5.10 Generic_118855-19 i86pc i386 i86pc".

Any ideas what's up with it?

[code]
class SingletonPerformance {
    public static void main(String[] args) {
        final int times = Integer.getInteger("times", 100*1000*1000);
        for (int ct=0; ct<5; ++ct) {
            long start = System.nanoTime();
            run(times);
            long end = System.nanoTime();
            System.out.println(times+" times: "+(end-start)+" ns");
        }
    }
    private static void run(int times) {
        for (int ct=0; ct

tackline
Joined: 2003-06-19

Looking at the cycles per iteration, the times actually seem infeasibly fast.

So I wrote a little test. It fails on 1.5.0_09-b01 and 1.6.0-rc-b104. Is this a bug?

[Update: I'm not sure about the rate of thread switches. Increasing the sleep from 10 to 100 ms decreases the frequency of errors, but does not eliminate them. Possibly that's due to recompilation pauses? Is the unexpected speed due to loop unrolling?]

[code]
class VolatileTest {
    public static void main(String[] args) {
        final long times = Long.getLong("times", 1000*1000*1000);
        final Thread mainThread = Thread.currentThread();
        Thread thread = new Thread(new Runnable() { public void run() {
            try {
                execute(times);
            } finally {
                mainThread.interrupt();
            }
        }});
        /*thread.setPriority(3);*/
        thread.start();
        try {
            check();
        } catch (InterruptedException exc) {
            /* End of test. */
        }
    }
    private static volatile int j;
    private static volatile int k;
    private static void execute(long times) {
        for (long ct=0; ct<times; ++ct) {
            j = k;
        }
        j = -1;
    }
    private static void check() throws InterruptedException {
        int value = 0;
        for (;; ) {
            k = value;
            Thread.sleep(10);
            int read = j;
            if (read == -1) {
                return;
            }
            if (read != value) {
                throw new RuntimeException(java.text.MessageFormat.format(
                    "Expected {0}, was {1}",
                    value, read
                ));
            }
            ++value;
        }
    }
}
[/code]

Message was edited by: tackline

briand
Joined: 2005-07-11

I believe the cost of the volatile int is what's causing the slowdown, but that's not the only problem with this code. It uses the 'Double-Checked Locking' idiom, which is a much-discussed anti-pattern. See Item #48 in Josh Bloch's Effective Java and the following article:

http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html

Eliminating the double-checked locking is not going to help the performance issue, though. The performance issue is a classic microbenchmark issue. Essentially, the cost comes from ++i, where i is declared volatile. No real application is going to increment such a variable as frequently as this code does.

If you remove the volatile modifier from integer i and use the Holder pattern described in Effective Java instead of the double-checked locking, you'll get better results.

Here's how I modified the class:

[code]
class SingletonPerformanceHolder {
    static final SingletonPerformance instance = new SingletonPerformance();
}

class SingletonPerformance {
    private static int i;

    public static void main(String[] args) {
        final int times = Integer.getInteger("times", 100*1000*1000);
        for (int ct=0; ct<5; ++ct) {
            long start = System.nanoTime();
            run(times);
            long end = System.nanoTime();
            System.out.println(times+" times: "+(end-start)+" ns, i = " + i);
        }
    }
    private static void run(int times) {
        for (int ct=0; ct<times; ++ct) {
            getInstance().call();
        }
    }

    public static SingletonPerformance getInstance() {
        return SingletonPerformanceHolder.instance;
    }

    public void call() {
        ++i;
    }
}
[/code]

Note that I use the Holder pattern to initialize the instance reference, changed int i to be static and non-volatile, and preserved the result by printing it out.

Here are my results on a Sun W2100Z with 2x2.6GHz Opterons running Solaris 10:

Original code:

topflite> java -client OrigSingletonPerformance
100000000 times: 851757640 ns
100000000 times: 858506425 ns
100000000 times: 869869707 ns
100000000 times: 880466264 ns
100000000 times: 873599771 ns
topflite> java -server OrigSingletonPerformance
100000000 times: 274721435 ns
100000000 times: 272819143 ns
100000000 times: 270726390 ns
100000000 times: 269221667 ns
100000000 times: 269125573 ns

Modified version:
topflite> java -client SingletonPerformance
100000000 times: 270540417 ns, i = 100000000
100000000 times: 251665695 ns, i = 200000000
100000000 times: 252359276 ns, i = 300000000
100000000 times: 256313951 ns, i = 400000000
100000000 times: 252289086 ns, i = 500000000
topflite> java -server SingletonPerformance
100000000 times: 12999334 ns, i = 100000000
100000000 times: 8169726 ns, i = 200000000
100000000 times: 8036586 ns, i = 300000000
100000000 times: 8064307 ns, i = 400000000
100000000 times: 8042331 ns, i = 500000000

Of course, your mileage may vary depending on how frequently the application accesses the singleton. Clearly, with -server this performs rather well (0.13 ns/call vs. the original getting 2.74 ns/call, worst case).
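
For anyone who wants to redo the per-call arithmetic, here is a minimal sketch. The class and method names (PerCallCost, nsPerCall, cyclesPerCall) are made up for illustration; the timings are copied from the -server runs above, and the 2.6 GHz clock is the one quoted for the W2100Z.

```java
// Hypothetical helper; not part of the benchmark posted in this thread.
final class PerCallCost {
    /** Average nanoseconds per call for one timed run. */
    static double nsPerCall(long elapsedNs, long calls) {
        return (double) elapsedNs / calls;
    }

    /** Converts ns per call into clock cycles per call (1 GHz = 1 cycle per ns). */
    static double cyclesPerCall(double nsPerCall, double clockGHz) {
        return nsPerCall * clockGHz;
    }

    public static void main(String[] args) {
        long calls = 100000000L;      // iterations per timed run
        double clockGHz = 2.6;        // clock rate quoted for the W2100Z
        long[] elapsedNs = {
            274721435L,               // original code, -server, first run
            12999334L,                // modified code, -server, first run
            8036586L                  // modified code, -server, third run
        };
        for (int n = 0; n < elapsedNs.length; ++n) {
            double ns = nsPerCall(elapsedNs[n], calls);
            System.out.printf("%d ns total: %.4f ns/call, %.3f cycles/call%n",
                elapsedNs[n], ns, cyclesPerCall(ns, clockGHz));
        }
    }
}
```

The steady-state modified runs come out at roughly 0.08 ns/call, or about 0.2 cycles per call at 2.6 GHz. Less than one cycle per call would be consistent with the JIT having optimized the loop itself rather than performing one increment per iteration, though that is a guess and not something the numbers prove.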

Message was edited by: briand

tackline
Joined: 2003-06-19

I understand that double-checked locking is a bad thing (although it does work with volatiles and the new JMM). What I am trying to understand is why, and under what circumstances, AMD64 has dramatically worse performance in 64-bit mode than in 32-bit mode. It appears to be related to the use of two volatiles.

w.r.t. code formatting: Open with "[" followed by "code" followed by "]". Close is the same but with "/" inserted in the obvious place.

briand
Joined: 2005-07-11

You are correct: double-checked locking does work with the new JMM, because the semantics of volatile have changed such that accesses to such variables require store barriers, which are expensive. However, using double-checked locking is still not recommended.
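
Since the original listing is cut off in this thread, here is a rough sketch of what volatile-based double-checked locking typically looks like under the Java 5+ memory model. It is not the original poster's code: the names LazySingleton, instance, and calls are invented, and the volatile counter is only an assumption based on the "++i" and "two volatiles" remarks in this thread.

```java
// Hypothetical sketch; not the truncated listing from the first post.
final class LazySingleton {
    // volatile is what makes the idiom safe under the Java 5+ memory model.
    private static volatile LazySingleton instance;

    // Assumed volatile counter, standing in for the "++i" discussed here.
    private volatile int calls;

    private LazySingleton() { }

    static LazySingleton getInstance() {
        LazySingleton local = instance;          // first check: volatile read, no lock
        if (local == null) {
            synchronized (LazySingleton.class) {
                local = instance;                // second check, under the lock
                if (local == null) {
                    local = new LazySingleton();
                    instance = local;            // volatile write publishes the object
                }
            }
        }
        return local;
    }

    void call() {
        ++calls;   // a volatile read followed by a volatile write; not atomic
    }

    int calls() {
        return calls;
    }

    public static void main(String[] args) {
        int times = 1000000;
        for (int ct = 0; ct < times; ++ct) {
            getInstance().call();
        }
        System.out.println("calls = " + getInstance().calls());
    }
}
```

Every loop iteration in this sketch touches two volatiles (the instance reference and the counter), which is the access pattern the thread suspects of being expensive.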

With 64-bit, I get similar results to the 32-bit results:

topflite> java -server -d64 OrigSingletonPerformance
100000000 times: 290735091 ns
100000000 times: 280864210 ns
100000000 times: 247478996 ns
100000000 times: 249781141 ns
100000000 times: 246508721 ns
topflite> java -server -d64 SingletonPerformance
100000000 times: 15661974 ns, i = 100000000
100000000 times: 9715896 ns, i = 200000000
100000000 times: 9677065 ns, i = 300000000
100000000 times: 9694409 ns, i = 400000000
100000000 times: 9662839 ns, i = 500000000

So, I'm not seeing the same difference that you are. I'm running with JDK 6, build 104, which has little difference from build 103. Also, I believe your system is a uni-processor (single core), whereas I have two single core processors; however, I would expect the multi-processor to have more expensive memory barriers than the uni-processor. Even when I disable one of my processors (and trick the JVM into thinking there's only one processor available, thus simulating a true uni-processor system), I still don't get your numbers. Since I can't reproduce it, I can't speculate on what might be the cause of the slowness you are seeing.
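
When comparing numbers between machines, it may help to print what the JVM itself reports. The following is a small sketch using only standard system properties and Runtime.availableProcessors(); the class name VmEnvironment is made up for illustration.

```java
// Hypothetical helper for reporting the environment alongside benchmark output.
final class VmEnvironment {
    /** Builds a one-line summary of the VM, architecture, and processor count. */
    static String describe() {
        int cpus = Runtime.getRuntime().availableProcessors();
        return System.getProperty("java.vm.name")
            + " " + System.getProperty("java.vm.version")
            + ", os.arch=" + System.getProperty("os.arch")
            + ", processors=" + cpus;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

The processor count shown is whatever the JVM sees at that moment, so it should reflect a disabled or masked processor as well.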

What numbers do you get from my modified version of the microbenchmark?

BTW - thanks for the '[' code ']' tip. I was trying various combinations of HTML tags, none of which worked. Unfortunately, the help and FAQ sections of the website don't mention how to format code (though they do mention how to embolden, italicize, or underline text, as if I couldn't figure that out ;-)).

ahmetaa
Joined: 2003-06-21

If you are using Solaris, why not analyze it with DTrace?