
1.6 server vm performance

18 replies
tmb
Offline
Joined: 2006-02-12

Hi everyone,

I ran some tests to find out how well the 1.6 VM performs. The improvements in the client VM are really great; however, I see absolutely zero improvement in the server VM compared to 1.5.
The test (compressing/decompressing a number of files several thousand times in memory) takes 41% longer with the server VM than with the client VM.

Does anyone have an explanation for this or a suggestion how to improve the server vm performance?

Thomas

tmb
Offline
Joined: 2006-02-12

> The reason the server VM performs so poorly is that
> you don't feed it a realistic workload before
> compilation. You only ever call your methods with
> LOOPS=1, so HotSpot thinks "hey, this is strange,
> I'll optimize for the case LOOPS=1".
>
Well, it doesn't make much difference. If I change the LOOPS value in your code to the value used in my code, I get nearly the same results; it just takes a bit longer:
1.6.0-beta2-b73 / Java HotSpot(TM) Server VM
2.454 secs 2.75 secs 15.765 secs
2.453 secs 2.719 secs 15.766 secs
10.5 secs 7.656 secs 2.797 secs <<----------
10.515 secs 7.657 secs 2.812 secs

Running your code:
1.6.0-beta2-b73 / Java HotSpot(TM) Server VM
0.172 secs 0.172 secs 1.0 secs
0.156 secs 0.172 secs 0.984 secs
0.657 secs 0.484 secs 0.172 secs
0.656 secs 0.469 secs 0.187 secs

1.6.0-beta2-b73 / Java HotSpot(TM) Client VM
0.641 secs 0.625 secs 0.593 secs
0.641 secs 0.625 secs 0.578 secs
0.469 secs 0.469 secs 0.515 secs
0.469 secs 0.469 secs 0.515 secs

I also tried jdk 1.1.8:
null / null
0.454 secs 0.594 secs 0.406 secs
0.453 secs 0.609 secs 0.391 secs
0.469 secs 0.593 secs 0.407 secs
0.453 secs 0.609 secs 0.406 secs
0.454 secs 0.609 secs 0.406 secs

Intriguing ;)
Compared to the compiled code, 1.1.8 gets better results than both 1.6 VMs in test1, and better results in test3 than anything the 1.6 client VM produces.

tmarble
Offline
Joined: 2003-08-22

Remember that the server VM is designed to make long-running server applications run fast. It does this by taking the time to heavily optimize hot spots in the code.

Because of the relatively short run time of your tests (~2 min), the server VM is at a disadvantage: you are paying the cost of warming up the server VM infrastructure and doing optimizing compilation. If your run times in production will be this short, then the client VM is probably the right choice for you.

Do you want to see this for yourself? Download the visualgc tool [1] for J2SE 1.5 and watch the compiles happen in both cases. Note that visualgc is not yet compatible with Java 6, but the core jvmstat tools such as jstat are included in the JDK.
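For example, once you have the process id from jps, the following jstat invocations show compiler activity (the <pid> is of course just a placeholder):
[code]
jps                                  # list running Java processes and their PIDs
jstat -compiler <pid>                # cumulative JIT compilation statistics
jstat -printcompilation <pid> 1000   # most recently compiled method, sampled every second
[/code]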

Regards,

--Tom

[1] http://java.sun.com/performance/jvmstat

k_v_n
Offline
Joined: 2006-03-01

Tmarble is correct.
A Java program starts by executing in the interpreter. During this time the VM collects statistics about bytecode execution, including the number of times each method is called. When the invocation counter reaches a threshold, the VM compiles the method. The client and server VMs have different compilation thresholds: 1000 invocations for client and 10000 for server. You can change it with -XX:CompileThreshold=n.

Try setting -XX:CompileThreshold=1000 for the server VM in your test.
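For example (using the ArrayDemo class posted later in this thread purely as an illustration):
[code]
java -server -XX:CompileThreshold=1000 -XX:+PrintCompilation ArrayDemo
[/code]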

tmb
Offline
Joined: 2006-02-12

Thanks for your suggestions, but it's not warmup time that is causing the performance difference.
I tried running with -XX:CompileThreshold=500 and it makes no difference, at least none that is relevant.
I also tried -XX:+UseParallelGC -XX:+AggressiveOpts.
jstat doesn't show more than -XX:+PrintCompilation does, IMHO, so it does not really help here.
Cool'n'Quiet was also disabled while running the test.

To be sure, I modified my test to run 10 times as long; the results remain unchanged: server VM 1324.301 secs, client VM 938.205 secs. Note that while running the test, the compression method was invoked 602352 times and the decompression method 1204704 times.
Is there a way to get the generated code somehow, so I could compare it?

My theory is that the server VM generates slower code than the client VM once a method exceeds a certain level of complexity. I wrote a series of array copy methods for testing, and the results of running these support my theory.

I wonder if Sun sees this as a bug that needs to be fixed or just as a (known?) problem the server VM has with some code.

linuxhippy
Offline
Joined: 2004-01-07

Don't forget that you are also measuring the time the server compiler needs for compiling; the server compiler itself is much slower than the client compiler since it performs more advanced optimizations.
Also keep in mind that lowering the compile threshold makes profiling less useful.

So the best approach would be to first run a warmup phase that is not included in the measurements, and only afterwards start your real measurements, as in the sketch below.
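A minimal sketch of that pattern; the class name, workload() and the iteration counts are made up for illustration and are not part of the original test:
[code]
// Warmup-then-measure sketch. WarmupDemo, workload() and the iteration
// counts are placeholders only.
public class WarmupDemo {
    static volatile long sink; // keeps the JIT from eliminating the work

    static void workload() {
        long s = 0;
        for (int i = 0; i < 10000; i++) s += i * 31;
        sink = s;
    }

    public static void main(String[] args) {
        // Warmup phase: give HotSpot a realistic workload and time to compile.
        for (int i = 0; i < 20000; i++) workload();

        // Measured phase: only this part is timed.
        long t0 = System.nanoTime();
        for (int i = 0; i < 200000; i++) workload();
        long t1 = System.nanoTime();
        System.out.println(((t1 - t0) / 1000000) + " ms");
    }
}
[/code]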

Under normal circumstances the server compiler generates code that is overall 20-40% faster, but you're right that there are cases where the optimization routines don't produce optimal code.

It would be really great if you could create a test case that shows the problem and post it here, or better, upload it somewhere and post the URL.

lg Clemens

tmb
Offline
Joined: 2006-02-12

Running the code below yields the following results on my Opteron 165, jdk 1.6.0-beta2-b73:
Server VM: 9.578 secs 8.136 secs 2.778 secs
Client VM: 7.765 secs 7.61 secs 8.253 secs

So both test1() and test2() run faster on the client VM than on the server VM.

Running the code on an Athlon 1200:
Server VM: 18.649 secs 12.733 secs 4.698 secs
Client VM: 15.264 secs 13.708 secs 13.722 secs

Here test1() is faster with the client VM than with the server VM.

What is strange, though, is that on the Opteron I get different results across server VM invocations, but constant measurements within one invocation.
For example one time I might get:
10.559 secs 8.524 secs 2.398 secs
10.567 secs 8.522 secs 2.402 secs
10.56 secs 8.527 secs 2.399 secs
10.564 secs 8.524 secs 2.403 secs

and the next time:
9.578 secs 8.136 secs 2.778 secs
9.583 secs 8.135 secs 2.784 secs
9.577 secs 8.135 secs 2.774 secs
9.572 secs 8.127 secs 2.779 secs
9.57 secs 8.131 secs 2.781 secs

Pretty strange. Does not seem to happen on the Athlon 1200.

[code]

public class ArrayDemo
{
    private final int[] a = new int[20*1024];
    private final int[] b = new int[a.length];
    private final int LOOPS = 80000;

    public ArrayDemo()
    {
        // warmup: call each test 15001 times with a single loop iteration
        for(int k = 0; k < 15001; k++) {
            test1(1);
            test2(1);
            test3(1);
        }

        init = false;

        // measured runs
        for(int k = 0; k < 5; k++) {
            test1(LOOPS);
            test2(LOOPS);
            test3(LOOPS);
            System.out.println();
        }
    }

    private long start;
    private boolean init = true;

    private void start()
    {
        start = System.nanoTime();
    }

    private void end()
    {
        if(init) return;
        long dur = (System.nanoTime() - start) / 1000000;
        System.out.print((dur/1000d)+" secs ");
    }

    // chunked copy of b into a, with a sentinel check that breaks out of the inner loop
    private void test1(final int loops)
    {
        start();

        for(int k = 0; k < loops; k++) {
            int ia = 0;
            int ib = 0;
            while(true) {
                int t = a.length - ia;
                if(t == 0) break;
                if(t > 1024) t = 1024;
                do {
                    if(b[ib] == 0xfe) break;
                    a[ia++] = b[ib++];
                } while(--t > 0);
            }
        }

        end();
    }

    // same as test1, but the sentinel check has no effect (break commented out)
    private void test2(final int loops)
    {
        start();

        for(int k = 0; k < loops; k++) {
            int ia = 0;
            int ib = 0;
            while(true) {
                int t = a.length - ia;
                if(t == 0) break;
                if(t > 1024) t = 1024;
                do {
                    if(b[ib] == 0xfe)/*break*/;
                    a[ia++] = b[ib++];
                } while(--t > 0);
            }
        }

        end();
    }

    // same copy loop without any sentinel check
    private void test3(final int loops)
    {
        start();

        for(int k = 0; k < loops; k++) {
            int ia = 0;
            int ib = 0;
            while(true) {
                int t = a.length - ia;
                if(t == 0) break;
                if(t > 1024) t = 1024;
                do {
                    /* if(b[ib] == 0xfe) break; */
                    a[ia++] = b[ib++];
                } while(--t > 0);
            }
        }

        end();
    }

    public static void main(String[] args)
    {
        System.out.println(System.getProperty("java.vm.version")+" / "+System.getProperty("java.vm.name"));
        new ArrayDemo();
    }
}

[/code]

linuxhippy
Offline
Joined: 2004-01-07

The reason the server VM performs so poorly is that you don't feed it a realistic workload before compilation.
You only ever call your methods with LOOPS=1, so HotSpot thinks "hey, this is strange, I'll optimize for the case LOOPS=1".

I modified your testcase a bit:
[code]
public class ArrayDemo
{
    private final int[] a = new int[20*1024];
    private final int[] b = new int[a.length];
    private final int LOOPS = 5000;

    public ArrayDemo()
    {
        /* for(int k = 0; k < 15001; k++)
        {
            test1(1);
            test2(1);
            test3(1);
        } */

        init = false;

        for(int k = 0; k < 500; k++) {
            test1(LOOPS);
            test2(LOOPS);
            test3(LOOPS);
            System.out.println();
        }
    }

    private long start;
    private boolean init = true;

    private void end()
    {
        if(init) return;
        long dur = (System.currentTimeMillis() - start);
        System.out.print((dur/1000d)+" secs ");
    }

    private void test1(final int loops)
    {
        start = System.currentTimeMillis();

        for(int k = 0; k < loops; k++) {
            int ia = 0;
            int ib = 0;
            while(true) {
                int t = a.length - ia;
                if(t == 0) break;
                if(t > 1024) t = 1024;
                do {
                    if(b[ib] == 0xfe) break;
                    a[ia++] = b[ib++];
                } while(--t > 0);
            }
        }

        end();
    }

    private void test2(final int loops)
    {
        start = System.currentTimeMillis();

        for(int k = 0; k < loops; k++) {
            int ia = 0;
            int ib = 0;
            while(true) {
                int t = a.length - ia;
                if(t == 0) break;
                if(t > 1024) t = 1024;
                do {
                    if(b[ib] == 0xfe)/*break*/;
                    a[ia++] = b[ib++];
                } while(--t > 0);
            }
        }

        end();
    }

    private void test3(final int loops)
    {
        start = System.currentTimeMillis();

        for(int k = 0; k < loops; k++) {
            int ia = 0;
            int ib = 0;
            while(true) {
                int t = a.length - ia;
                if(t == 0) break;
                if(t > 1024) t = 1024;
                do {
                    /* if(b[ib] == 0xfe) break; */
                    a[ia++] = b[ib++];
                } while(--t > 0);
            }
        }

        end();
    }

    public static void main(String[] args)
    {
        System.out.println(System.getProperty("java.vm.version")+" / "+System.getProperty("java.vm.name"));
        new ArrayDemo();
    }
}
[/code]
and this shows that the server VM usually performs better than the client VM; however, there's something strange happening somewhere: at some point the server VM gets slower than it already was :-/

Here are my benchmark results with the code above on a Duron800, a quite outdated CPU:
1.6.0-beta2-b72 / Java HotSpot(TM) Server VM
2.871 secs 2.827 secs 5.525 secs
2.742 secs 2.772 secs 5.473 secs
4.665 secs 4.336 secs 2.784 secs <<--- What happens here?
4.661 secs 4.334 secs 2.78 secs

1.4.2_06-b03 / Java HotSpot(TM) Server VM
3.104 secs 3.01 secs 4.953 secs <<-- Here the same, but earlier
4.827 secs 4.2 secs 2.82 secs
4.832 secs 4.207 secs 2.823 secs
4.832 secs 4.208 secs 2.829 secs

1.3.1_16-b06 / Java HotSpot(TM) Server VM
6.312 secs 6.292 secs 5.875 secs
4.729 secs 4.663 secs 4.676 secs
4.72 secs 4.686 secs 4.666 secs
-->> No bug, but slow ;)

So it seems something really does go wrong: once HotSpot is done compiling, the code performs worse than before.

This is the output of PrintCompilation:
1.6.0-beta2-b72 / Java HotSpot(TM) Server VM
1% ArrayDemo::test1 @ 46 (103 bytes)
3.077 secs(1) 2% ArrayDemo::test2 @ 46 (100 bytes)
3.033 secs(2) 3% ArrayDemo::test3 @ 46 (87 bytes)
5.65 secs(3)
1 ArrayDemo::test1 (103 bytes)
2.946 secs(1) 2 ArrayDemo::test2 (100 bytes)
2.975 secs(2) 3 ArrayDemo::test3 (87 bytes)
5.599 secs(3)
4.778 secs 4.425 secs 2.982 secs

However, my CPU is outdated and has a very small L2 cache. Could anybody please be so kind as to try how the modified version behaves on a P4 and/or AMD64 CPU?

lg Clemens

Message was edited by: linuxhippy

ianschneider
Offline
Joined: 2005-02-11

For what it's worth, here are results from an AMD 4400+:

1.5.0_06-b05 / Java HotSpot(TM) Server VM
0.31 secs 0.331 secs 0.605 secs
0.293 secs 0.311 secs 0.594 secs
0.53 secs 0.505 secs 0.143 secs
0.53 secs 0.506 secs 0.142 secs

1.6.0-rc-b61 / Java HotSpot(TM) Server VM
0.155 secs 0.164 secs 0.881 secs
0.142 secs 0.152 secs 0.87 secs
0.598 secs 0.436 secs 0.176 secs
0.598 secs 0.436 secs 0.175 secs

1.6.0-rc-b61 / Java HotSpot(TM) 64-Bit Server VM
0.491 secs 0.444 secs 0.384 secs
0.476 secs 0.428 secs 0.376 secs
0.557 secs 0.278 secs 0.387 secs
0.557 secs 0.278 secs 0.387 secs

All the runs show stabilization after 4 cycles.
In my own testing (using real applications), I have only noticed improvements with Mustang, on both client and server VMs, 32-bit and 64-bit.

linuxhippy
Offline
Joined: 2004-01-07

> For what it's worth,
Well, I guess this is, or will become, part of a real application too. If the code first performs well and then degrades after some runs so that it's slower than the client VM, it seems HotSpot is doing something suboptimal.
If such cases are reported, the Java performance team is at least able to look at why this happens and which change introduced it (this behaviour came with Java 1.4.0, for example), and they can decide either to live with it (e.g. because it's a side effect of something good) or to fix it.
If many such small fixes are made, chances are good that overall JVM performance will get better and better (and of course new regressions will appear ;) ).

> here are results from AMD4400+
Thanks a lot for testing. Was this with the original or the modified source?

lg Clemens

ianschneider
Offline
Joined: 2005-02-11

> For what it's worth,

I didn't mean much by that at all - just a superfluous prelude - and I agree with everything said.

It was with the modified sources.

Client output for comparison:

1.5.0_06-b05 / Java HotSpot(TM) Client VM
0.882 secs 0.883 secs 0.589 secs
0.879 secs 0.88 secs 0.583 secs
0.877 secs 0.879 secs 0.581 secs

1.6.0-rc-b61 / Java HotSpot(TM) Client VM
0.54 secs 0.54 secs 0.507 secs
0.538 secs 0.538 secs 0.505 secs
0.441 secs 0.441 secs 0.452 secs
0.441 secs 0.441 secs 0.452 secs

linuxhippy
Offline
Joined: 2004-01-07

Well, it seems like this should be reported, since C2 definitely behaves differently than it should.

@tmb would you like to report it? Should I do it?
I would recommend not using the LOOPS=1 source for the report; they tend to ignore any report that seems strange for whatever reason, or you will at least have to fight to get it accepted ;) (although I have to admit that LOOPS=1 really makes no difference here).

lg Clemens

tmb
Offline
Joined: 2006-02-12

> Well, it seems like this should be reported, since C2
> definitely behaves differently than it should.
>
> @tmb would you like to report it? Should I do it?

If you want to report it that's okay with me.

Regards,
Thomas

linuxhippy
Offline
Joined: 2004-01-07

Well, if nobody else does it, I can do it.
Actually, I just did it ;)

I'll post the bugid as soon as the bug gets accepted.

lg Clemens

mjlt33
Offline
Joined: 2003-07-24

I think some cases could be explained by a combination of when your JVM decides to compile each method and when your OS decides to change the speed of your processor.

If your process is running slower and your OS thinks it needs more power, it can decide to speed up your processor, which can make a slower process appear faster than a faster one running on a lower-clocked processor.

http://www.microsoft.com/downloads/details.aspx?FamilyID=2898f8dd-10f8-4...
http://www.epinions.com/cmd-review-2058-189C8D23-39D90F6F-prod3

Have you taken this into account?

linuxhippy
Offline
Joined: 2004-01-07

> Have you taken this into account?
Not at all, since neither machine I tested on has any advanced clocking features, and the slowdown was consistent across all machines.

lg Clemens

k_v_n
Offline
Joined: 2006-03-01

Many performance improvements in 1.6 were backported into 1.5.0_06, so you may not see a difference for the server VM.
What system are you using? How do you run the test?
Do you run separate VMs to compress/decompress separate files, or do you run everything in one VM?
It is known that client VM startup is faster than server VM startup, so if you run separate VMs and each execution time is short, then yes, you can see such a difference.

tmb
Offline
Joined: 2006-02-12

The test runs in one VM. Each file is loaded, then compressed several thousand times, then decompressed several thousand times in memory. Only the time for (de)compression is measured.
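(For illustration only: the codec the test actually uses isn't shown here, but one in-memory round trip of the kind described might look roughly like this, using java.util.zip as an assumed stand-in.)
[code]
// Illustrative only: a single in-memory compress/decompress round trip,
// with java.util.zip standing in for the actual codec used in the test.
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ZipRoundTrip {
    public static void main(String[] args) throws DataFormatException {
        byte[] input = new byte[64 * 1024]; // would be the loaded file contents

        long t0 = System.nanoTime();

        // compress (output buffer assumed large enough for this input)
        Deflater def = new Deflater();
        def.setInput(input);
        def.finish();
        byte[] compressed = new byte[input.length + 64];
        int clen = def.deflate(compressed);
        def.end();

        // decompress back into a buffer of the original size
        Inflater inf = new Inflater();
        inf.setInput(compressed, 0, clen);
        byte[] restored = new byte[input.length];
        inf.inflate(restored);
        inf.end();

        long t1 = System.nanoTime();
        System.out.println(((t1 - t0) / 1000000d) + " ms, compressed to " + clen + " bytes");
    }
}
[/code]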

Results for different VMs, Opteron 165 CPU:
1.5.0_05: 120.516 secs
1.5.0_05 -server: 133.784 secs

1.6.0-b73: 93.672 secs
1.6.0-b73 -server: 133.033 secs

As an experiment, I replaced all my array copying loops in the decompression code with calls to System.arraycopy. This improved the runtime with the 1.6 server VM substantially, but has one drawback: it slows down the client VM as much as it speeds up the server VM.
So this is not really a solution to my problem.
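Roughly, the replacement looks like this (a sketch only; copyChunked is a made-up name, and this test3-style copy obviously can't express the per-element sentinel check from test1):
[code]
// Sketch only: a test3-style chunked copy using System.arraycopy
// instead of the element-by-element loop. The per-element sentinel
// check from test1 cannot be expressed this way.
private void copyChunked(int[] src, int[] dst) {
    int ia = 0;
    while (ia < dst.length) {
        int t = Math.min(dst.length - ia, 1024);
        System.arraycopy(src, ia, dst, ia, t);
        ia += t;
    }
}
[/code]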