
For my array copy test, Mustang is almost 10x slower than Java 1.4?

marko

The performance hotspot of my application is copying the contents of a byte array into a char array (many times). Since Java 1.4, I have seen my code get progressively slower: my test runs almost 10x slower under Mustang than under Java 1.4.

Is it too late to optimize this for Mustang so I can get the same performance I used to get with Java 1.4?

Following is my simplified test, which repeatedly copies a byte buffer into a char buffer.

The performance numbers below are for Java versions 1.6, 1.5, and 1.4, running on Windows XP Pro (version 5.1).

class test
{
    public static void main( String[] args ) throws Exception
    {
        byte[] byteBuf = new byte[10000];
        char[] buf = new char[10000];

        // fill the byte buffer.
        for ( int j = 0; j < byteBuf.length; j++ )
        {
            byteBuf[j] = (byte)j;
        }

        for ( int x = 0; x < 4; x++ )
        {
            long start = System.currentTimeMillis();
            int total = 0;

            for ( int i = 0; i < 100000; i++ )
            {
                for ( int j = 0; j < byteBuf.length; j++ )
                {
                    int b = byteBuf[j];

                    total += b;

                    buf[j] = (char)b;
                }
            }

            System.out.println("total value: " + total);
            System.out.println("total time: " + (System.currentTimeMillis() - start));

            // confirm that bytes were copied.
            for ( int j = 0; j < byteBuf.length; j++ )
            {
                if ( byteBuf[j] != (byte)buf[j] )
                {
                    throw new RuntimeException();
                }
            }

            Thread.sleep(500);
        }
    }
}

>>>> RUNTIME OUTPUT.

C:\dev\test\perf>devenv16
C:\dev\test\perf>java -version
java version "1.6.0-beta"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.6.0-beta-b59g)
Java HotSpot(TM) Client VM (build 1.6.0-beta-b59g, mixed mode)

C:\dev\test\perf>javac *.java

C:\dev\test\perf>java -server test
total value: -487200000
total time: 8093
total value: -487200000
total time: 10703
total value: -487200000
total time: 10266
total value: -487200000
total time: 10265

C:\dev\test\perf>devenv15
C:\dev\test\perf>java -version
java version "1.5.0_02"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_02-b09)
Java HotSpot(TM) Client VM (build 1.5.0_02-b09, mixed mode, sharing)

C:\dev\test\perf>javac *.java

C:\dev\test\perf>java -server test
total value: -487200000
total time: 5016
total value: -487200000
total time: 5765
total value: -487200000
total time: 5734
total value: -487200000
total time: 5703

C:\dev\test\perf>devenv14
C:\dev\test\perf>java -version
java version "1.4.1-beta"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1-beta-b14)
Java HotSpot(TM) Client VM (build 1.4.1-beta-b14, mixed mode)

C:\dev\test\perf>javac *.java

C:\dev\test\perf>java -server test
total value: -487200000
total time: 1531
total value: -487200000
total time: 1203
total value: -487200000
total time: 1172
total value: -487200000
total time: 1156

tackline

> OK, I'm really a bit confused here - do we ever get guidelines on how to
> write good / fast code to suit HotSpot's / the compiler's needs?

The principal guideline appears to be: write good code.

Don't go around manually inlining code into large methods, or mixing different levels of abstraction, just because you think it will be more efficient. Microbenchmarks tend to break this rule, and so often give misleading results.

If you deviate from good code to make things faster, check that it actually is faster by running your application on your target environment.
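
As a rough sketch of what that "good code" tends to look like in this thread (the names CopySketch and copyOnce are illustrative only, not anything from the original posts): keep the hot loop in a small, single-purpose method instead of manually inlining it into a big method, and let HotSpot decide whether to inline it.

[code]
// Illustrative sketch only: the hot copy loop lives in its own small method.
// HotSpot can compile copyOnce as a normal (non-OSR) method and inline it at
// the call site if it decides that is profitable.
class CopySketch {
    static int copyOnce(byte[] src, char[] dst) {
        int total = 0;
        for (int j = 0; j < src.length; j++) {
            int b = src[j];
            total += b;
            dst[j] = (char) b;
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] src = new byte[10000];
        char[] dst = new char[10000];
        int total = 0;
        for (int i = 0; i < 100000; i++) {
            total += copyOnce(src, dst);
        }
        System.out.println(total); // keep the result live
    }
}
[/code]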

spdenne

I agree that writing good code is a start (so that compiler engineers can concentrate on optimising good code) and that microbenchmarks don't tell you much (the compiler could figure out that the bytecode has no external effect and skip it entirely). But this thread does show that something changed between Java releases, though it may not be entirely related to array copying.
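
As a minimal sketch of that point (it just mirrors what the original test already does): give the benchmarked work an observable result, so the JIT cannot prove the loops have no external effect and drop them.

[code]
// Minimal sketch: the total depends on every byte copied, and printing it
// keeps the loops "live" as far as the JIT is concerned.
class SinkSketch {
    public static void main(String[] args) {
        byte[] byteBuf = new byte[10000];
        char[] buf = new char[10000];
        for (int j = 0; j < byteBuf.length; j++) {
            byteBuf[j] = (byte) j; // fill the source, as in the original test
        }
        int total = 0;
        for (int i = 0; i < 100000; i++) {
            for (int j = 0; j < byteBuf.length; j++) {
                int b = byteBuf[j];
                total += b;
                buf[j] = (char) b;
            }
        }
        System.out.println("total value: " + total); // the sink
    }
}
[/code]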

The following analysis uses code compiled with the 1.5.0_06 javac, comparing the HotSpot server JVMs 1.5.0_06-b05 and 1.6.0-beta2-b83 on WinXP.

E.g. with this as the routine being timed (Tiger's javac):

private static int performCalc(byte[] byteBuf, char[] buf, int length) {
    int total = 0;
    for (int i = 0; i < 100000; i++) {
        for (int j = 0; j < length; j++) {
            int b = byteBuf[j];
            total += b;
            buf[j] = (char) b;
        }
    }
    return total;
}

Mustang is a great deal slower than Tiger.

If simple range checks are included:

private static int performCalc(byte[] byteBuf, char[] buf, int length) {
    int total = 0;
    // simple range checks (reconstructed; the comparisons were mangled in the original post)
    if (byteBuf.length < length) throw new RuntimeException();
    if (buf.length < length) throw new RuntimeException();
    for (int i = 0; i < 100000; i++) {
        for (int j = 0; j < length; j++) {
            int b = byteBuf[j];
            total += b;
            buf[j] = (char) b;
        }
    }
    return total;
}

Tiger runs ever so slightly slower, and Mustang speeds up a great deal, but still runs 10% slower than Tiger.

rossk

The problem is probably the OSR compilation that occurs to allow the transition from interpreted to compiled execution. This effect is most pronounced in microbenchmarks.

To mitigate it, write a warmup loop that exercises the code being timed before the timing loop. To make sure the code being measured is compiled before the timing loop runs, add a print statement after the warmup loop and add the -XX:+PrintCompilation option to the java command.

Running a modified version of the benchmark I get:

java version "1.4.2_12"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_12-b03)
Java HotSpot(TM) Client VM (build 1.4.2_12-b03, mixed mode)
1 test2::redo_copy (27 bytes)
1% test2::do_copy @ 4 (34 bytes)
2 test2::do_copy (34 bytes)
Warmup done. total = -487200000
total value: -487200000
total time: 1550
2% test2::main @ 169 (215 bytes)
total value: -487200000
total time: 1551
total value: -487200000
total time: 1677
total value: -487200000
total time: 1676

java version "1.5.0_08-ea"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_08-ea-b01)
Java HotSpot(TM) Server VM (build 1.5.0_08-ea-b01, mixed mode)
1 test2::redo_copy (27 bytes)
1% test2::do_copy @ 4 (34 bytes)
2 test2::do_copy (34 bytes)
Warmup done. total = -487200000
total value: -487200000
total time: 1618
total value: -487200000
total time: 1618
2% test2::main @ 169 (215 bytes)
total value: -487200000
total time: 1618
total value: -487200000
total time: 1619

java version "1.6.0-beta2"
Java(TM) SE Runtime Environment (build 1.6.0-beta2-b85)
Java HotSpot(TM) Server VM (build 1.6.0-beta2-b85, mixed mode)
1 test2::do_copy (34 bytes)
1% test2::do_copy @ 4 (34 bytes)
2 test2::redo_copy (27 bytes)
Warmup done. total = -487200000
total value: -487200000
total time: 1594
total value: -487200000
total time: 1594
2% test2::main @ 169 (215 bytes)
total value: -487200000
total time: 1594
total value: -487200000
total time: 1606

The script ran
java -server -XX:+PrintCompilation test2
over the various jdk versions.

The modified test is:

class test2 {

    public static int do_copy(byte[] byteBuf, char[] buf) {
        int total = 0;

        for ( int j = 0; j < byteBuf.length; j++ ) {
            int b = byteBuf[j];

            total += b;

            buf[j] = (char)b;
        }
        return total;
    }

    public static int redo_copy(byte[] byteBuf, char[] buf, int redo) {
        int total = 0;

        for ( int i = 0; i < redo; i++ ) {
            total += do_copy(byteBuf, buf);
        }
        return total;
    }

    public static void main( String[] args ) throws Exception {
        byte[] byteBuf = new byte[10000];
        char[] buf = new char[10000];

        // fill the byte buffer.
        for ( int j = 0; j < byteBuf.length; j++ ) {
            byteBuf[j] = (byte)j;
        }

        // Warmup copy loops.
        int total = 0;
        for (int i = 0; i < 1000; i++ ) {
            total += redo_copy(byteBuf, buf, 100);
        }
        System.out.println("Warmup done. total = "+total);

        for ( int x = 0; x < 4; x++ ) {
            long start = System.currentTimeMillis();

            total = redo_copy(byteBuf, buf, 100000);

            long end = System.currentTimeMillis();

            System.out.println("total value: " + total);
            System.out.println("total time: " + (end - start));

            // confirm that bytes were copied.
            for ( int j = 0; j < byteBuf.length; j++ ) {
                if ( byteBuf[j] != (byte)buf[j] ) {
                    throw new RuntimeException();
                }
            }

            Thread.sleep(500);
        }
    }
}

In the -XX:+PrintCompilation output, the line:

1% test2::do_copy @ 4 (34 bytes)

The % in "1%" means this is an OSR compilation.
The 4 in " @ 4" is the bytecode index at which interpreted execution switches to compiled code.

In test2.java, the PrintCompilation line:

2 test2::do_copy (34 bytes)

shows that a normal (non-OSR) compilation of the code being timed occurred before the end of the warmup period, pretty much guaranteeing that the timing portion of the test will invoke this version.

Here's the original test case run with PrintCompilation, showing that only OSR-compiled versions of the test method exist.

java version "1.4.2_12"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_12-b03)
Java HotSpot(TM) Client VM (build 1.4.2_12-b03, mixed mode)
1% test::main @ 59 (203 bytes)
total value: -487200000
total time: 2840
2% test::main @ 157 (203 bytes)
3% test::main @ 59 (203 bytes)
total value: -487200000
total time: 3420
total value: -487200000
total time: 3402
total value: -487200000
total time: 3403

java version "1.5.0_08-ea"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_08-ea-b01)
Java HotSpot(TM) Server VM (build 1.5.0_08-ea-b01, mixed mode)
1% test::main @ 59 (203 bytes)
total value: -487200000
total time: 6309
2% test::main @ 59 (203 bytes)
total value: -487200000
total time: 9711
total value: -487200000
total time: 9678
total value: -487200000
total time: 9679

java version "1.6.0-beta2"
Java(TM) SE Runtime Environment (build 1.6.0-beta2-b85)
Java HotSpot(TM) Server VM (build 1.6.0-beta2-b85, mixed mode)
1% test::main @ 59 (203 bytes)
1% made not entrant test::main @ 59 (203 bytes)
total value: -487200000
total time: 3932
2% test::main @ 59 (203 bytes)
total value: -487200000
total time: 11434
total value: -487200000
total time: 11406
total value: -487200000
total time: 11407

marko

Thanks for all the feedback. My code was originally written for 1.3, where manual inlining was needed for performance. The same was true for 1.4, but since then it seems that you get a penalty for manually inlining hotspots (at least most of the time). I agree with one of the other posters that it would be nice if there were a more deterministic process for working towards optimal code performance, rather than relying on trial-and-error code transformations.

It is also really odd that the bytecode generated by the compiler can have such a significant effect on VM runtime performance. My code runs about 10% faster using the Eclipse compiler.

Thanks again.

trembovetski

> My code is running about 10% faster using the Eclipse compiler.

That is indeed weird. Could you try compiling with Sun javac -source 1.5 -target 1.5?

Dmitri

marko

Here is an example where javac is faster.

class test
{
    static final byte[] byteBuf = new byte[10000];
    static final char[] buf = new char[10000];

    public static void main( String[] args ) throws Exception
    {
        for ( int x = 0; x < 4; x++ )
        {
            long start = System.currentTimeMillis();

            for ( int i = 0; i < 100000; i++ )
            {
                work(byteBuf, buf, byteBuf.length);
            }

            System.out.println( "total time: " + (System.currentTimeMillis() - start) );
        }
    }

    private static int work(byte[] byteBuf, char[] buf, int length)
    {
        int total = 0;
        for (int j = 0; j < length; j++) {
            int b = byteBuf[j];
            total += b;
            buf[j] = (char) b;
        }
        return total;
    }
}

C:\dev\test\perf>javac -source 1.5 -target 1.5 *.java

C:\dev\test\perf>java -server test
total time: 1906
total time: 1328
total time: 1312
total time: 1297

C:\dev\test\perf>java -jar c:\temp\org.eclipse.jdt.core_3.1.0.jar *.java

C:\dev\test\perf>java -server test
total time: 1921
total time: 1750
total time: 1719
total time: 1750

But more often I can get the Eclipse compiler to produce faster code.

E.g.:

class test
{
    static final byte[] byteBuf = new byte[10000];
    static final char[] buf = new char[10000];

    public static void main( String[] args ) throws Exception
    {
        for ( int x = 0; x < 4; x++ )
        {
            long start = System.currentTimeMillis();

            for ( int i = 0; i < 100000; i++ )
            {
                work(1000, 100);
            }

            System.out.println( "total time: " + (System.currentTimeMillis() - start) );
        }
    }

    public static int work(int byteIndex, int offset)
    {
        if ( (byteIndex-offset) < 0 )
            throw new RuntimeException();

        for ( ; byteIndex < byteBuf.length; byteIndex++ )
        {
            if ( byteBuf[byteIndex] < 6 )
            {
                if ( byteBuf[byteIndex] == 5 )
                {
                    buf[byteIndex-offset] = '\n';
                }
            }
            buf[byteIndex-offset] = (char)byteBuf[byteIndex];
        }

        return 0;
    }
}

C:\dev\test\perf>java -server test
total time: 1469
total time: 2515
total time: 2500
total time: 2485

C:\dev\test\perf>java -jar c:\temp\org.eclipse.jdt.core_3.1.0.jar *.java

C:\dev\test\perf>java -server test
total time: 1453
total time: 1500
total time: 1453
total time: 1454

sjasja

> C:\dev\test\perf>java -server test
> total time: 1469
> total time: 2515
> total time: 2500
> total time: 2485

I see this effect too. Running with -XX:+PrintCompilation, it looks like main() gets compiled twice, and after the second compilation it gets slower. Splitting the contents of the "for ( int x..." loop into a separate method makes the slowdown disappear.

trembovetski

Build 59g is ancient. Could you try with the latest from http://download.java.net/jdk6/binaries/ ?

Dmitri

alexlamsl

Here are my times on various JDKs using your test code ;-)

Mustang b82:

total value: -487200000
total time: 3525
total value: -487200000
total time: 3274
total value: -487200000
total time: 3274
total value: -487200000
total time: 3274

Java 5 update 6:

total value: -487200000
total time: 4856
total value: -487200000
total time: 4260
total value: -487200000
total time: 4433
total value: -487200000
total time: 4370

Java 1.3.1_17 b02:

total value: -487200000
total time: 6078
total value: -487200000
total time: 5984
total value: -487200000
total time: 5905
total value: -487200000
total time: 5874

So my conclusion with these results would be that the Client HotSpot is doing quite well in both Tiger and Mustang ;-)

alexlamsl

Here are some further results with the Server HotSpot.

1.3.1_17 b02:

total value: -487200000
total time: 2788
total value: -487200000
total time: 4840
total value: -487200000
total time: 4825
total value: -487200000
total time: 4825

5.0u6:

total value: -487200000
total time: 4308
total value: -487200000
total time: 3806
total value: -487200000
total time: 3806
total value: -487200000
total time: 3791

Mustang b82:

total value: -487200000
total time: 4715
total value: -487200000
total time: 8067
total value: -487200000
total time: 7942
total value: -487200000
total time: 7926

So Tiger is fine, but Mustang does have a major slowdown :-\

alexlamsl

I've used the following (simpler?) test code instead:
[code]
class Main {
    public static final int LENGTH = 10000;

    public static void main( String[] args ) {
        byte[] byteBuf = new byte[LENGTH];
        char[] buf = new char[LENGTH];

        /* fill the byte buffer */
        for (int j = 0; j < LENGTH; j++ ) {
            byteBuf[j] = (byte)j;
        }

        long start;
        for ( int x = 0, i, j; x < 4; x++ ) {
            start = System.currentTimeMillis();
            for ( i = 0; i < 100000; i++ ) {
                for ( j = 0; j < LENGTH; j++ ) {
                    buf[j] = (char) byteBuf[j];
                }
            }
            start = System.currentTimeMillis() - start;
            System.out.println("total time: " + start);
        }
    }
}
[/code]
And here are the results...

1.3.1_17 b02:

total time: 2616
total time: 3101
total time: 3086
total time: 3102

5.0u6:

total time: 4307
total time: 7127
total time: 7096
total time: 7111

Mustang b82:

total time: 4339
total time: 7331
total time: 7237
total time: 7252

So my conclusion seems to be that the problem was introduced after 1.3.1 and before Mustang :-\

spdenne

The Mustang (b82) server JVM runs the test faster if byteBuf.length is used in the for-loop condition instead of LENGTH.
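
I.e., a minimal sketch of just the inner loop, using the LENGTH, byteBuf and buf names from the Main class above (the comments are my reading of why it might matter, not anything measured beyond the timings in this thread):

[code]
// Slower for me on the Mustang b82 server JVM: the loop bound is the
// compile-time constant LENGTH.
for (int j = 0; j < LENGTH; j++) {
    buf[j] = (char) byteBuf[j];
}

// Faster: the loop bound is byteBuf.length, which presumably makes it easier
// for HotSpot to prove the accesses stay in bounds and drop the per-element
// range checks.
for (int j = 0; j < byteBuf.length; j++) {
    buf[j] = (char) byteBuf[j];
}
[/code]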

spdenne

Another huge speedup (on the server JVMs) is obtained by making byteBuf and buf static fields in the class.
Mustang shows a further difference depending on whether those fields are final or not.

alexlamsl

Changed in accordance with your suggestions ([b]public[/b] / [b]private static final[/b] makes no noticeable difference):
[code]
class Main {
    private static final int LENGTH = 10000;
    private static final byte[] byteBuf = new byte[LENGTH];
    private static final char[] buf = new char[LENGTH];

    public static void main( String[] args ) throws Exception {

        /* fill the byte buffer */
        for (int j = 0; j < byteBuf.length; j++ ) {
            byteBuf[j] = (byte)j;
        }

        long start;
        for ( int x = 0, i, j; x < 4; x++ ) {
            start = System.currentTimeMillis();
            for ( i = 0; i < 100000; i++ ) {
                for ( j = 0; j < byteBuf.length; j++ ) {
                    buf[j] = (char) byteBuf[j];
                }
            }
            start = System.currentTimeMillis() - start;
            System.out.println("total time: " + start);
        }
    }
}
[/code]
Here are the times (Server HotSpot):

Mustang b82:

total time: 3312
total time: 5328
total time: 5110
total time: 5093

5.0u6:

total time: 3063
total time: 5109
total time: 4938
total time: 4937

1.3.1_17 b02:

total time: 2390
total time: 3516
total time: 3484
total time: 3547

So the newer HotSpots are still slower.

One thing I've noticed is that the optimisation(s) HotSpot applies to this code don't seem to do any good at all - yet the code is not "unoptimised" to give back the raw (read: better) performance.

The test above also shows that some optimisations that used to be there have been taken out of Tiger and Mustang. Does remind me of the term "regression" somehow.... :-\

alexlamsl

Same code, running 50 times:

1.3.1_17 b02:

2359 3422 2156 3688 3656 3672 3657 3671 3672 3688
3687 3672 3656 3672 3657 3671 3719 3656 3672 3688
3656 3672 3656 3735 3671 3672 3672 3703 3672 3656
3672 3672 3672 3656 3672 3672 3656 3672 3656 3672
3672 3656 3672 3672 3656 3688 3812 3672 3672 3672

5.0u6:

2984 2875 3844 3844 3844 3843 3844 3828 3844 3844
3828 3828 3875 3844 3844 3843 3828 3844 3828 3891
3828 3844 3844 3828 3843 3829 3859 3828 3844 3859
3844 3844 3828 3828 3844 3843 3829 3828 3843 3844
3828 3844 3969 3828 3844 3828 3875 3859 3844 3828

Mustang b82:

3125 3562 3578 3563 3609 3594 3563 3562 3563 3531
3562 3547 3578 3532 3562 3547 3531 3563 3547 3562
3547 3562 3547 3547 3578 3610 3562 3547 3563 3562
3547 3547 3547 3547 3562 3547 3562 3563 3562 3547
3547 3547 3578 3563 3547 3546 3563 3547 3547 3547

So it looks like the right kinds of optimisations have taken place in this case! Here's the code, just to show that nothing important was modified:
[code]
class Main {
    private static final int LENGTH = 10000;
    private static final byte[] byteBuf = new byte[LENGTH];
    private static final char[] buf = new char[LENGTH];

    public static void main( String[] args ) {

        /* fill the byte buffer */
        for (int j = 0; j < byteBuf.length; j++ ) {
            byteBuf[j] = (byte)j;
        }

        long start;
        for ( int x = 0, i, j; x < 50; x++ ) {
            start = System.currentTimeMillis();
            for ( i = 0; i < 100000; i++ ) {
                for ( j = 0; j < byteBuf.length; j++ ) {
                    buf[j] = (char) byteBuf[j];
                }
            }
            start = System.currentTimeMillis() - start;
            if (x % 10 == 0) {
                System.out.println();
                System.out.print(start);
            } else {
                System.out.print(" " + start);
            }
        }
    }
}
[/code]
Now I change "x < 50" to "x < 4", and here are the times:

Mustang b82:
2985 3547 3562 3703

5.0u6:
3172 5031 2891 2906

1.3.1_17 b02:
2437 3438 2156 3703

Hmmm.... so something did change after all! :-/
Why is HotSpot so difficult (unintuitive) to get along with?

spdenne

I'm sorry, my analysis was done without recompiling with the various JDKs' javac, but rather with Eclipse. Interestingly, while javac produces code that the Mustang JVM processes in 5 to 8 seconds per loop for me, Eclipse produces code that the Mustang JVM processes in 0.9 seconds per loop.

alexlamsl

So you mean the bytecode generated by Eclipse runs in 0.9 seconds on Sun's HotSpot?!

Here are the bytecodes generated for the main method using Mustang b82 btw:
[pre]
0 iconst_0
1 istore_1
2 iload_1
3 getstatic #2
6 arraylength
7 if_icmpge 23 (+16)
10 getstatic #2
13 iload_1
14 iload_1
15 i2b
16 bastore
17 iinc 1 by 1
20 goto 2 (-18)
23 iconst_0
24 istore_3
25 iload_3
26 iconst_4
27 if_icmpge 141 (+114)
30 invokestatic #3
33 lstore_1
34 iconst_0
35 istore 4
37 iload 4
39 ldc #4 <100000>
41 if_icmpge 81 (+40)
44 iconst_0
45 istore 5
47 iload 5
49 getstatic #2
52 arraylength
53 if_icmpge 75 (+22)
56 getstatic #5
59 iload 5
61 getstatic #2
64 iload 5
66 baload
67 i2c
68 castore
69 iinc 5 by 1
72 goto 47 (-25)
75 iinc 4 by 1
78 goto 37 (-41)
81 invokestatic #3
84 lload_1
85 lsub
86 lstore_1
87 iload_3
88 bipush 10
90 irem
91 ifne 110 (+19)
94 getstatic #6
97 invokevirtual #7
100 getstatic #6
103 lload_1
104 invokevirtual #8
107 goto 135 (+28)
110 getstatic #6
113 new #9
116 dup
117 invokespecial #10 >
120 ldc #11 < >
122 invokevirtual #12
125 lload_1
126 invokevirtual #13
129 invokevirtual #14
132 invokevirtual #15
135 iinc 3 by 1
138 goto 25 (-113)
141 return
[/pre]

sjasja

I occasionally get different results from microbenchmarks if I break the benchmark loop into a separate method (on-stack replacement or whatever, "made not entrant" with -XX:+PrintCompilation, zombies?). With this program I see a pretty big difference (on ia32 Windows). Try this:
[code]
public class BufCopy
{
    private static final int LENGTH = 10000;
    private static final byte[] byteBuf = new byte[LENGTH];
    private static final char[] buf = new char[LENGTH];

    public static void main( String[] args ) {
        /* fill the byte buffer */
        for (int j = 0; j < byteBuf.length; j++ ) {
            byteBuf[j] = (byte)j;
        }
        for ( int x = 0; x < 20; x++ )
            doit(x);
    }

    static void doit(int x)
    {
        long start = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++ ) {
            for (int j = 0; j < byteBuf.length; j++ ) {
                buf[j] = (char) byteBuf[j];
            }
        }
        start = System.currentTimeMillis() - start;
        if (x % 10 == 0) {
            System.out.println();
            System.out.print(start);
        } else {
            // Commenting out the PRINT-A line and uncommenting PRINT-B
            // occasionally makes a bit of a difference for me... YMMV...
            System.out.print(" " + start); // PRINT-A
            //System.out.print(" "); System.out.print(start); // PRINT-B
        }
    }
}
[/code]

alexlamsl

Amazing - I'm now getting just under 0.8s performance!

OK, I'm really a bit confused here - do we ever get guidelines on how to write good / fast code to suit HotSpot's / the compiler's needs?

I mean, this is not just a rant - the seemingly indeterministic yet significant fluctuations in performance from seemingly harmless and slight changes in code could be a reliable cause of the impression that "Java is slow".

And to be honest, without any guidelines / recommended practices I would say this is not a misconception at all, because we programmers aren't doing a single thing wrong here!

Anyway, enough ranting - here are the bytecodes from the Mustang b82 compiler:
[pre]
0 invokestatic #4
3 lstore_1
4 iconst_0
5 istore_3
6 iload_3
7 ldc #5 <100000>
9 if_icmpge 49 (+40)
12 iconst_0
13 istore 4
15 iload 4
17 getstatic #2
20 arraylength
21 if_icmpge 43 (+22)
24 getstatic #6
27 iload 4
29 getstatic #2
32 iload 4
34 baload
35 i2c
36 castore
37 iinc 4 by 1
40 goto 15 (-25)
43 iinc 3 by 1
46 goto 6 (-40)
49 invokestatic #4
52 lload_1
53 lsub
54 lstore_1
55 getstatic #7
58 iload_0
59 bipush 10
61 irem
62 ifne 70 (+8)
65 ldc #8 <>
67 goto 72 (+5)
70 ldc #9 < >
72 invokevirtual #10
75 getstatic #7
78 lload_1
79 invokevirtual #11
82 return
[/pre]

alexlamsl

Notice that the former version contains 78 - 33 = 45 bytes of bytecode for the benchmark loop, whereas the latter version has 46 - 3 = 43 bytes.

So they look similar at first glance...
[pre]
34 iconst_0 | 4 iconst_0
35 istore 4 | 5 istore_3
37 iload 4 | 6 iload_3
39 ldc #4 <100000> | 7 ldc #5 <100000>
41 if_icmpge 81 (+40) | 9 if_icmpge 49 (+40)
44 iconst_0 |12 iconst_0
45 istore 5 |13 istore 4
47 iload 5 |15 iload 4
49 getstatic #2 |17 getstatic #2
52 arraylength |20 arraylength
53 if_icmpge 75 (+22) |21 if_icmpge 43 (+22)
56 getstatic #5 |24 getstatic #6
59 iload 5 |27 iload 4
61 getstatic #2 |29 getstatic #2
64 iload 5 |32 iload 4
66 baload |34 baload
67 i2c |35 i2c
68 castore |36 castore
69 iinc 5 by 1 |37 iinc 4 by 1
72 goto 47 (-25) |40 goto 15 (-25)
75 iinc 4 by 1 |43 iinc 3 by 1
78 goto 37 (-41) |46 goto 6 (-40)
[/pre]