Skip to main content

Performance difference between Linux & Windows

15 replies [Last post]
mipper
Offline
Joined: 2007-06-27
Points: 0

I'm having a lot of trouble reconciling the performance of Java on Windows versus Linux. The performance of a simple looping procedure appears to be an order of magnitude slower on Linux compared to the same code running on Windows.

Both tests were using JDK 1.5.0_07 and the Linux version had -server included on the command line. I ran the tests on a single dual boot machine runing Windows 2000 professional and Kubuntu 7.04 (AMD64 Kernel 2.6.20). The machine is a AMD Athlon 64 X2 4200 with 2Gb of RAM.

There are two test each of which loops around appending to, you guessed it, either a String or a StringBuffer. I read somewhere on the web that getting the system time to the nano-second was expensive on Linux, so I've written two versions of the loop. One which times each individual call inside to loop, and one which only records the total time spent inside the loop. When taking the timing of every call I take averages of the total time taken inside the loop which includes one call to the clock, and also the total time spend inside the loop which will include both calls to the clock for each iteration.

I run the entire set of test 100 times to try to ensure that HotSpot has the opportunity to optimise the code where required, and you can certainly see two distinct points where it would appear it's kicking in.

So the big question: Why does the code run so incrediblly slowly under Linux?

Taking the run of the tests that was done 5th from the end in each case we get the following results which are pretty representitive of the results from each iteration:

L everyCall FINISH: BUFFER Total: 0.455 ms, Avg: 2,480/4,550 ns
W everyCall FINISH: BUFFER Total: 0.07 ms, Avg: 400/703 ns

L onlyOutside FINISH: BUFFER Total: 0.061 ms, Avg: 610 ns
W onlyOutside FINISH: BUFFER Total: 0.008 ms, Avg: 80 ns

L everyCall FINISH: STRING Total: 0.713 ms, Avg: 4,620/7,130 ns
W everyCall FINISH: STRING Total: 0.172 ms, Avg: 1,416/1,720 ns

L onlyOutside FINISH: STRING Total: 0.261 ms, Avg: 2,610 ns
W onlyOutside FINISH: STRING Total: 0.097 ms, Avg: 970 ns

And the code:

import java.text.MessageFormat;

public class Timing
{

public static void main ( String[] args )
{
Timing t = new Timing ();
for ( int i = 0; i < 100; i++ )
{
t.everyCall ( "BUFFER", t.new BufCommand () );
t.onlyOutside ( "BUFFER", t.new BufCommand () );
t.everyCall ( "STRING", t.new StrCommand () );
t.onlyOutside ( "STRING", t.new StrCommand () );
System.out.println ( "DONE\n" );
}
}

private void everyCall ( String desc, Command cmd )
{
long total = 0;
long oursideStart = System.nanoTime ();
for ( int i = 0; i < COUNT; i++ )
{
long start = System.nanoTime ();
cmd.doStuff ( i );
total += System.nanoTime () - start;
}
long diff = System.nanoTime () - oursideStart;
System.out.println ( MessageFormat.format ( "everyCall FINISH: {0} Total: {1} ms, Avg: {2}/{3} ns",
desc,
diff / 1000000.0,
total / COUNT,
diff / COUNT ) );
}

private void onlyOutside ( String desc, Command cmd )
{
long start = System.nanoTime ();
for ( int i = 0; i < COUNT; i++ )
{
cmd.doStuff ( i);
}
long diff = System.nanoTime () - start;
System.out.println ( MessageFormat.format ( "onlyOutside FINISH: {0} Total: {1} ms, Avg: {2} ns",
desc,
diff / 1000000.0,
diff / COUNT ) );
}

abstract class Command
{
public abstract void doStuff ( int i );
}

class BufCommand extends Command
{
private StringBuffer _buf = new StringBuffer ();

public void doStuff ( int i )
{
_buf.append ( i );
}
}

class StrCommand extends Command
{
private String _buf = "";

public void doStuff ( int i )
{
_buf = _buf + i;
}
}

private static final int COUNT = 1000;

}

Regards,

mipper

Reply viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
briand
Offline
Joined: 2005-07-11
Points: 0

I missed that 0..100 loop in main() . Also, since _buf is a reference
and we are calling a method on that object, the optimizer is probably not
going to throw that away - it may have other side effects. So, ignore all that -
my bad - sorry. I'm too used to looking at code with these types of issues
and sometimes I see what I want to see :-)

The nanoTime() calls on linux will result in a call to gettimeofday() or to
clock_gettime(), if available. gettimeofday() is known to have some serious
overhead associated with it. clock_gettime() is found in librt.so, but not all
distributions support clock_gettime(). I think you'll see lower overhead
for nanoTime() on those that do.

Your nanotime() overhead actually looks pretty good, so I suspect you are
picking up clock_gettime(). As you can see, though, you are paying a pretty
high cost for calling nanoTime() from within the loop. Also, if the accuracy of
the time source is poor, you may be accentuating that by calling nanoTime()
from within the loop. The onlyOutside() implementation is of more value, IMHO.

What version of javac did you use to compile this code?

There may be some difference in allocation patterns between BufCommand
and StrCommand. There may also be some difference in heap configurations
between Windows and Linux. Can you collect some verbosegc data from
these experiments? I'd suggest the following options:

-verbosegc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails

Also, full command line options used for both platforms would help too.

Brian

mipper
Offline
Joined: 2007-06-27
Points: 0

Again, thank you to both of you for your input.

It's almost impossible to document the whole range of things that you've tried before posting to a board when trying to get to the root of a problem. I've looked into garbage collection being a factor and had ruled it out as having any impact on my tests very soon after putting the test case together. (After all gc is always the first thing we blame for performance issues ;o) I sized the VM in such a way that it required very little in the way of garbage collection by fixing the allocation of heap memory and also fixing the size of the young generation to ensure that I could get multiple runs through without gc kicking in.

For completeness I'll document two runs of a cut down test run to make the output manageable on the forum. I can't, alas, run them on the same hardware at work but I don't believe the hardware explains the issue since the Xeon I'm runing the linux test on produces better results than the Athlon 64 from my original test.

Both machines are using JDK 1.5.0_07 32bit.

The code:
[code]
import java.text.MessageFormat;

public class Timing
{

public static void main ( String[] args )
{
Timing t = new Timing ();
for ( int i = 0; i < 100; i++ )
{
t.onlyOutside ( "BUFFER", t.new BufCommand () );
}
}

private void onlyOutside ( String desc, Command cmd )
{
long start = System.nanoTime ();
for ( int i = 0; i < COUNT; i++ )
{
cmd.doStuff ( i);
}
long diff = System.nanoTime () - start;
System.out.println ( MessageFormat.format ( "onlyOutside Len: {3}, FINISH: {0} Total: {1} ms, Avg: {2} ns",
desc,
diff / 1000000.0,
diff / COUNT,
cmd.getString ().length () ) );
}

abstract class Command
{
public abstract void doStuff ( int i );
public abstract String getString ();
}

class BufCommand extends Command
{
private StringBuffer _buf = new StringBuffer ();

@Override
public void doStuff ( int i )
{
_buf.append ( i );
}

public String getString ()
{
return _buf.toString ();
}
}

private static final int COUNT = 1000;

}
[/code]

Windows XP, Pentium 4 3.60GHz, 1Gb
java -server -verbosegc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+TraceClassUnloading -Xms512m -Xmx512m -XX:NewSize=200m -XX:MaxNewSize=200m Timing

onlyOutside Len: 2,890, FINISH: BUFFER Total: 3.684 ms, Avg: 3,684 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 1.522 ms, Avg: 1,521 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 1.487 ms, Avg: 1,487 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 5.126 ms, Avg: 5,126 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 1.502 ms, Avg: 1,502 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 1.476 ms, Avg: 1,476 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 14.777 ms, Avg: 14,777 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.139 ms, Avg: 138 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.123 ms, Avg: 122 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.123 ms, Avg: 123 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.127 ms, Avg: 126 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.128 ms, Avg: 127 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.105 ms, Avg: 104 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.068 ms, Avg: 68 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.076 ms, Avg: 75 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.068 ms, Avg: 68 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.124 ms, Avg: 124 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.067 ms, Avg: 66 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.069 ms, Avg: 69 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.068 ms, Avg: 67 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.069 ms, Avg: 69 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.071 ms, Avg: 71 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.068 ms, Avg: 68 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.073 ms, Avg: 73 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.068 ms, Avg: 67 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.068 ms, Avg: 68 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.069 ms, Avg: 69 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.067 ms, Avg: 67 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.071 ms, Avg: 70 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.068 ms, Avg: 68 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.076 ms, Avg: 75 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.069 ms, Avg: 69 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.08 ms, Avg: 79 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.073 ms, Avg: 73 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.078 ms, Avg: 77 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.073 ms, Avg: 72 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.073 ms, Avg: 73 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.077 ms, Avg: 76 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.069 ms, Avg: 68 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.078 ms, Avg: 77 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.076 ms, Avg: 76 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.074 ms, Avg: 74 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.078 ms, Avg: 77 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.077 ms, Avg: 76 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.075 ms, Avg: 74 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.078 ms, Avg: 78 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.074 ms, Avg: 73 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.072 ms, Avg: 72 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.083 ms, Avg: 82 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.074 ms, Avg: 73 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.074 ms, Avg: 74 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.076 ms, Avg: 75 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.075 ms, Avg: 74 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.077 ms, Avg: 77 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.079 ms, Avg: 79 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.073 ms, Avg: 72 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.078 ms, Avg: 77 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.074 ms, Avg: 74 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.075 ms, Avg: 75 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.077 ms, Avg: 76 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.074 ms, Avg: 73 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.075 ms, Avg: 74 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.077 ms, Avg: 77 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.076 ms, Avg: 75 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.077 ms, Avg: 77 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.077 ms, Avg: 77 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.074 ms, Avg: 74 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.082 ms, Avg: 82 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.073 ms, Avg: 73 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.075 ms, Avg: 74 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.077 ms, Avg: 76 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.074 ms, Avg: 73 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.075 ms, Avg: 74 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.078 ms, Avg: 77 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.079 ms, Avg: 78 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.077 ms, Avg: 76 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.076 ms, Avg: 76 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.074 ms, Avg: 74 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.077 ms, Avg: 76 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.074 ms, Avg: 73 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.075 ms, Avg: 74 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.078 ms, Avg: 77 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.074 ms, Avg: 73 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.074 ms, Avg: 74 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.077 ms, Avg: 77 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.074 ms, Avg: 74 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.078 ms, Avg: 77 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.08 ms, Avg: 79 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.121 ms, Avg: 120 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.127 ms, Avg: 126 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.105 ms, Avg: 105 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.078 ms, Avg: 77 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.077 ms, Avg: 76 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.073 ms, Avg: 72 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.073 ms, Avg: 72 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.134 ms, Avg: 134 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.121 ms, Avg: 120 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.115 ms, Avg: 114 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.112 ms, Avg: 112 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.075 ms, Avg: 74 ns

Linux 2.4.21, Quad Xeon 2.40GHz, 4Gb
java -server -verbosegc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+TraceClassUnloading -Xms512m -Xmx512m -XX:NewSize=200m -XX:MaxNewSize=200m Timing

onlyOutside Len: 2,890, FINISH: BUFFER Total: 1.836 ms, Avg: 1,836 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 1.891 ms, Avg: 1,891 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 1.837 ms, Avg: 1,837 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 2.345 ms, Avg: 2,345 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 1.917 ms, Avg: 1,917 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 2.615 ms, Avg: 2,615 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 1.982 ms, Avg: 1,982 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 1.825 ms, Avg: 1,825 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 2.369 ms, Avg: 2,369 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 1.864 ms, Avg: 1,864 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 1.822 ms, Avg: 1,822 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 1.808 ms, Avg: 1,808 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 1.765 ms, Avg: 1,765 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 1.818 ms, Avg: 1,818 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 1.748 ms, Avg: 1,748 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 1.799 ms, Avg: 1,799 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.441 ms, Avg: 441 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.408 ms, Avg: 408 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.394 ms, Avg: 394 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.409 ms, Avg: 409 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.399 ms, Avg: 399 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.361 ms, Avg: 361 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.385 ms, Avg: 385 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.358 ms, Avg: 358 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.376 ms, Avg: 376 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.383 ms, Avg: 383 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.311 ms, Avg: 311 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.308 ms, Avg: 308 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.302 ms, Avg: 302 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.291 ms, Avg: 291 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.297 ms, Avg: 297 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.306 ms, Avg: 306 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.275 ms, Avg: 275 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.287 ms, Avg: 287 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.291 ms, Avg: 291 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.282 ms, Avg: 282 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.291 ms, Avg: 291 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.277 ms, Avg: 277 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.296 ms, Avg: 296 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.303 ms, Avg: 303 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.292 ms, Avg: 292 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.301 ms, Avg: 301 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.276 ms, Avg: 276 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.288 ms, Avg: 288 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.288 ms, Avg: 288 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.284 ms, Avg: 284 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.28 ms, Avg: 280 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.295 ms, Avg: 295 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.293 ms, Avg: 293 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.307 ms, Avg: 307 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.313 ms, Avg: 313 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.308 ms, Avg: 308 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.316 ms, Avg: 316 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.306 ms, Avg: 306 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.297 ms, Avg: 297 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.309 ms, Avg: 309 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.309 ms, Avg: 309 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.312 ms, Avg: 312 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.303 ms, Avg: 303 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.29 ms, Avg: 290 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.328 ms, Avg: 328 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.34 ms, Avg: 340 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.311 ms, Avg: 311 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.295 ms, Avg: 295 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.294 ms, Avg: 294 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.32 ms, Avg: 320 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.332 ms, Avg: 332 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.33 ms, Avg: 330 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.342 ms, Avg: 342 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.31 ms, Avg: 310 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.292 ms, Avg: 292 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.299 ms, Avg: 299 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.284 ms, Avg: 284 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.278 ms, Avg: 278 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.354 ms, Avg: 354 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.287 ms, Avg: 287 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.292 ms, Avg: 292 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.325 ms, Avg: 325 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.3 ms, Avg: 300 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.312 ms, Avg: 312 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.302 ms, Avg: 302 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.281 ms, Avg: 281 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.301 ms, Avg: 301 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.291 ms, Avg: 291 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.29 ms, Avg: 290 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.312 ms, Avg: 312 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.288 ms, Avg: 288 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.327 ms, Avg: 327 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.323 ms, Avg: 323 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.318 ms, Avg: 318 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.323 ms, Avg: 323 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.311 ms, Avg: 311 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.29 ms, Avg: 290 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.301 ms, Avg: 301 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.282 ms, Avg: 282 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.299 ms, Avg: 299 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.304 ms, Avg: 304 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.305 ms, Avg: 305 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.305 ms, Avg: 305 ns
onlyOutside Len: 2,890, FINISH: BUFFER Total: 0.357 ms, Avg: 357 ns

briand
Offline
Joined: 2005-07-11
Points: 0

Here's what I suggest at this point:

- Change StringBuffer to StringBuilder and collect some additional results. When I run this on a Solaris 10 amd64 machine, I'm seeing better results with StringBuilder, so I'm curious to see what you get.

- With both the original StringBuffer version and the StringBuilder version, change COUNT to 1000000 and then run with the following:

[code]
java -server -Xrunhprof:cpu=samples -Xms512m -Xmx512m -XX:NewSize=200m -XX:MaxNewSize=200m Timing
[/code]

Changing count puts more emphasis on the inner loop and reduces the ratio of cruft to good stuff. However, it also results in more expansions of the underlying char[] and therefore puts a little pressure on the garbage collector. By putting more emphasis on the inner loop, the profiles collected by hprof will be more revealing.

Send me the resulting java.hprof.txt files from both windows and linux. email address is
brian d.o.t. doherty a@t sun d.o.t. com.

- I'm a little curious about the performance blips that occur after the JIT has warmed things up. I suspect they are a result of first-touches of heap pages. JDK 6 has a feature that touches all the pages of the heap at startup, which helps avoid these transients. So, if you don't mind, try a run (with the StringBuffer version of the COUNT=1000 code) with JDK 6 and the -XX:+AlwaysPreTouch argument added to the command line. I think those blips will go away with this feature.

mipper
Offline
Joined: 2007-06-27
Points: 0

FIrstly my apologies for the delay in replying. It was due to an unforseen absence from work.

I've made the changes you suggested and re-run the test on both platforms. The command line in both cases was:

java -server -Xrunhprof:cpu=samples -Xms512m -Xmx512m -XX:NewSize=200m -XX:MaxNewSize=200m Timing

I'm now going to try to find the time to have a look at them and figure out how to read them and what they mean. In the meantime, I look forward to your interpretation of the data.

----------------------------------------
LINUX

[code]

onlyOutside Len: 5,888,890, FINISH: BUFFER Total: 292.083 ms, Avg: 292 ns
onlyOutside Len: 5,888,890, FINISH: BUILDER Total: 160.988 ms, Avg: 160 ns
onlyOutside Len: 5,888,890, FINISH: BUFFER Total: 286.058 ms, Avg: 286 ns
onlyOutside Len: 5,888,890, FINISH: BUILDER Total: 151.379 ms, Avg: 151 ns
onlyOutside Len: 5,888,890, FINISH: BUFFER Total: 288.443 ms, Avg: 288 ns
onlyOutside Len: 5,888,890, FINISH: BUILDER Total: 161.65 ms, Avg: 161 ns
onlyOutside Len: 5,888,890, FINISH: BUFFER Total: 278.429 ms, Avg: 278 ns
onlyOutside Len: 5,888,890, FINISH: BUILDER Total: 184.804 ms, Avg: 184 ns
onlyOutside Len: 5,888,890, FINISH: BUFFER Total: 288.687 ms, Avg: 288 ns
onlyOutside Len: 5,888,890, FINISH: BUILDER Total: 156.523 ms, Avg: 156 ns

[/code]

[code]
CPU SAMPLES BEGIN (total = 2465) Fri Jul 6 10:02:15 2007
rank self accum count trace method
1 16.23% 16.23% 400 300041 java.lang.AbstractStringBuilder.stringSizeOfInt
2 12.01% 28.24% 296 300042 Timing.onlyOutside
3 9.49% 37.73% 234 300112 java.lang.AbstractStringBuilder.expandCapacity
4 9.37% 47.10% 231 300039 java.lang.AbstractStringBuilder.expandCapacity
5 8.88% 55.98% 219 300044 java.lang.Integer.getChars
6 6.49% 62.47% 160 300113 java.lang.Integer.getChars
7 6.25% 68.72% 154 300114 java.lang.System.arraycopy
8 5.56% 74.28% 137 300043 java.lang.System.arraycopy
9 4.10% 78.38% 101 300111 java.lang.AbstractStringBuilder.stringSizeOfInt
10 3.89% 82.27% 96 300117 java.lang.System.arraycopy
11 3.73% 86.00% 92 300046 java.lang.System.arraycopy
12 3.53% 89.53% 87 300122 java.lang.Integer.getChars
13 3.45% 92.98% 85 300116 java.lang.String.
14 3.20% 96.19% 79 300040 java.lang.Integer.getChars
15 2.76% 98.95% 68 300045 java.lang.String.
16 0.28% 99.23% 7 300121 Timing.onlyOutside
17 0.08% 99.31% 2 300038 Timing.onlyOutside
18 0.08% 99.39% 2 300119 java.lang.AbstractStringBuilder.expandCapacity
19 0.04% 99.43% 1 300011 sun.misc.URLClassPath$JarLoader.
20 0.04% 99.47% 1 300033 java.security.SecureClassLoader.getProtectionDomain
21 0.04% 99.51% 1 300037 java.lang.Integer.getChars
22 0.04% 99.55% 1 300063 java.text.NumberFormat.getInstance
23 0.04% 99.59% 1 300104 java.util.regex.Pattern.closure
24 0.04% 99.63% 1 300110 Timing$BuilderCommand.doStuff
25 0.04% 99.68% 1 300115 Timing.onlyOutside
26 0.04% 99.72% 1 300118 java.lang.String.
27 0.04% 99.76% 1 300123 java.lang.Double.doubleToLongBits
28 0.04% 99.80% 1 300124 java.lang.AbstractStringBuilder.
29 0.04% 99.84% 1 300125 java.lang.Double.valueOf
30 0.04% 99.88% 1 300126 java.lang.Double.valueOf
31 0.04% 99.92% 1 300127 java.lang.String.
32 0.04% 99.96% 1 300128 java.io.FileOutputStream.writeBytes
33 0.04% 100.00% 1 300129 java.lang.Long.valueOf
CPU SAMPLES END
[/code]

-----------------------------------------------------
WINDOWS

[code]

onlyOutside Len: 5,888,890, FINISH: BUFFER Total: 93.147 ms, Avg: 93 ns
onlyOutside Len: 5,888,890, FINISH: BUILDER Total: 113.907 ms, Avg: 113 ns
onlyOutside Len: 5,888,890, FINISH: BUFFER Total: 92.022 ms, Avg: 92 ns
onlyOutside Len: 5,888,890, FINISH: BUILDER Total: 85.657 ms, Avg: 85 ns
onlyOutside Len: 5,888,890, FINISH: BUFFER Total: 106.187 ms, Avg: 106 ns
onlyOutside Len: 5,888,890, FINISH: BUILDER Total: 97.696 ms, Avg: 97 ns
onlyOutside Len: 5,888,890, FINISH: BUFFER Total: 92.263 ms, Avg: 92 ns
onlyOutside Len: 5,888,890, FINISH: BUILDER Total: 87.34 ms, Avg: 87 ns
onlyOutside Len: 5,888,890, FINISH: BUFFER Total: 92.798 ms, Avg: 92 ns
onlyOutside Len: 5,888,890, FINISH: BUILDER Total: 87.924 ms, Avg: 87 ns

[/code]

[code]
CPU SAMPLES BEGIN (total = 581) Fri Jul 06 15:01:58 2007
rank self accum count trace method
1 16.87% 16.87% 98 300105 java.lang.Integer.getChars
2 14.29% 31.15% 83 300038 java.lang.Integer.getChars
3 14.11% 45.27% 82 300107 java.lang.System.arraycopy
4 8.78% 54.04% 51 300108 java.lang.AbstractStringBuilder.expandCapacity
5 8.78% 62.82% 51 300039 java.lang.AbstractStringBuilder.stringSizeOfInt
6 7.40% 70.22% 43 300114 java.lang.AbstractStringBuilder.stringSizeOfInt
7 5.68% 75.90% 33 300113 java.lang.System.arraycopy
8 4.99% 80.90% 29 300109 java.lang.AbstractStringBuilder.expandCapacity
9 3.61% 84.51% 21 300121 java.lang.System.arraycopy
10 3.44% 87.95% 20 300118 Timing.onlyOutside
11 3.10% 91.05% 18 300117 java.lang.System.arraycopy
12 2.75% 93.80% 16 300119 java.io.FileOutputStream.writeBytes
13 1.55% 95.35% 9 300122 java.lang.String.
14 1.03% 96.39% 6 300125 java.lang.String.
15 0.69% 97.07% 4 300115 java.lang.AbstractStringBuilder.append
16 0.69% 97.76% 4 300110 java.lang.AbstractStringBuilder.append
17 0.52% 98.28% 3 300116 Timing.onlyOutside
18 0.17% 98.45% 1 300120 java.text.DecimalFormat.applyPattern
19 0.17% 98.62% 1 300123 java.lang.StringBuilder.toString
20 0.17% 98.80% 1 300019 java.util.zip.Inflater.inflateBytes
21 0.17% 98.97% 1 300124 java.lang.AbstractStringBuilder.expandCapacity
22 0.17% 99.14% 1 300126 java.lang.String.charAt
23 0.17% 99.31% 1 300104 Timing.onlyOutside
24 0.17% 99.48% 1 300128 java.text.MessageFormat.subformat
25 0.17% 99.66% 1 300061 java.text.MessageFormat$Field.
26 0.17% 99.83% 1 300127 java.lang.AbstractStringBuilder.expandCapacity
27 0.17% 100.00% 1 300129 java.lang.String.toLowerCase
CPU SAMPLES END
[/code]

Regards,

mipper

krausest1
Offline
Joined: 2003-06-12
Points: 0

I had a similar problem which I posted here http://www.javagaming.org/forums/index.php?topic=13657.msg108868#msg108868

I recommend removing all nanoTime calls and instead using only a single nanoTime around the whole benchmark (just make sure hotspot won't eliminate the benchmark...) and then comparing the performance.

mipper
Offline
Joined: 2007-06-27
Points: 0

Brian,

Thanks again for your feedback. I tried very hard to ensure that I didn't fall into the usual micro benchmarking problems when writing the test code. I'll respond to your suggestions one at a time with my reasoning for why I don't believe they're the cause of my observations.

Firstly, I have run the code on Windows using both the client & server JVMs and see in my results the expected 'steps' in performance as the HotSpot compiler kicks in. As expected you see a very swift improvement under the client JVM and a slightly better improvement which takes a mite longer to achieve under the server JVM. Also, I call the outer loop in the version of the code I posted 100 times, but I have also run the test for loops of 1,000 & 10,000 times with no further improvement on either platform. The iteration that I used for my comparison was the 96th of the outer loop in both cases.

As far as optimising the code away is concerned, I don't believe this is happening as I checked the effect of commenting out the body of the doStuff call and saw an marked drop in the time taken to execute the method on both platforms. (I have however added an accessor method to the Command object and used it to provide additional information in the output command to make sure. The performance is again identical to the version without this on both platforms.)

As a slight aside, it's interesting to note that in the version of the test in which the body of the doStuff has been removed:

[code]
W everyCall Len: 0, FINISH: BUFFER Total: 2.827 ms, Avg: 1,420/2,827 ns
L everyCall Len: 0, FINISH: BUFFER Total: 9.762 ms, Avg: 4,918/9,762 ns

W onlyOutside Len: 0, FINISH: BUFFER Total: 0.013 ms, Avg: 13 ns
L onlyOutside Len: 0, FINISH: BUFFER Total: 0.022 ms, Avg: 22 ns

W everyCall Len: 0, FINISH: STRING Total: 2.819 ms, Avg: 1,412/2,819 ns
L everyCall Len: 0, FINISH: STRING Total: 9.773 ms, Avg: 4,944/9,773 ns

W onlyOutside Len: 0, FINISH: STRING Total: 0.013 ms, Avg: 13 ns
L onlyOutside Len: 0, FINISH: STRING Total: 0.021 ms, Avg: 21 ns
[/code]

a) the empty method still appears to execute nearly twice as fast on windows
b) You can see the marked difference in the time taken to call the nanosecond clock on the two platforms.

The final observation is that the original stimulus for creating the test was an observation of the relative performance of the two platforms on exactly the same data at exactly the same time on a - very performance critical - full blooded application which was running for over an hour on both platforms.

Regards,

mipper

mipper
Offline
Joined: 2007-06-27
Points: 0

And in response to smoov's points I should also point out that all of the results I'm using are from runs which used the server JVM on both platforms.

briand
Offline
Joined: 2005-07-11
Points: 0

Can you try this on a more recent JDK version - something like 1.5.0_12 or 1.6.0_01?
There have been changes to the loop optimizations in some of these releases, which
may make a difference here.

Also, you mention that you are running on a 64-bit linux kernel. Are you also running
a 64-bit JVM on that OS? If so, you may be comparing 32-bit performance vs 64-bit
performance. It might not matter much for this microbenchmark, but it really depends
on what cmd.doStuff() actually does.

Brian

mipper
Offline
Joined: 2007-06-27
Points: 0

Many thanks for that tip Brian. The full code of my test is included below.

In answer to your questions regarding version of the JDK etc. I've actually run this code on kernel 2.6.20 & 2.4.21 using JDK 1.5.0_12 & 1.6.0_1 on a mixture of Linux and Windows boxes and get essentially the same results. I chose the comparison environments based on what is used on the server which I cannot easily change. (Isn't that always the way of it?)

The other thing worth mentioning is that the reason for creating the test in the first place was due to a similiar observed difference between the development & production environments for a real application. My dev WindowsXP box with a Pentium 4 3.6GHz processor and 1Gb of RAM which was groaning under the weight of the number of applications I was running at the time trounced my Quad Xeon 3.2GHz with 4Gb RAM with nothing else running on it.

[code]
import java.text.MessageFormat;

public class Timing
{

public static void main ( String[] args )
{
Timing t = new Timing ();
for ( int i = 0; i < 100; i++ )
{
t.everyCall ( "BUFFER", t.new BufCommand () );
t.onlyOutside ( "BUFFER", t.new BufCommand () );
t.everyCall ( "STRING", t.new StrCommand () );
t.onlyOutside ( "STRING", t.new StrCommand () );
System.out.println ( "DONE\n" );
}
}

private void everyCall ( String desc, Command cmd )
{
long total = 0;
long oursideStart = System.nanoTime ();
for ( int i = 0; i < COUNT; i++ )
{
long start = System.nanoTime ();
cmd.doStuff ( i );
total += System.nanoTime () - start;
}
long diff = System.nanoTime () - oursideStart;
System.out.println ( MessageFormat.format ( "everyCall FINISH: {0} Total: {1} ms, Avg: {2}/{3} ns",
desc,
diff / 1000000.0,
total / COUNT,
diff / COUNT ) );
}

private void onlyOutside ( String desc, Command cmd )
{
long start = System.nanoTime ();
for ( int i = 0; i < COUNT; i++ )
{
cmd.doStuff ( i);
}
long diff = System.nanoTime () - start;
System.out.println ( MessageFormat.format ( "onlyOutside FINISH: {0} Total: {1} ms, Avg: {2} ns",
desc,
diff / 1000000.0,
diff / COUNT ) );
}

abstract class Command
{
public abstract void doStuff ( int i );
}

class BufCommand extends Command
{
private StringBuffer _buf = new StringBuffer ();

@Override
public void doStuff ( int i )
{
_buf.append ( i );
}
}

class StrCommand extends Command
{
private String _buf = "";

@Override
public void doStuff ( int i )
{
_buf = _buf + i;
}
}

private static final int COUNT = 1000;

}
[/code]

briand
Offline
Joined: 2005-07-11
Points: 0

My initial guess here is that this is a classic micro benchmarking issue. First, take
a look at the body of the benchmark:

[code]
public void doStuff ( int i )
{
_buf.append ( i );
}
[/code]

This is a very simple method, and is likely to get inlined into the body of the
loop, assuming that the JIT has a chance to warm up. Note that the -client JVM,
which you are probably using on Windows, actually warms up sooner than the
-server JVM, which you are running on Linux. So, Windows may have an
advantage here over Linux.

More importantly, since [b]_buf[/b] is never used again, it's likely that the JIT
eliminates the call to [b]_buf.append( i )[/b] and [b]_buf = _buf + i;[/b], thus
resulting in an empty loop body.

To avoid these types of optimizations, you need to make sure that [b]_buf[/b]
is used again. Adding an accessor method to your [b]Command[/b] subclasses
that is called when you generate output may be sufficient. For example, include
the length of the string in the output by calling something like cmd.getStringLenth(),
which returns [b]_buf.length()[/b]. This will prevent the JIT from making such
optimizations.

The difference in the timing of this optimization may be the cause for the difference
in the measured performance. A run with -XX:+PrintCompilation will give an indiaction
of timing of the JIT compilations. Comparing the windows PrintCompilation output to
the Linux PrintCompilation output might provide some additional clues.

Finally, you may want to try more calls to [b]t.everyCall()[/b] and [b]t.onlyOutside()[/b];
it may be that the top level loop is not getting compiled in one runtime but is getting
compiled in the other. I generally make no fewer than 2 calls to a method before
starting my measurements because the JIT won't activate some compiled methods
until the next call, and sometimes 1 call isn't enough to trigger the compilation thresholds.
The actual number of calls for warmup is application dependent, I just find that most
micro benchmarks that I write get away with 2 calls prior to the call that's actually
measured.

smoov
Offline
Joined: 2007-05-01
Points: 0

You might try getting a little closer to apples<->apples by either running the -server VM on Windows or setting the compile threshold on both to the same value. The default (I think) for -client is 1000 invocations and 10,000 for -server.

Since this is a microbenchmark (and you are sure to take the results with a grain of salt), try setting both tests to -XX:CompileThreshold=1000 (or 100). The other differences between client and server are still likely to skew the results, so I'd recommend setting Windows up to run the server VM first:

Copy
\jdk1.6.0_01\jre\bin\server [u]directory[/u] to
\jre1.6.0_01\bin\server.

Another tip on comparing microbenchmark results from system to system is to set the heap initial and max settings high enough so the GC activity doesn't skew the results. -verbose:gc One strategy is to run the warmup section, request a Full GC or 3, wait a short amount of time and then fire off the main test. Also, add -noclassgc JVM option to keep everything loaded.

Please report back what you find.

briand
Offline
Joined: 2005-07-11
Points: 0

These are reasonable suggestions., though setting CompileThreshold the same
for both -client and -server is not likely to help here (as smoov indicated). Just
run -server and/or -client on both systems.

Also, -Xnoclassgc has been known to cause GC performance issues and is generally
not recommended. In fact, there was a proposal not long ago to turn this flag into a
noop, though nothing has happened with that yet. Also, this is such a small app that
it's not likely than any class unloading is occurring, though that could easily be tracked
with -XX:+TraceClassUnloading. Personally, I think the use of -Xnoclassgc is an artifact
of GC algorithms of a bygone era.

Brian

smoov
Offline
Joined: 2007-05-01
Points: 0

Hey Brian, I appreciate your knowledge in this area. I'd like to point out that since this is a microbenchmark and I made suggestions to mostly negate the effects of GC, -Xnoclassgc will likely help by eliminating any unloading activity that might occur. As you suggest -XX:+TraceClassUnloading will reveal any of this activity, which in my particular micro tests ware considerable and by adding the -Xnoclassgc the results of my tests changed.

As with all these suggestions, what might benefit one test might have the opposite affect in another test. What is important to have a number of options or strategies to try.

mipper
Offline
Joined: 2007-06-27
Points: 0

For some reason I don't appear to be able to include the code for my test harness in my post. It cuts off at a line like this:

diff / 10000.0,

And nothing I do to try to stop it seems to work. I don't appear to be able to find any information about why this should be but if anybody can let me know I'll include the code in a reply.

Regards,

mipper

briand
Offline
Joined: 2005-07-11
Points: 0

If you use \[code\] and \[\/code\] pairs (without the backslashes, which I used to avoid the transformation) around your code, you can format it so it's more readable. Similarly, any place where you index an array with the variable [i]i[/i], the forum code will take that as a request to present the text in an [i]italics[/i] font. So, you'll want to be sure to avoid using [i]i[/i] as an index :-/.