Tuning Java App on an 8-core Sun Fire T2000 Server

2 replies [Last post]
chrisab
Offline
Joined: 2007-04-10
Points: 0

I am deploying a java app on an 8-core (32 simultaneous threads) Sun Fire T2000 Server running Solaris 10 and have some questions about tuning the app and the JVM for this platform.

One thing that I have noticed is that the JVM starts up with 43 threads right out of the box. Is it normal for the JVM to do this (our JVM is build 1.5.0_09-b03)?

The flow of our application is this: read XML files from the file system (NAS), process the data contained within the files, and then send filtered/corrected data in XML format to various WebSphere MQ queues (data is also archived to the file system, NAS again).

The app is configured as two java processes, both using a fixed thread pool (java.util.concurrent.Executors.newFixedThreadPool) of 14 worker threads. This configuration, 2 java processes with 14 threads each, has performed better for us than one java process with 30 threads. Our problem is that even with two java processes running, vmstat reports about 60% user, 20% sys, and 20% idle CPU usage. We had previously worked on removing most of the blocking between the threads, but I'm thinking the high system usage from vmstat may mean that something is still there causing the threads to block/wait on each other. Is there anything else that high system usage in vmstat usually reflects in your java app?
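For reference, the dispatch pattern I'm describing looks roughly like this (a minimal sketch, not our actual code; the class name, `process` method, and directory handling are placeholders):

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class XmlDispatcher {

    // Stand-in for the real work: parse the XML, filter/correct it,
    // send it to the MQ queues, and archive it back to the NAS.
    static void process(File f) {
        System.out.println("processed " + f.getName());
    }

    public static void main(String[] args) throws InterruptedException {
        // 14 worker threads per JVM, as in our configuration
        ExecutorService pool = Executors.newFixedThreadPool(14);

        File[] inputs = new File(args.length > 0 ? args[0] : "input").listFiles();
        if (inputs != null) {
            for (final File f : inputs) {
                // Anonymous Runnable keeps this JDK 1.5 compatible
                pool.submit(new Runnable() {
                    public void run() { process(f); }
                });
            }
        }
        pool.shutdown();                               // stop accepting new work...
        pool.awaitTermination(60, TimeUnit.SECONDS);   // ...and let the queue drain
    }
}
```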

In addition to the high sys usage in vmstat, prstat shows that many of our threads are sleeping. The prstat listing below shows that nearly half of the threads are sleeping 15-50% of the time. Our app pulls files from a directory and hands them off to worker threads to process. For this test we loaded the input directory with plenty of files, so none of the worker threads should be sleeping.

Any suggestions on how many threads and/or how many java procs to run to best utilize the T2000?

Thanks,
Chris

[code]
> java -version
java version "1.5.0_09"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_09-b03)
Java HotSpot(TM) Server VM (build 1.5.0_09-b03, mixed mode)

> prstat -mL
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 27871 user      90 3.4 0.2 0.0 0.0 4.9 0.4 1.1  3K  1K  8K   0 java/45
 27859 user      87 3.9 0.2 0.0 0.0 7.5 0.0 1.2  4K 749 10K   0 java/47
 27871 user      87 3.7 0.1 0.0 0.0 6.6 1.5 0.8  3K 699  9K   0 java/85
 27871 user      87 3.8 0.1 0.0 0.0 6.8 1.3 0.8  3K 650  9K   0 java/44
 27871 user      87 3.8 0.1 0.0 0.0 6.5 1.6 0.8  3K 745  9K   0 java/50
 27859 user      86 4.0 0.1 0.0 0.0 7.7 0.4 1.3  4K 661 10K   0 java/45
 27859 user      86 4.2 0.2 0.0 0.0 8.1 0.8 0.9  4K 721 10K   0 java/87
 27871 user      84 4.6 0.2 0.0 0.0 4.8 3.7 2.8  3K 839 10K   0 java/67
 27859 user      83 4.8 0.1 0.0 0.0 7.8 3.2 1.3  4K 634 11K   0 java/86
 27859 user      82 5.2 0.1 0.0 0.0 8.1 3.6 0.9  4K 593 12K   0 java/50
 27859 user      82 3.8 0.1 0.0 0.0 7.9 5.2 0.8  3K 489  9K   0 java/58
 27859 user      81 4.8 0.2 0.0 0.0 7.5 5.5 0.9  3K 978 10K   1 java/85
 27871 user      72 7.6 0.2 0.0 0.0 2.5  12 4.9  3K  1K 15K   0 java/65
 27871 user      71 7.7 0.1 0.0 0.0 2.7  13 5.0  3K 884 15K   0 java/49
 27871 user      70 7.6 0.1 0.0 0.0 2.5  15 5.0  3K 612 15K   0 java/52
 27859 user      70 7.5 0.2 0.0 0.0 7.4  13 1.6  4K  1K 15K   1 java/60
 27871 user      68 8.1 0.2 0.0 0.0 4.4  17 1.9  2K  1K 15K   0 java/47
 27859 user      68 7.9 0.2 0.0 0.0 7.6  15 1.1  3K  1K 15K   0 java/52
 27859 user      68 8.2 0.1 0.0 0.0 7.7  15 1.0  4K 855 17K   0 java/46
 27871 user      65 9.3 0.2 0.0 0.0 2.7  18 5.3  3K  1K 18K   0 java/51
 27871 user      60 9.8 0.1 0.0 0.0 4.4  23 2.8  3K 740 19K   0 java/66
 27859 user      58 9.9 0.1 0.0 0.0 7.2  23 1.4  3K 903 19K   1 java/44
 27871 user      58  10 0.1 0.0 0.0 1.9  25 100  3K 993 19K   0 java/46
 27859 user      52  12 0.2 0.0 0.0 5.8  28 2.3  3K  1K 23K   0 java/51
 27859 user      48  12 0.1 0.0 0.0 7.1  32 1.0  3K 724 22K   0 java/49
 26917 user      24  22 0.1 0.0 0.0  28  24 1.5  4K 230 54K   0 amqrmppa/648674
 27871 user      11  35 0.0 0.0 0.0 0.0  54 0.3 300 163 19K   0 java/43
 27859 user      10  35 0.0 0.0 0.0 1.5  50 2.9 427 175 17K   0 java/43
 27871 user      24  18 0.1 0.0 0.0 2.1  51 5.3  3K  1K 32K   0 java/84
 26917 user      15  16 0.0 0.0 0.0  21  45 1.8  4K 487 41K   0 amqrmppa/648673
 26917 user      13  14 0.0 0.0 0.0  15  57 1.2  4K 253 36K   0 amqrmppa/648672
 20077 user      13  14 0.0 0.0 0.0  15  56 1.8  4K 934 37K   0 amqrmppa/17481
 26910 user     4.6  19 0.0 0.0 0.0  55  21 1.2  3K 128  7K   0 amqzmuc0/3
 26917 user      11  12 0.1 0.0 0.0  16  59 1.3  3K 344 29K   0 amqrmppa/648671
 26917 user     8.7  11 0.0 0.0 0.0  23  56 1.5  4K 183 30K   0 amqrmppa/648665
 26917 user     8.2  11 0.1 0.0 0.0  21  59 1.4  4K 345 30K   0 amqrmppa/648666
 20077 user     8.2  11 0.1 0.0 0.0  16  64 1.5  4K 888 28K   0 amqrmppa/17477
 28033 user     9.0 7.6 0.0 0.0 0.0  83 0.0 0.7  1K 139 19K   0 amqzlaa0_nd/162
 28293 user     8.7 4.8 0.0 0.0 0.0  86 0.0 0.8  2K 146 15K   0 amqzlaa0_nd/17
 27371 user     8.9 4.5 0.1 0.0 0.0  86 0.0 0.9  2K 173 13K   0 amqzlaa0_nd/3

Total: 148 processes, 506 lwps, load averages: 27.63, 23.88, 16.20
[/code]

briand
Offline
Joined: 2005-07-11
Points: 0

> I am deploying a java app on an 8-core (32
> simultaneous threads) Sun Fire T2000 Server running
> Solaris 10 and have some questions about tuning the
> app and the JVM for this platform.
>
> One thing that I have noticed is that the JVM starts
> up with 43 threads right out of the box. Is it normal
> for the JVM to do this (our JVM is build
> 1.5.0_09-b03)?

Yes. A 32-thread, 8-core T2000 is a server-class machine (2 or more CPUs, 2 GB or more of RAM, and Solaris, Linux, or 64-bit Windows). Server-class machines use the Parallel Scavenge (-XX:+UseParallelGC) garbage
collector by default, and we instantiate a parallel GC thread for each available
CPU. The 32-thread T2000 appears to the JVM to have 32 CPUs, so 32
parallel GC threads are instantiated. In addition, there are 2 compiler threads
and a JVM thread. If your application is an RMI- or CORBA-based app, the
ORB itself also instantiates threads. Furthermore, many apps and app servers
instantiate thread pools based on the number of available processors.

Note that this default number of parallel GC threads is not always optimal,
especially when the young gen is smallish. In JDK 6.0, the number of GC
threads started by default is 5/8 the number of available CPUs (so 20 rather
than 32 on a 32-CPU T2000); this mirrors the default used by the parallel
young-gen collector used with the CMS garbage collector.

>
> The flow of our application is this: read xml files
> from the file system (NAS), process data contained
> within the files, and then send filtered/corrected
> data in xml format to various WebsShere MQs (data is
> also archived to file system, NAS again).
>
> The app is configured as two java processes both
> using a fixed thread pool
> (java.util.concurrent.Executors.newFixedThreadPool)
> of 14 worker threads. This configuration, 2 java

You might want to consider putting those two processes into processor
sets. The processor sets should be sized in multiples of 4 (4 hw threads
per core), with each group of 4 coming from the same core. In this mode,
the JVM will recognize that it's running within a processor set and adjust
the number of available CPUs accordingly, thus reducing the number of
parallel GC threads. Alternatively, you can set the number of parallel GC
threads explicitly with -XX:ParallelGCThreads=N. If your processor sets
have 16 CPUs each, I would probably start the JVMs with 12 parallel GC
threads each and then experiment with larger and smaller numbers
until I found the optimal collection times.
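Concretely, that setup might look something like this (a sketch only: processor IDs, set IDs, and jar names are illustrative, psrset needs root, and you should check psrset(1M) and psrinfo on your box for the exact CPU-to-core mapping and argument syntax):

```shell
# Create two 16-CPU processor sets; psrset prints the new set id.
# On a T2000, CPUs 0-3 are core 0, 4-7 are core 1, and so on, so
# 0-15 and 16-31 each cover four whole cores.
psrset -c 0-15
psrset -c 16-31

# Launch one JVM in each set (set ids 1 and 2 assumed here),
# capping the parallel GC threads as suggested above.
psrset -e 1 java -XX:+UseParallelGC -XX:ParallelGCThreads=12 -jar proc1.jar &
psrset -e 2 java -XX:+UseParallelGC -XX:ParallelGCThreads=12 -jar proc2.jar &
```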

> processes with 14 threads each, has preformed better
> for us than one java process with 30 threads. Our
> problem is that even with two java processes running,
> vmstat reports about 60% user, 20% sys, and 20% idle

The vmstat statistics on the T2000 are not quite accurate. There are
various blog entries and articles on the various Sun websites that
discuss this issue, and there are tools, such as corestat, that give
better CPU-utilization stats on the T2000.

> cpu usage. We had previously worked on removing most
> of the blocking between the threads but I’m thinking
> the high system usage from vmstat may mean that
> something is still there causing the threads to
> block/wait on each other. Is there anything else that
> high system usage in vmstat usually reflects in you
> java app?

The ParallelGC issue I describe above can show up as high system
usage. It involves a design characteristic of the parallel GC worker
pool called 'work stealing': when parallel GC threads finish their
allocation of work while other parallel GC threads still have work
to do, the idle threads will steal work from the busy ones. This
tends to bang on a lock.

>
> In addition to the high sys usage in vmstat, prstat
> shows that many of our threads are sleeping. The
> prstat listing below shows that nearly half of the
> threads are sleeping 15-50% of the time. Our app
> pulls files from a directory and hands it off to a
> worker thread to process. For this test we loaded the
> input directory with plenty of files and therefore
> none of the worker threads should be sleeping.

Did you increase the size of your Java heaps on the T2000? Each
application thread is going to need memory, and with 32 available
CPUs and corresponding application threads, you'll need more Java
heap memory than on 'smaller' systems. It's best to increase both the
overall heap size and the size of the young generation (-Xmn or NewSize/MaxNewSize).

You should collect some GC statistics (-verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps), see how often GC is running and for
how long, and then tune appropriately. You'll need to do this for both
processes. If GC is not the bottleneck, then the next step is to profile your
application to find out where the bottleneck is. There are various tools
available for this, ranging from the JVMTI-based hprof profiler to the NetBeans
profiler to the Sun Studio analyzer and others. Each has its advantages and
disadvantages. I usually start with hprof.
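For example, a launch line along these lines would log every collection with timestamps (the GC flags are standard HotSpot options; the heap sizes, jar name, and log file are placeholders to adjust for your app):

```shell
# Log each GC with details and timestamps, one log per process,
# so the two JVMs can be compared afterwards.
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -Xms2g -Xmx2g -Xmn512m \
     -jar proc1.jar > gc-proc1.log 2>&1
```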

>
> Any suggestion on how many threads and/or how many
> java procs to run to best utilized the T2000?

It's application dependent. You may want to fire up even more threads than
you currently are using, but it's probably best to find your current bottleneck first.

You might also want to check whether floating-point usage is an issue. Look for
the Cool Tools suite, which includes a tool that checks for floating-point usage.

>
> Thanks,
> Chris
>
>
> [java -version and prstat -mL output quoted above snipped]

Note that your amqr* processes have some really large LWPIDs and
rather high ICX (involuntary context switch) rates relative to your java threads.
You might want to investigate those processes too.

HTH,
Brian

crofts
Offline
Joined: 2003-06-11
Points: 0

> You might want to consider putting those two processes into processor
> sets. The processor sets should be sized in multiples of 4 (4 hw threads
> per core) and all from the same core.

As I understand it, the threads of a core are scheduled in round-robin order, so wouldn't it be better to spread a processor set across multiple cores? For example, three processor sets could each take one hardware thread from each of the eight cores (the fourth set would be the default pool).

Comments anyone?