Profiler for analyzing monitor usage

23 replies
linuxhippy
Offline
Joined: 2004-01-07
Points: 0

Hello,

I would like to analyze monitor usage in my application to improve scalability.
I already searched a bit for tools capable of doing this; however, OptimizeIt only analyzes monitor usage that could lead to deadlocks.

I would like to have a tool which logs every monitor enter/exit of an application, and which can tell me how heavily each lock was contended.
It would be great if it could also give me stack traces showing where the monitorenter/monitorexit was called from.

Does anybody know of such a tool or profiler?

Thank you in advance, lg Clemens

PS: The reason why I think I need such a tool is that I always designed my app with scalability in mind.
Lately I did some testing on a 16-CPU machine and discovered it did not scale well - until I found out that ByteArrayOutputStream is fully synchronized, a class which is used a lot in my server application.
I don't know whose decision it was back in the 1.0 days to synchronize all and everything, but I would like to kill this guy ;)
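
As a rough starting point for the contention part of the question, the JDK's own management API (java.lang.management, Java 5 and later, so not an option on 1.3) can report how often and how long each thread was blocked on a monitor. This is only a sketch: the class name ContentionReport is made up for illustration, it covers contended enters only, and it gives no per-lock statistics or stack traces.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: prints per-thread monitor contention counters.
class ContentionReport {

    /** Returns one line per live thread with its blocked count and blocked time. */
    static String report() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Blocked-time measurement is optional and switched off by default.
        if (threads.isThreadContentionMonitoringSupported()) {
            threads.setThreadContentionMonitoringEnabled(true);
        }
        StringBuilder out = new StringBuilder();
        ThreadInfo[] infos = threads.getThreadInfo(threads.getAllThreadIds());
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // the thread terminated between the two calls
            }
            out.append(info.getThreadName())
               .append(": blocked ").append(info.getBlockedCount())
               .append(" times, ").append(info.getBlockedTime())
               .append(" ms\n"); // time is -1 if monitoring is not enabled
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

Calling report() periodically from inside the server and comparing the counters shows which threads spend time waiting for monitors; finding the lock itself still needs a profiler.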

linuxhippy
Offline
Joined: 2004-01-07
Points: 0

@fuerte:
Locking is never cheap - it's a very complex topic, and especially on modern architectures it causes a lot of overhead.
Whereas the uncontended case can be implemented quite efficiently on single-CPU systems ("just" pipeline stalls and in-CPU troubles), on multiprocessor systems it will very likely at least cause system-bus communication between the CPUs, which is orders of magnitude slower.
Whether you call the keyword synchronized or lock does not make any difference ;)

lg Clemens

fuerte
Offline
Joined: 2004-11-22
Points: 0

Look what I found:
http://www.cs.queensu.ca/Java/code/10Steps.html

"[b]5. Minimize use of the Java I/O calls. Write your own![/b]

If you are finding that even buffered Streams, Readers and Writers are too slow you may need to blow off the entire Java I/O library and build your own based upon InputStream.read(byte[]) and OutputStream.write(byte[]).

E.g., if we try to read a sequence of integers using DataInputStream.readInt() we make a synchronized call every 4 bytes. You can get an order of magnitude improvement by using the read(byte[]) methods (and variants) to read large buffers in a single synchronized call, and then using Java's bit operators to build your int's from their constituent bytes. Since this is how the I/O libraries do it anyway the bit manipulations don't cost you anything (except the aggravation of having to do them yourself!). Eventually we may hope that standard I/O libraries will be able to give you something approaching optimal performance. "
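
A minimal sketch of what the quoted tip describes, assuming big-endian ints (the byte order DataInputStream.readInt() uses). The class and method names are invented for this example and are not part of any library.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: reads many ints with bulk reads instead of readInt().
class BulkIntReader {

    /** Reads exactly count big-endian ints using bulk read calls. */
    static int[] readInts(InputStream in, int count) throws IOException {
        byte[] buf = new byte[count * 4];
        int filled = 0;
        // read(byte[], int, int) may deliver fewer bytes than requested, so loop.
        while (filled < buf.length) {
            int n = in.read(buf, filled, buf.length - filled);
            if (n < 0) {
                throw new EOFException("stream ended after " + filled + " bytes");
            }
            filled += n;
        }
        int[] result = new int[count];
        for (int i = 0; i < count; i++) {
            int p = i * 4;
            // Mask each byte to undo sign extension, then shift into place.
            result[i] = ((buf[p] & 0xFF) << 24)
                      | ((buf[p + 1] & 0xFF) << 16)
                      | ((buf[p + 2] & 0xFF) << 8)
                      | (buf[p + 3] & 0xFF);
        }
        return result;
    }
}
```

On a stream that delivers everything at once this makes one read call for the whole batch instead of one call per int.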

cowwoc
Offline
Joined: 2003-08-24
Points: 0

Tiger included changes to the Java Memory Model. What those changes were and how they affect synchronization performance I have no idea.

huntch
Offline
Joined: 2004-12-06
Points: 0

Sorry for joining a little late here :-/

I may have missed something above ... why not use NIO ?

ByteBuffer.put()/get() are not synchronized.

And if you combine DirectByteBuffer with its get()/put() methods for bytes, the corresponding get()/put() methods for the other Java primitive types, and view buffers for putting and getting arrays of primitives, you can greatly outperform java.io streams. That is especially true once you start writing to files and sockets: DirectByteBuffer storage lives in native memory, so the data does not have to be copied across the JNI boundary when you do a file / socket read / write. And get() / put() operations on DirectByteBuffer use a clever optimization that does not cross the traditional JNI boundary.

I've been doing quite a bit of performance testing and benchmarking in this area over the past 6 or 8 weeks, and I'm seeing that Mustang (Java 6) on HotSpot does a very nice job optimizing DirectByteBuffer get() / put(). And when you throw in the file / socket read / write, it kicks butt compared to a byte[] based implementation.

So, if you can move to an NIO implementation, I'd go in that direction instead of using the java.io stream approach. You'll have to be aware that allocating a DirectByteBuffer costs more than allocating a byte[]. But you can pool DirectByteBuffers to reduce that cost.
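
To make the buffer calls above concrete, here is a small sketch using only standard java.nio classes; DirectBufferSketch and roundTrip are names invented for the example. It bulk-writes an int[] through a view buffer and reads the values back with relative getInt() calls. In a real server the filled buffer would be handed to a channel write rather than read back.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

// Hypothetical demo: primitive and view-buffer access on a direct buffer.
class DirectBufferSketch {

    /** Writes values through an IntBuffer view, then reads them back. */
    static int[] roundTrip(int[] values) {
        // Direct buffers are costlier to allocate; a real server would pool them.
        ByteBuffer buf = ByteBuffer.allocateDirect(values.length * 4);

        // View buffer: one bulk put for the whole array.
        IntBuffer ints = buf.asIntBuffer();
        ints.put(values);

        // The view keeps its own position, so buf is still at position 0.
        int[] back = new int[values.length];
        for (int i = 0; i < back.length; i++) {
            back[i] = buf.getInt(); // unsynchronized relative get
        }
        return back;
    }
}
```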

On the topic of lock profiling ... Sun Studio's Collector / Analyzer has a really good lock profiler. I use it all the time for finding scalability locks.

Enjoy!

fuerte
Offline
Joined: 2004-11-22
Points: 0

NIO sounds good. I wonder why in the BufferedInputStream documentation there is no mention of synchronization. There should be a warning for each method that uses it, and a recommendation to use NIO, perhaps even a code example with corresponding NIO class.

linuxhippy
Offline
Joined: 2004-01-07
Points: 0

> There should be a warning for
> each method that uses it, and a recommendation to use
> NIO, perhaps even a code example with corresponding
> NIO class.
Well, NIO isn't a good choice everywhere - e.g. in my case I have many different backends (Socket, NIO, HTTP-Java1.1, Http-Java-1.2+, Http J2ME) which behave very differently - that's why I read and write into a buffer before flushing everything at once.
Using NIO would make the whole thing much more complex and would be overkill.

lg Clemens

huntch
Offline
Joined: 2004-12-06
Points: 0

I am curious why you think NIO would be overkill in some of your situations ?

If you need the data in a byte[], you can use a HeapByteBuffer: you can 'wrap' the byte[] and still get a reference to the byte[] (not an option for DirectByteBuffer, though).
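
A short sketch of the wrapping idea, using ByteBuffer.wrap() and array() from java.nio; the class and method names are illustrative only.

```java
import java.nio.ByteBuffer;

// Hypothetical demo: a heap buffer that shares storage with a plain byte[].
class WrapSketch {

    /** Encodes two ints into a fresh byte[] via a wrapping ByteBuffer. */
    static byte[] encode(int a, int b) {
        byte[] backing = new byte[8];
        ByteBuffer buf = ByteBuffer.wrap(backing); // no copy, shares the array
        buf.putInt(a).putInt(b);                   // writes land in backing
        return backing;
    }

    /** True if array() hands back the very array that was wrapped. */
    static boolean sharesArray(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        return buf.hasArray() && buf.array() == data;
    }
}
```

Because the buffer and the array share storage, code that still expects a byte[] (for example a java.io based backend) can use the same data without copying.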

I can think of some situations where you may not want to utilize NIO non-blocking SocketChannels in favor of blocking SocketChannels or java.net.Sockets. I have code that supports both, for legacy reasons and for customers who insist on using java.net.Sockets or want to extend the API with their own Socket (or as I call it a Connection) factory. But, I've also got highly scalable support using non-blocking SocketChannels. ByteBuffers, especially DirectByteBuffers, go hand and hand with SocketChannels.

Those who can use HotSpot version 1.5.0_06 can pass -XX:+UseBiasedLocking to reduce the overhead of uncontended synchronized locks. I think it also goes without saying that uncontended synchronized locks are not nearly as costly as they used to be.

kutzi
Offline
Joined: 2005-03-29
Points: 0

> But you're right - the Java core APIs consist of a
> mix and match of over-synchronized classes and completely
> non-thread-safe ones (even in java.io -
> ByteArray*Streams are thread-safe while Data*Streams
> are not).

Out of interest: where did you see that Data*Streams are not thread-safe? When I look into the source code of DataOutputStream I see:
public synchronized void write(int b) and
public synchronized void write(byte b[], int off, int len)

Same as in ByteArrayOutputStream.

linuxhippy
Offline
Joined: 2004-01-07
Points: 0

> Out of interest: where did you see that Data*Streams
> are not thread-safe? When I look into the source code
> of DataOutputStream I see:
> public synchronized void write(int b) and
> public synchronized void write(byte b[], int off, int
> len)
>
> Same as in ByteArrayOutputStream.

Yes, since these are overridden from OutputStream - but readInt(), writeInt() etc. are not thread-safe at all.
If you e.g. call readInt() twice simultaneously, you'll end up with a mix of bytes from both integers.
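
If a DataInputStream really has to be shared between threads, the locking has to happen in the caller and has to cover the whole logical unit being read. A sketch of that idea follows; SharedIntReader and readPair are invented names, and the premise (readInt() assembles its result from several reads of the underlying stream) is my understanding of the JDK sources rather than a documented guarantee.

```java
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical wrapper: one lock around each complete record read.
class SharedIntReader {
    private final DataInputStream in;
    private final Object lock = new Object();

    SharedIntReader(DataInputStream in) {
        this.in = in;
    }

    /** Reads two consecutive ints as one unit, so no other reader can interleave. */
    int[] readPair() throws IOException {
        synchronized (lock) {
            int first = in.readInt();
            int second = in.readInt();
            return new int[] {first, second};
        }
    }
}
```

This costs one monitor enter per record instead of one per byte, and it only protects callers that go through the wrapper.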

lg Clemens

fuerte
Offline
Joined: 2004-11-22
Points: 0

I was surprised to learn that "synchronized" means more than a critical section or a lock on the object. C# has lock(object), is it better than Java "synchronized" then? It does not have any extra penalty I believe. Of course the problem here is that these classes should not be synchronized at all, the synchronization should be left to the programmer.

cowwoc
Offline
Joined: 2003-08-24
Points: 0

How is "synchronized" more than a simple lock? To my knowledge, it *is* a simple lock.

Gili

fuerte
Offline
Joined: 2004-11-22
Points: 0

> How is "synchronized" more than a simple lock? To my
> knowledge, it *is* a simple lock.

http://www-128.ibm.com/developerworks/java/library/j-threads1.html

"[b]What does synchronized really mean?[/b]

Most Java programmers think of a synchronized block or method entirely in terms of enforcing a mutex (mutual exclusion semaphore) or defining a critical section (a block of code which must run atomically). While the semantics of synchronized do include mutual exclusion and atomicity, the reality of what happens prior to monitor entry and after monitor exit is considerably more complicated.

The semantics of synchronized do guarantee that only one thread has access to the protected section at one time, but they also include rules about the synchronizing thread's interaction with main memory. A good way to think about the Java Memory Model (JMM) is to assume that each thread is running on a separate processor, and while all processors access a common main memory space, each processor has its own cache that may not always be synchronized with main memory. In the absence of synchronization, it is allowable (according to the JMM) for two threads to see different values in the same memory location. When synchronizing on a monitor (lock), the JMM requires that this cache be invalidated immediately after the lock is acquired, and flushed (writing any modified memory locations back to main memory) before it is released. It's not hard to see why synchronization can have a significant effect on program performance; flushing the cache frequently can be expensive.
...
[b]How expensive is synchronization?[/b]

Because of the rules involving cache flushing and invalidation, a synchronized block in the Java language is generally more expensive than the critical section facilities offered by many platforms, which are usually implemented with an atomic "test and set bit" machine instruction. Even when a program contains only a single thread running on a single processor, a synchronized method call is still slower than an unsynchronized method call. If the synchronization actually requires contending for the lock, the performance penalty is substantially greater, as there will be several thread switches and system calls required."

Has "synchronized" improved in 1.4 or 1.5?

linuxhippy
Offline
Joined: 2004-01-07
Points: 0

@fuerte:

Sorry, but as far as I know the "lock" keyword of C# means monitor enter/exit rather than just locking, very much the same as Java's synchronized keyword.
At least this is how I understood it when reading docs about C# and multithreading.
Therefore C#'s lock will perform about the same as Java's synchronized; maybe Java is a bit faster because it has been tuned extensively for this type of operation.
It's not a problem of Java nor of its memory model; it's just an expensive operation, anywhere, always.

Sun and most other Java vendors always try to optimize monitor operations as much as possible, and from what I've seen there have been improvements from release to release; Mustang also contains some new optimizations which speed up locking. However, locking always means inter-processor communication on multi-core/multi-processor systems, which is slow by definition.

Mustang includes lock removal done via escape analysis. It's off by default, but you can enable it via the command-line argument:
-XX:+DoEscapeAnalysis

This is the best approach: it removes locks on objects which are only visible to one thread anyway, where synchronization would happen for nothing at all.
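
An illustration of the kind of code this targets (names invented for the example): StringBuffer's methods are synchronized, but the buffer below never leaves the method, so no other thread can ever lock it. Whether a particular JVM actually removes these locks cannot be seen from the code itself; that is an assumption about the optimizer.

```java
// Hypothetical demo: a synchronized object that never escapes its method.
class LocalBufferSketch {

    /** Joins parts with a trailing comma after each element. */
    static String join(String[] parts) {
        // sb is reachable only from this stack frame, so every monitor
        // enter/exit inside append() is uncontended by construction.
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < parts.length; i++) {
            sb.append(parts[i]).append(',');
        }
        return sb.toString(); // only the immutable String escapes
    }
}
```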

lg Clemens

sla
Offline
Joined: 2003-06-10
Points: 0

May I suggest JRockit's Runtime Analyzer: http://dev2dev.bea.com/jrockit/tools.html

Here is an example of the information displayed when doing lock profiling (unfortunately it does not give you stacktraces):
http://edocs.bea.com/wljrockit/docs50/usingJRA/looking.html#1056684

Regards,
/Staffan - yes, I work for BEA

linuxhippy
Offline
Joined: 2004-01-07
Points: 0

Hi Staffan,

> May I suggest JRockit's Runtime Analyzer:
> http://dev2dev.bea.com/jrockit/tools.html

Thanks a lot for pointing this out - a really impressive tool. Although stack traces would be cool, I guess they would simply be too expensive.
It should at least help to identify all the thin-lock bottlenecks on multiprocessor systems.

But all the other features seem to be even more impressive - information no other JVM gives you is shown in a GUI ;)

Thanks again, this is truly unbelievably cool stuff :)

lg Clemens

linuxhippy
Offline
Joined: 2004-01-07
Points: 0

This tool is a dream, especially in conjunction with NetBeans' profiler.

Since most monitor operations are caused by synchronized methods, I just look at the locked object types, go to NetBeans and have it display the stack traces for the class.

This way I was able to identify, for example, that my use of BufferedInputStream.read() caused a huge amount of thin locking (one lock for every byte my application processed) - so with just this simple change I was able to reduce uncontended locking to about 25% of what it had been before :-)
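
The post does not say what the change was. One common way to get rid of per-byte calls, shown here only as an assumed sketch with invented names, is to pull data out of the stream in chunks and iterate over the array:

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo: chunked reads instead of one read() call per byte.
class BulkRead {

    /** Sums all bytes of the stream (as unsigned values) using chunked reads. */
    static long checksum(InputStream in) throws IOException {
        byte[] chunk = new byte[8192];
        long sum = 0;
        int n;
        // One stream call (and at most one lock acquisition) per chunk.
        while ((n = in.read(chunk)) != -1) {
            for (int i = 0; i < n; i++) {
                sum += chunk[i] & 0xFF;
            }
        }
        return sum;
    }
}
```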

Thanks again, this is greeaaat!

lg Clemens

fuerte2
Offline
Joined: 2005-02-28
Points: 0

May I ask how you got around the problems in ByteArrayOutputStream and BufferedInputStream?

linuxhippy
Offline
Joined: 2004-01-07
Points: 0

I downloaded the open-source implementation GNU Classpath, which contains almost all classes that can also be found in Sun's JVM.

http://gnu.org/software/classpath

I looked up the implementations of ByteArrayInput/OutputStream and BufferedInputStream, renamed them, removed all synchronized keywords and used those instead of the classes in the Java core libraries.
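
For readers who want to see roughly what such a class looks like, here is a from-scratch sketch. It is not the GNU Classpath code, the name FastByteArrayOutputStream is made up, and it deliberately sticks to constructs that also compile on old Java versions. It must only be used from one thread at a time.

```java
import java.io.OutputStream;

// Hypothetical unsynchronized counterpart of ByteArrayOutputStream.
class FastByteArrayOutputStream extends OutputStream {
    private byte[] buf = new byte[32];
    private int count;

    /** Grows the internal array so that extra more bytes fit. */
    private void ensureCapacity(int extra) {
        if (count + extra > buf.length) {
            int newSize = Math.max(buf.length * 2, count + extra);
            byte[] bigger = new byte[newSize];
            System.arraycopy(buf, 0, bigger, 0, count);
            buf = bigger;
        }
    }

    public void write(int b) {
        ensureCapacity(1);
        buf[count++] = (byte) b;
    }

    public void write(byte[] src, int off, int len) {
        ensureCapacity(len);
        System.arraycopy(src, off, buf, count, len);
        count += len;
    }

    /** Returns a copy of the bytes written so far. */
    public byte[] toByteArray() {
        byte[] copy = new byte[count];
        System.arraycopy(buf, 0, copy, 0, count);
        return copy;
    }

    public int size() {
        return count;
    }
}
```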

However, if you would like to enhance scalability I would recommend also trying JRockit's Runtime Analyzer. It's really worth a try, although usability is extremely bad ;)

lg Clemens

fuerte2
Offline
Joined: 2005-02-28
Points: 0

Huh, that sounds like quite a lot of work. Wouldn't it be great if there were non-synchronized versions available in Sun's JVM? Just like there is StringBuilder now.

linuxhippy
Offline
Joined: 2004-01-07
Points: 0

Well, after identifying that these were bottlenecks it took me about 15 minutes to create the unsynchronized versions and replace all occurrences.
For me this is a lot better because I need to stay Java 1.3 compatible.

But you're right - the Java core APIs consist of a mix and match of over-synchronized classes and completely non-thread-safe ones (even in java.io - ByteArray*Streams are thread-safe while Data*Streams are not).
And it's so easy to accidentally use such a class - e.g. in my case at least 2 thin locks were acquired for each byte I processed - this destroys performance on multiprocessor systems even in the uncontended case.

I think Sun is just afraid of creating a big mess of classes which basically do the same thing but are or are not thread-safe :-/
However, they could be named according to a scheme like Fast*... or whatever.

lg Clemens

cowwoc
Offline
Joined: 2003-08-24
Points: 0

Someone should open an RFE against this. I'm sure Sun will make it happen in a future release.

tmarble
Offline
Joined: 2003-08-22
Points: 0

Ig:

Allow me to suggest the NetBeans Profiler.
I haven't used it for monitor analysis, but
the threads stuff might just help you out:

http://profiler.netbeans.org/features.html

Regards,

--Tom

linuxhippy
Offline
Joined: 2004-01-07
Points: 0

Thanks a lot for this tip. I already use the NetBeans profiler and it's a great tool, simply wonderful.
However, it's not able to create such detailed statistics :-/

I also talked to ej-technologies (creator of JProfiler) and they told me that for technical reasons they can only collect data of contended locks.

Thanks for answering, lg Clemens