Posted by mkarg
on January 3, 2010 at 7:27 AM PST
Blue sky, 25°C, the ideal weather to solve strange JNI problems. So I spent another valuable free day solving one of the mysteries of mankind: Why is my Shell Extension crashing? (For those who do not know what a Shell Extension is: in short, you could say it is a custom icon in the Windows File Explorer, and I want to implement it in Java using JNI.)
Every time XP's Windows Explorer unloads and reloads my DLL (which happens quite often, since Explorer seems to be instructed to free as much memory as possible; looking at the log files, I wonder why it has not already had a heart attack from all the stress it causes itself), JNI crashes. At first I did not understand why. All the DLL does is create a Java VM using JNI, and the code is right (otherwise it would not run the first time either). So I logged a lot and drilled down to the core of the problem, and here is the result. If you ever want to use JNI, I hope you will remember this:
In fact, the JNI documentation is wrong, or at least incomplete. It currently says that a process can create only one JVM. That is not really wrong (if you try it a second time, it will actually crash). But they forgot to tell you: you cannot even destroy your JVM and create it again. So the documentation should actually say: "A process can call JNI_CreateJavaVM only once." That is how it works. Sad but true: you must always remember your JVM and must never destroy it. I wonder why there is a DestroyJavaVM method at all, since it seems to do nothing. I also wonder why the JNI_GetCreatedJavaVMs method returns an array of JVMs, since it always returns either zero or one VM, never more than one. Strange, isn't it? But again, that is how it works, and Sun does not accept it as a bug.
I also wonder how you are supposed to get the correct heap and stack settings, because if someone else already created a JVM with "wrong" settings, you will have to live with that!
So if you cannot keep track of an already created JVM (e.g. because you are implementing a multithreaded DLL like a Shell Extension, or because some other Shell Extension is also based on Java and you just do not know about it -- yes, that's possible, since "One process must only call JNI_CreateJavaVM once" really means ONE PROCESS and not THAT PART OF THE CODE THAT YOU WROTE ON YOUR OWN), the best algorithm to get a JVM is:
Call JNI_GetCreatedJavaVMs to find out whether there is a JVM already. If there is one, AttachCurrentThread will allow the current thread to use it. If none exists yet, call JNI_CreateJavaVM to create one.
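To make the order of the three steps concrete, here is a minimal sketch of the control flow in C. It is deliberately not real JNI code: `vm_t`, `vm_api`, `acquire_vm` and the `fake_*` helpers are names made up for illustration so that the sketch compiles without jni.h. In a real DLL the three function pointers would presumably be thin wrappers around JNI_GetCreatedJavaVMs, AttachCurrentThread and JNI_CreateJavaVM.

```c
#include <stddef.h>

/* Opaque VM handle; stands in for JavaVM* so the sketch builds without jni.h. */
typedef struct vm vm_t;

/* Hypothetical function table (an assumption, not a JNI type). Each entry
   models one of the three JNI calls named in the text. */
typedef struct {
    int (*find_existing)(vm_t **out); /* models JNI_GetCreatedJavaVMs: returns 0 or 1 */
    int (*attach_thread)(vm_t *vm);   /* models AttachCurrentThread: 0 on success */
    int (*create_new)(vm_t **out);    /* models JNI_CreateJavaVM: 0 on success */
} vm_api;

/* Reuse the process-wide VM if one exists, otherwise create it. */
vm_t *acquire_vm(const vm_api *api)
{
    vm_t *vm = NULL;

    if (api->find_existing(&vm) > 0)
        return api->attach_thread(vm) == 0 ? vm : NULL;

    return api->create_new(&vm) == 0 ? vm : NULL;
}

/* --- In-memory fake, only there to exercise the control flow --- */
struct vm { int create_calls; int attach_calls; };

static struct vm fake_instance;
static vm_t *fake_current = NULL;

static int fake_find(vm_t **out)
{
    *out = fake_current;
    return fake_current != NULL ? 1 : 0;
}

static int fake_attach(vm_t *vm)
{
    vm->attach_calls++;
    return 0;
}

static int fake_create(vm_t **out)
{
    fake_instance.create_calls++;
    fake_current = &fake_instance;
    *out = fake_current;
    return 0;
}

static const vm_api fake_api = { fake_find, fake_attach, fake_create };
```

Calling `acquire_vm(&fake_api)` twice creates the VM on the first call and only attaches on the second. Note that this sketch on its own is not thread-safe: nothing stops two threads from both seeing "no VM yet".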
Okay, so I learned that, but still every other minute my Shell Extension makes Windows Explorer crash. Damn. There must be another problem.
And in fact, there is. It took me four hours, but finally I found out (and kicked my own ass for being so stupid): a Shell Extension is entered by lots of threads in parallel. It seems that Windows Explorer internally uses a myriad of pooled threads and picks one seemingly at random for each method call. Well, actually I could imagine that it is not purely random, but that they have implemented Windows Explorer as a lot of micro fibers to be as scalable as possible, which I think is a great design. But in the log it looks like a weird sequence of calls, since it is so highly parallelized. Windows Explorer knows quite well that most of its calls (like "load icon of entry" and "load text of entry") are not really a sequence and can be made in parallel. And it really does call them in parallel.
So what actually happens is that a lot of threads try to run the algorithm described above in parallel. And it happens that all of them see that there is no JVM and create one. Stop! Remember, JNI_CreateJavaVM may only be called once? Oops...
The solution is pretty easy. Windows has a great high-performance facility for preventing several threads of the same process from running through the same piece of code concurrently: the Critical Section. It is like a Mutex, but it only works inside a single process, since it is not a kernel object but just a data structure in the process's own memory. As long as there is no contention, it gets by with an atomic machine code instruction (test and set), so it is really, really fast compared to a kernel object like a Mutex or a Semaphore. OK, just four lines of code (InitializeCriticalSection, EnterCriticalSection, LeaveCriticalSection, DeleteCriticalSection; BTW, DllMain is a great place for one-time DLL initialization and de-initialization!) will solve that. Wow, yes, it works now. Really cool. But what is that? Another crash?! But why?
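For illustration, here is roughly how those four calls can wrap the "check, then create" step. This is a sketch under assumptions, not the code from my DLL: `vm_guard_setup`, `vm_guard_teardown`, `get_vm` and `create_vm_stub` are hypothetical names, the VM handle is a plain `void *` instead of a `JavaVM *`, and the non-Windows branch exists only so that the sketch also builds on other platforms.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
typedef CRITICAL_SECTION vm_lock_t;
static void vm_lock_init(vm_lock_t *l)   { InitializeCriticalSection(l); }
static void vm_lock_enter(vm_lock_t *l)  { EnterCriticalSection(l); }
static void vm_lock_leave(vm_lock_t *l)  { LeaveCriticalSection(l); }
static void vm_lock_delete(vm_lock_t *l) { DeleteCriticalSection(l); }
#else
/* Fallback for non-Windows builds; not part of the Shell Extension story. */
#include <pthread.h>
typedef pthread_mutex_t vm_lock_t;
static void vm_lock_init(vm_lock_t *l)   { pthread_mutex_init(l, NULL); }
static void vm_lock_enter(vm_lock_t *l)  { pthread_mutex_lock(l); }
static void vm_lock_leave(vm_lock_t *l)  { pthread_mutex_unlock(l); }
static void vm_lock_delete(vm_lock_t *l) { pthread_mutex_destroy(l); }
#endif

static vm_lock_t g_vm_lock;
static void *g_vm = NULL;       /* hypothetical handle; a JavaVM* in real code */
static int g_create_calls = 0;  /* bookkeeping for this demo only */

/* Hypothetical stand-in for the once-only creation step
   (the find/attach/create sequence in a real DLL). */
static void *create_vm_stub(void)
{
    static int dummy;
    g_create_calls++;
    return &dummy;
}

/* Call once when the DLL is loaded (DLL_PROCESS_ATTACH in DllMain). */
void vm_guard_setup(void)    { vm_lock_init(&g_vm_lock); }

/* Call once when the DLL is unloaded (DLL_PROCESS_DETACH in DllMain). */
void vm_guard_teardown(void) { vm_lock_delete(&g_vm_lock); }

/* May be called from any number of threads: the check and the creation
   happen under the lock, so only the first caller ever creates. */
void *get_vm(void)
{
    void *vm;

    vm_lock_enter(&g_vm_lock);
    if (g_vm == NULL)
        g_vm = create_vm_stub();
    vm = g_vm;
    vm_lock_leave(&g_vm_lock);

    return vm;
}
```

The important design point is that the NULL check sits inside the locked region. Checking first and locking only around the creation would bring the race right back.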
Again I searched the logs and found that it happens in AttachCurrentThread. Strange. That code works pretty well most of the time, but it crashes sometimes. And that "sometimes" is always the same situation: after the DLL got unloaded from memory and reloaded by the same process. But why? It seems as if there is a bug (or another undocumented restriction) in jvm.dll: if the calling (!) DLL is unloaded and reloaded, then AttachCurrentThread will crash. I do not really understand it, because if you unload and reload jvm.dll itself, it works pretty well. Strange, very strange. I spent another four hours but did not find the cause. I could imagine a reason: maybe jvm.dll writes a pointer back to "itself" into the heap, and when my DLL gets reloaded, it ends up at another code location, so the pointer is now pointing to "Nirvana". But that is just an idea; I have no proof for it, since I have not yet debugged the jvm.dll source code (which I maybe should do next).
So for now, I just added a workaround: I always return S_FALSE from DllCanUnloadNow to prevent my DLL from getting unloaded. While that is not very smart (since it prevents Explorer's memory manager from working nicely), it does work around the symptom: the DLL never gets unloaded, so it will never run into the problem. And it seems to work.
I will keep you updated about that. I reported the problem to Sun. Maybe they will tell me where "my" fault is. :-)