Posted by opinali
on October 29, 2009 at 8:36 AM PDT
In my latest attempt to stress the JavaFX platform, I ported the Strange Attractor demo/benchmark. Unlike JavaFX Balls, this is not scenegraph-driven animation but old-school "pixel by pixel" drawing. Still, it makes for another batch of interesting findings, including a few issues in the JavaFX Script language and its compiler, and other topics like fractal maths, BigDecimal, and JDK 7's stack allocation.
UPDATE: All webstart apps here are now updated for JavaFX 1.3, so their performance may be different from what is described by the article.
The porting was easy, as soon as I found my way around JavaFX's limitations. It happens that JavaFX doesn't let you "paint" a component, like AWT/Swing and most other 2D toolkits. You can only "draw" things by composing scenegraph nodes. But this wouldn't work here: Strange Attractor is a particle animation demo, and it uses 300K particles to render a 3D fractal. I could use a tiny, pixel-sized rectangle for each particle, but this very likely would come dead last in the performance race. Even if JavaFX's scenegraph scales very well, the memory weight of all those nodes and the rendering overhead would certainly kill it.
But the solution is simple. First, I create an ImageView node for the animation. This contains an Image object that is initialized from a blank image. So far, standard stuff. Now, in order to "paint" the fractal in this Image, I do this:
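function move (deltaX:Float, deltaY:Float) {
    // On the desktop profile, platformImage is really a BufferedImage
    var pixels = ((img.platformImage as BufferedImage).getRaster()
        .getDataBuffer() as DataBufferInt).getData();
    // ... (rest of the rendering loop elided)
    pixels[index] = min(pixels[index] + 0x202020, 0xFFFFFF);
}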
I have to resort to two "internal" tricks. First, I access the Image's platformImage property; its declared type is opaque (Object), as the actual type is platform-dependent. For the desktop profile, implemented on top of Java SE APIs like Java2D, that type is BufferedImage. So I just need to cast, then use standard Java SE APIs to put my dirty hands on the int array that contains the pixels. I can fill this array with black with Arrays.fill(), read and write individual pixels by just indexing its positions, etc.
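For readers who know Java better than JavaFX Script, the pixel-access half of this trick can be sketched in plain Java SE. This is only a minimal illustration, assuming a TYPE_INT_RGB image; the PixelPlotter class and its method names are hypothetical and not part of the demo's source.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferInt;
import java.util.Arrays;

// Hypothetical helper; names are illustrative, not taken from the demo.
class PixelPlotter {
    private final int width;
    private final int[] pixels; // live view of the image's backing array

    PixelPlotter(BufferedImage image) {
        // Assumes TYPE_INT_RGB, which is backed by a DataBufferInt (one int per pixel).
        this.width = image.getWidth();
        this.pixels = ((DataBufferInt) image.getRaster().getDataBuffer()).getData();
    }

    // Paints the whole frame black.
    void clear() {
        Arrays.fill(pixels, 0);
    }

    // Brightens one pixel by a fixed gray step, saturating at white.
    void plot(int x, int y) {
        int index = y * width + x;
        pixels[index] = Math.min(pixels[index] + 0x202020, 0xFFFFFF);
    }

    // Reads back the raw value stored for one pixel.
    int get(int x, int y) {
        return pixels[y * width + x];
    }
}
```

Writes through the array are visible to anything that later reads the BufferedImage, which is why no copy-back step is needed; the price is that Java2D can no longer cache the image in accelerated memory once the raw array has been handed out.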
In the second trick, as soon as the frame is complete, I call ImageView.impl_transformsChanged(). This is another internal method; it is invoked automatically by the runtime when the node's transforms are changed. Normal apps never need to call it, so it's not an official API. But it has the side effect of forcing the ImageView node to refresh itself from the backing pixels. Notice that my ImageView has no transforms at all, so this should not perform any other redundant work.
In an ideal world, we'd have an official ImageView.invalidate() method. There are some issues with my hacks, so I filed the bug RT-5548: Provide (official) support for bitmapped rendering. I explore the issue in more depth in this bug, so just read, comment, or vote there if you are interested. This is all we can do to influence/lobby a project that's not open source. I will just paste my final comment here: "Right now JavaFX is pretty hard for third-party extension developers. Suppose I want to create a new Control that really demands custom (non scenegraph-based) rendering, what should I do? Full source code is not available so I can't consult it; in-depth technical documentation does not exist at all; the platform still misses important functionality for some people. If the team at least provides some guideline about JavaFX-to-native-2D integration, at least we can work around these limitations while they exist."
So, just how fast can JavaFX move these particles? I've found that this depends on several factors, so I actually created four variants of the program, identified below by their class names (I've bundled each in a single .fx file, mostly to make the variants easier to manage). In these names, Float = float precision; Double = double precision; List = particles are stored in an ad-hoc singly linked list with a next pointer in each particle; Seq = particles don't have that pointer, but they are stored in a JavaFX Script sequence. The version that matches the other ports of Strange Attractor is MainListDouble.
I tested with the early access of JDK 6u18, yet another important update for client-side Java in general and JavaFX in particular; for this specific benchmark, 6u18 brings an updated version of HotSpot, so CPU-bound code should benefit. (Click each program's name to launch it; source code here.)
Some pretty interesting results here. First, the score seems to be strongly influenced by memory access. The major difference between the four variants is the size and layout of particle data. Each Particle is a simple object with x, y, z fields, plus the Java object header and an extra int VFLGS$0 field that's used by JavaFX Script's properties (each 32 properties share one such field, which is a bitmap; classes with more than 32 properties need additional bitmap fields). We have 300K particles, so a Float particle is 24 bytes = 7.2 MB for the entire fractal, and a Double particle is 36 bytes, likely 40 due to alignment = 12 MB total (estimates for a 32-bit JVM). Even the lower value doesn't fit entirely in my Q6600 CPU's 4 MB L2 cache, and there are other memory pages involved in rendering (the pixel array, which is 880 KB; code from the app, JVM, OS…), so the rendering will hit the FSB hard.
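The footprint arithmetic above is easy to reproduce. The following is a back-of-the-envelope sketch only: the 8-byte header and 8-byte alignment are assumptions for a 32-bit HotSpot JVM, and the ParticleFootprint class is a hypothetical name used just for this illustration.

```java
// Rough estimate only; header size and alignment are assumptions for a 32-bit JVM.
class ParticleFootprint {
    static final int HEADER_BYTES = 8; // assumed object header size
    static final int FLAGS_BYTES = 4;  // the int VFLGS$0 bitmap field
    static final int ALIGNMENT = 8;    // assumed object alignment

    // Size of one particle holding three coordinates of the given width.
    static int particleBytes(int coordinateBytes) {
        int raw = HEADER_BYTES + FLAGS_BYTES + 3 * coordinateBytes;
        return (raw + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT; // round up to alignment
    }

    // Total heap taken by all particle objects.
    static long totalBytes(int particles, int coordinateBytes) {
        return (long) particles * particleBytes(coordinateBytes);
    }
}
```

With 300,000 particles this gives 7,200,000 bytes for float coordinates and 12,000,000 bytes for double coordinates, matching the figures in the text.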
The List variants are also faster. Why? The datasets are the same size: there is an additional reference field in each Particle, but then I don't need a sequence object with one reference per particle. JavaFX's sequences are well optimized; they are backed by native arrays just like Java SE's ArrayList. (The real story is more complex: there are many concrete sequence implementations, and the compiler picks and changes the most adequate as needed; sequences of value types can map to optimized sequences without boxing overhead, so it's even better than ArrayList and closer to a growable version of primitive arrays.) But there is some small overhead to iterate the sequence, and once again there's worse memory locality. In the MainSeq* programs, the heap will contain one huge array with at least 300K references = 1.2 MB, plus 300K Particle objects somewhere else. The sequence's backing array is treated by the JVM as a "large object", which by itself may have some performance consequences. But the major problem is that a sequential iteration through all particles will demand a non-sequential memory access pattern, alternating between the sequence's backing array pages and other pages containing the particles.
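To make the two layouts concrete, here is a rough Java analogue of the List and Seq storage schemes. It is a sketch under assumptions: ParticleNode and ParticleStores are hypothetical names, and a plain reference array stands in for the JavaFX Script sequence.

```java
// Hypothetical Java analogue of the two storage schemes; not the demo's code.
class ParticleNode {
    final float x, y, z;
    ParticleNode next; // used only by the linked-list traversal

    ParticleNode(float x, float y, float z) {
        this.x = x;
        this.y = y;
        this.z = z;
    }
}

class ParticleStores {
    // Builds 'count' particles, chained through 'next' and also held in an array.
    static ParticleNode[] build(int count) {
        ParticleNode[] all = new ParticleNode[count];
        for (int i = 0; i < count; i++) {
            all[i] = new ParticleNode(i, 0f, 0f);
            if (i > 0) {
                all[i - 1].next = all[i];
            }
        }
        return all;
    }

    // List layout: every step touches only particle objects.
    static float sumXLinked(ParticleNode head) {
        float sum = 0f;
        for (ParticleNode p = head; p != null; p = p.next) {
            sum += p.x;
        }
        return sum;
    }

    // Seq layout: every step reads the reference array, then the particle.
    static float sumXIndexed(ParticleNode[] particles) {
        float sum = 0f;
        for (int i = 0; i < particles.length; i++) {
            sum += particles[i].x;
        }
        return sum;
    }
}
```

Both traversals visit the same objects in the same order; the only difference is whether the address of the next particle comes from the current particle or from a separate array, which is the memory access pattern discussed above.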
The JIT compiler also appears as a very important performance factor. HotSpot Server shows a whopping 42% better frame rate in the easiest test, MainListFloat; in the hardest, MainSeqDouble, it still produces a very large advantage of 25%. This result is very interesting because the animation's inner loop is relatively simple: it just performs a few multiplications for coordinate transformation and plots a pixel in the resulting position. The particles are all constant data, the transform matrix is calculated only once per frame, and the inner loop contains no expensive operations like allocation. (There is one call to a tiny method that's trivial to inline even for HotSpot Client; performance didn't change after I refactored some code to introduce this method.) I guess HotSpot Server is just smarter in memory access, e.g. with prefetching instructions.
Then I said to myself: What a wonderful… no; I said: how could I make this code even faster? One obvious target is avoiding the cost of JavaFX Script objects, which are a bit larger than similar Java objects due to those property bitmap fields. The new variant MainListFloatJava implements the Particle class as a Java object. And while I'm at it, why not eliminate all object model overhead completely and just store all particle data in a raw float array, with 3 consecutive positions for each particle's (x, y, z) data? The variant MainListFloatRaw does this.
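A minimal sketch of what such a raw layout could look like in Java follows. RawParticles and its accessors are hypothetical names chosen for illustration; the actual MainListFloatRaw code may be organized differently.

```java
// Hypothetical flat layout: x0, y0, z0, x1, y1, z1, ... in one float array.
class RawParticles {
    final float[] data;

    RawParticles(int count) {
        data = new float[count * 3];
    }

    // Stores the coordinates of particle i in three consecutive slots.
    void set(int i, float x, float y, float z) {
        int base = i * 3;
        data[base] = x;
        data[base + 1] = y;
        data[base + 2] = z;
    }

    float x(int i) { return data[i * 3]; }
    float y(int i) { return data[i * 3 + 1]; }
    float z(int i) { return data[i * 3 + 2]; }

    // Bytes of coordinate payload, ignoring the single array header.
    long payloadBytes() {
        return 4L * data.length;
    }
}
```

For 300,000 particles the payload is 3,600,000 bytes, with no per-particle header, flag field, or reference, and the particles sit in memory in exactly the order they are iterated.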
Once again, very interesting speedups. The Java variant is almost 4% better for both HotSpot Client and Server, a nice although not vast advantage; but the Raw variant is an incredible 82% better for HotSpot Client, and 53% better for HotSpot Server. (The fact that Server gets a smaller speedup reinforces the thesis that its advantage in the previous tests was mostly related to more efficient memory access; Server had already caught most of the low-hanging fruit in the previous test.)
The fact that I could make this program a full 4% faster by just rewriting a trivial class in Java means that the overhead of the binding bitmaps, inserted by javafxc, is pretty annoying. I did my best to help the compiler: my Particle class is script-private, and its properties have no triggers and are never involved in binding expressions, which allows javafxc to optimize out some of the generated code. But that was not enough. Checking the bytecode, I identified several optimization opportunities, so I filed bug JFXC-3456: Optimize handling of VFLGS$N bitmaps and other property-related code. (This blog was a long time in the making, so this bug is already fixed for SoMa and marked as a dupe; it seems the Compiled bind rework of the next release, mentioned in my investigation of binding performance, is making fast progress.)
And I was also horrified by the way javafxc (or rather, the JavaFX Script language) handles null values: all nulls are masked out, so we never get a NullPointerException. I failed to notice this in previous explorations of JavaFX (it's not documented in the Language Reference). I reported this as a bug in JFXC-3447: Support NullPointerException. Yeah, I'm asking to have my NPEs (and also a few other important exceptions) back, or at least to have some control over this critical behavior; please check the bug report before you think I have some fetish for stack traces.
And yes, these results are yet more evidence that the Java platform would benefit from value types, so I could have a headerless Particle class (ok, struct) and put it in a by-value array (objects stored directly in the array without references). This would produce exactly the same memory layout as MainListFloatRaw, except that my code wouldn't need several changes for the worse (low-level array manipulation like particles[i + 1] instead of particles[i].y, etc.). This memory layout requires 300K*4*3 = 3.6 MB, just half the footprint of MainListFloat, and it's even better as all particles are laid out in perfect sequential order. We're still overflowing the L2 cache, but much less than before, so the performance gain is huge.
The Java community has been clamoring for some kind of value type support since forever; the last attempt came from John Rose's Tuples proposal, a relatively modest and easy change, but still out of JDK 7. The Java language is basically frozen when it comes to fundamental capabilities… but not the Java platform. See the great JSR-292, which basically "fixes" the JVM for all dynamically typed languages. This is a good precedent, because this huge platform enhancement is basically useless for the Java language. The DaVinci project is also working hard on all sorts of cool features to support immediate (headerless) objects, including fixnums, tuples, structs, inline arrays in tail position, etc.; and also tail calls, continuations and other fundamental techniques that are worth gold for many languages; see this presentation. These enhancements from the DaVinci Machine will most probably not come to a future version of the Java language, but they will eventually be adopted by other JVM languages like Scala, Clojure, JRuby etc.; and obviously JavaFX Script, if Sun gets its act straight.
Meanwhile, JDK 7 is also making great progress in the Escape Analysis-based stack allocation optimization, which was recently turned on by default. Some days ago, Slava Pestov was happily tweeting how Factor kills HotSpot Server in a Mandelbrot program (we all love fractal calculation microbenchmarks…). I found not only that Java was faster (from 160ms to 46ms) after eliminating a Complex class that caused 300 MB of allocation per run of the benchmark, but that even keeping this class, JDK 7b72 could run 2.15X faster (74ms) thanks to reducing the churn to 110 MB per run. Keep in mind, however, that Escape Analysis is by definition only good for temporary objects that don't "escape" a single method (or basic block, trace, or whatever the optimization unit); this optimization won't be any help for long-lived data like Strange Attractor's particles.
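To illustrate the kind of code that Escape Analysis targets, here is a small sketch of a Mandelbrot inner loop written twice: once with a temporary-heavy Complex class and once with plain scalars. This is not the benchmark from the tweet; the Complex and MandelbrotSketch classes are hypothetical and exist only to show the allocation pattern.

```java
// Illustrative immutable complex number; every operation allocates a new object.
final class Complex {
    final double re, im;

    Complex(double re, double im) {
        this.re = re;
        this.im = im;
    }

    Complex times(Complex o) {
        return new Complex(re * o.re - im * o.im, re * o.im + im * o.re);
    }

    Complex plus(Complex o) {
        return new Complex(re + o.re, im + o.im);
    }

    double normSquared() {
        return re * re + im * im;
    }
}

class MandelbrotSketch {
    // Two short-lived Complex objects per iteration; none outlives the loop,
    // so they are candidates for elimination by escape analysis.
    static int iterationsWithObjects(double cr, double ci, int max) {
        Complex c = new Complex(cr, ci);
        Complex z = new Complex(0.0, 0.0);
        int n = 0;
        while (n < max && z.normSquared() <= 4.0) {
            z = z.times(z).plus(c);
            n++;
        }
        return n;
    }

    // Hand-flattened version of the same recurrence: no allocation at all.
    static int iterationsWithScalars(double cr, double ci, int max) {
        double zr = 0.0, zi = 0.0;
        int n = 0;
        while (n < max && zr * zr + zi * zi <= 4.0) {
            double nextRe = zr * zr - zi * zi + cr;
            zi = 2 * zr * zi + ci;
            zr = nextRe;
            n++;
        }
        return n;
    }
}
```

The two methods compute the same iteration counts; the difference is only in how much garbage the first one produces when the JIT cannot prove that its temporaries never escape.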
JavaFX vs. Other RIAs
Also, you cannot compare CPU usage, because my JavaFX version is smart enough to only render new frames when the 3D image's position changes; all other versions don't do that, they will peg one CPU core at 100% all the time even if you keep your mouse parked so all frames are identical.
The .NET platform does support value types, so it could potentially be optimized to use this feature, and (unless the .NET JIT compiler is really poor) Java's only hope to match it would be resorting to a low-level implementation like MainListFloatRaw (or waiting for the fruits from the DaVinci Machine project).
Now, the most interesting comparison is not the performance, but the actual image produced by each program. The three original versions produce basically the same image, modulo details like color and some FPS display. But my JavaFX version is distinctively different – check this:
I captured the images in similar positions (mouse parked at the lower-right corner); the difference is very noticeable. My program produces a bigger image, notably in the outer "corona" of the fractal; the deviation seems to grow as a function of the distance from the center. Now, I'm just a rookie in the maths involved in these graphics: since the old times of FractInt for DOS (most awesomest fractal platform evar!), I'm content to code formulas that I find somewhere else, and amuse myself with the result without really understanding it in depth. In this case I didn't even code anything, I just ported C# code to JavaFX Script. The JavaFX image looks better and more complete, but this may be just my bias. Can anybody explain this difference?
While investigating this, I changed the code that calculates the color for each rendered pixel, so particles closest to the observer are brighter, the figure looks solid, and the object is easier to inspect. Performance goes down ~4% in HotSpot Server, ~10% in Client; but the 3D effect is pretty nice (especially when animated).
JavaFX (3D enhanced, Float):
Now, the most interesting discovery, facilitated by this enhancement, is that the fractal changes significantly with numeric precision. Compare the previous image with the following:
JavaFX (3D enhanced, Double) (click image to run):
The last image, created with Double precision, is noticeably different in the outer corona, where you can see a series of bands, like in a snail's shell. Most of these bands are nonexistent, or very hard to see, in Float precision, because the particles that should form the frontiers are in slightly wrong positions. If you're a veteran fractal lover, this is not news: fractals are remarkably dependent on numeric precision; in fact, it's one of the very few CG techniques where floats are not good enough to avoid severe artifacts. Most good fractal programs offer better-than-Double precision, necessary to render in deep zoom or high iteration levels. Strange Attractor performs 300K iterations over a single (x, y, z) position; this is an enormous number of iterations, so any imprecision will quickly escalate into noticeable artifacts.
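The way rounding error snowballs in an iterated chaotic formula can be seen with a much simpler system than the attractor itself. The sketch below uses the logistic map x → 3.9·x·(1 − x) purely as a stand-in (it is not the Strange Attractor formula, which is not reproduced here) and tracks how far the float and double trajectories drift apart; PrecisionDrift is a hypothetical name.

```java
// Stand-in experiment: the logistic map, NOT the demo's attractor formula.
class PrecisionDrift {
    // Largest gap seen between the float and double trajectories
    // over the given number of iterations, both starting at 0.5.
    static double maxGap(int iterations) {
        float xf = 0.5f;
        double xd = 0.5;
        double maxGap = 0.0;
        for (int i = 0; i < iterations; i++) {
            xf = 3.9f * xf * (1f - xf);
            xd = 3.9 * xd * (1.0 - xd);
            maxGap = Math.max(maxGap, Math.abs(xd - xf));
        }
        return maxGap;
    }
}
```

After a handful of iterations the two trajectories still agree to several decimal places, but within a few hundred iterations they are completely unrelated. A fractal that chains 300K iterations leaves plenty of room for the same effect to displace particles visibly.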
If Double is better than Float, wouldn't big decimals be even better? I recoded the calculation method with Java's BigDecimal. The resulting code is horrendous (as usual), which is bad enough in Java but definitely doesn't "fit" in JavaFX Script… you'd expect a language like that to offer a seamless arbitrary-precision decimal type. We could just have some syntax sugar over BigDecimal, to be able to use * instead of multiply(), etc. But the performance would still suck (as usual) because BigDecimal is immutable and the churn of object allocation and GC burns more CPU than the actual calculations. The Java platform desperately needs mutable counterparts of BigDecimal and BigInteger (some implementations already have these for the internal implementation of some operations, but the mutable classes are not public). Then, many high-level languages like JavaFX Script, Groovy, Scala etc., could offer a decimal type complete with operators and other special syntax and semantics, but reusing java.math's implementation and interoperable representation. The mutable objects wouldn't eliminate the advantages of immutability if the programmers don't use them explicitly – the source compiler could do that automatically to compile expressions requiring temporary values (much like javac does for string concatenations since JDK 1.0); still, public mutable APIs would allow much further manual optimization (and value-type BigDecimal, even better…). Anyway, after waiting a few seconds for this calculation (limiting precision to IEEE128 ~= 2X better than Double), the resulting image is not any different to the naked eye, so Double was already good enough.
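To give a feel for the verbosity, here is a hedged sketch comparing one made-up arithmetic step, a·x² + b·y, in double and in BigDecimal with MathContext.DECIMAL128. The formula and the DecimalStep class are illustrative assumptions, not the attractor's actual calculation.

```java
import java.math.BigDecimal;
import java.math.MathContext;

// Illustrative only: the formula a*x*x + b*y is made up for this comparison.
class DecimalStep {
    // IEEE 754R decimal128: 34 significant digits.
    static final MathContext MC = MathContext.DECIMAL128;

    // The step with primitive doubles: one readable expression, no allocation.
    static double stepDouble(double a, double b, double x, double y) {
        return a * x * x + b * y;
    }

    // The same step with BigDecimal: every operator becomes a method call,
    // and every intermediate result is a freshly allocated immutable object.
    static BigDecimal stepDecimal(BigDecimal a, BigDecimal b, BigDecimal x, BigDecimal y) {
        BigDecimal ax2 = a.multiply(x, MC).multiply(x, MC);
        BigDecimal by = b.multiply(y, MC);
        return ax2.add(by, MC);
    }
}
```

A single expression turns into three temporaries plus the result, which is the allocation churn described above; multiplied by hundreds of thousands of iterations per frame, it adds up quickly.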
But after this digression, the conclusion of this experiment with numeric precision is that… I still don't know why the other languages produce different output even at same precision. The Java platform is well-known to have a very strict math spec, but the fractal calculation uses extremely basic arithmetic (only multiplications and sums) so this should not be a factor.