Decora and Multithreading

4 replies [Last post]
liquid
Offline
Joined: 2005-06-16
Points: 0

Hi everyone (especially Chris, since i believe you might be one of the few that can answer the question)

After building the Decora SSE backend, i'm not so sure i can see a definitive increase in performance over the regular Java based renderer (a testament to the power of the vm :). However, i see that the backends (java & simd) only use one of my cpu cores.

I had a thought: since 1) the base jsl shader for an effect is the same for cpu based and gpu based renderers, and 2) i assume the style of coding in jsl shaders allows for parallel execution (otherwise i believe it would be quite hard to generate pixel/fragment shaders), would it be hard to add multithreading to those renderers so they use all the available cores?

The first idea that came to my mind was just to divide the source content into pieces and give those to 'X instances' of the renderer code (inside the renderer singleton).

Is this a reasonable approach? Could i do it myself, for instance?

I haven't seen much activity in the svns for scenario and decora recently, so i guess you must all be working on something else for the javafx sdk release - or on a private svn :( - and maybe you have other tricks up your sleeve for the software decora performance that we don't know about (care to share if that's the case)?

BTW, the d3d backend perf rocks (and the zoomy demo is quite trippy, it's like getting back to the 70s)

Chris Campbell

On May 26, 2008, at 12:02 PM, scenario@javadesktop.org wrote:
> Hi everyone (especially Chris, since i believe you might be one of
> the few that can answer the question)
>
> After building the Decora SSE backend, i'm not so sure i can see a
> definitive increase in performance over the regular Java based
> renderer (a testament to the power of the vm :). However, i see that
> the backends (java & simd) only use one of my cpu cores.
>

The native backend is indeed quite a bit faster than the pure Java
backend (around 50% faster in some cases, and possibly more in others
as well). You can use the run_demo.sh script in the Decora-Demo
project with the -perf option to measure the speedups on your machine.

> I had a thought, since 1) the base jsl shader for an effect is the
> same for cpu based and gpu based renderers, and 2) i assume the
> style of coding in jsl shaders allows for parallel execution
> (otherwise i believe it would be quite hard to generate pixel/fragment
> shaders), would it then be hard to add multithreading to those
> renderers so they use all the available cores.
>
> The first idea that came to my mind was just to divide source
> content and give them to 'X instances' of the renderer code (inside
> the renderer singleton).
>
> Is this a reasonable way ? Could i do it myself for instance ?
>

I think it would be possible to divvy up the work for each individual
filter() operation and send the work to a worker thread pool. For
example, if there are four available cores, you could treat the
destination image as four horizontal slices and then set up the source
regions to target the appropriate slice in the destination. I suspect
that there will be much higher overhead in dividing up the work this
way and handing it off to different threads, so you would only want to
use this approach if the performance benefit from multithreaded
processing outweighs the cost of setting everything up, i.e., only if
the source/dest images are sufficiently large.
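
A minimal sketch of the slicing idea described above, in Java. Nothing here is Decora code: the SlicedFilter class, the filterRows stand-in for an effect's inner loop, its bit-inverting per-pixel operation, and the MIN_PIXELS_FOR_THREADS cutoff are all hypothetical names and values chosen for illustration. The destination is treated as N bands of rows, each band goes to a worker in a thread pool, and small images stay on the calling thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SlicedFilter {
    // Hypothetical cutoff: below this many pixels, skip the thread pool.
    // The right value would have to be measured, it is not known here.
    static final int MIN_PIXELS_FOR_THREADS = 64 * 64;

    /** Stand-in for one effect's inner loop, applied to rows [y0, y1). */
    static void filterRows(int[] src, int[] dst, int w, int y0, int y1) {
        for (int y = y0; y < y1; y++) {
            for (int x = 0; x < w; x++) {
                int i = y * w + x;
                dst[i] = ~src[i]; // placeholder per-pixel operation
            }
        }
    }

    /** Filters the whole image, split into horizontal slices when large enough. */
    static void filter(int[] src, int[] dst, int w, int h,
                       ExecutorService pool, int slices) {
        if (slices <= 1 || (long) w * h < MIN_PIXELS_FOR_THREADS) {
            filterRows(src, dst, w, 0, h); // not worth the hand-off cost
            return;
        }
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int s = 0; s < slices; s++) {
            // Integer arithmetic spreads any remainder rows across slices.
            final int y0 = (int) ((long) h * s / slices);
            final int y1 = (int) ((long) h * (s + 1) / slices);
            tasks.add(() -> {
                filterRows(src, dst, w, y0, y1);
                return null;
            });
        }
        try {
            // invokeAll blocks until every slice is done; get() rethrows failures.
            for (Future<Void> done : pool.invokeAll(tasks)) {
                done.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while filtering", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a slice failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        int w = 1920, h = 1200;
        int[] src = new int[w * h];
        int[] dst = new int[w * h];
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            filter(src, dst, w, h, pool, cores);
        } finally {
            pool.shutdown();
        }
        System.out.println("filtered " + w + "x" + h + " with up to " + cores + " slices");
    }
}
```

Each slice writes to a disjoint range of dst, so the workers need no locking. An effect that reads neighbouring pixels (a blur, say) would still need each slice's source region to extend past its own rows.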

We haven't done much investigation in this area, and aren't planning
to do so in the foreseeable future, so if this is a topic of interest
for you, please feel free to experiment and report your findings to
the list.

Thanks,
Chris

> I haven't seen much activity in the svns for scenario and decora
> recently so i guess you must all be working on something else for
> the javafx sdk release - or on a private svn :( - and maybe you have
> other tricks up your sleeve for the software decora performance we
> don't know about (mind to share if it's the case) ?
>
> BTW, the d3d backend perf rocks (and the zoomy demo is quite trippy,
> it's like getting back to the 70s)
> [Message sent by forum member 'liquid' (liquid)]
>
> http://forums.java.net/jive/thread.jspa?messageID=276623
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@scenegraph.dev.java.net
> For additional commands, e-mail: dev-help@scenegraph.dev.java.net
>

liquid
Offline
Joined: 2005-06-16
Points: 0

A little update after a couple hours of hacking...

1) What i saw by eyeballing actually got weird with -perf. The SSE backend is actually performing _worse_ here than the java one, so something must be wrong on my computer (Athlon64, etc.). I specifically tested shadowtest, since in your commit logs the exec time is something like 50ms IIRC. On my machine, it's more like 273ms (and 223ms for java).

By switching the SSE2 arch flag (which the cpu should support), the perf is even worse... Something's up.

I also tried weird combinations of o2 ox SSE SSE2, or nothing at all. And the winning combo by a ridiculous margin is o2/SSE for 272ms...

I also noticed the perf is better after launching the same -perf demo multiple times, maybe windows XP does some weird things (IIRC it somehow 'optimizes' executables it launches so maybe it's that).

Or maybe it's actually because of Visual C++ Express 2008, someone might have made it intentionally optimize real bad :)

2) I manually multithreaded the software peer of an effect in 10 minutes or so, so it's really hacked up with an axe. The results are ... how do i say this ... promising. I just modified the brightpass filter, which was the simplest jsl i could find (so the software peer would not be that complicated).

I basically cut the loop into two horizontal slices, as we said earlier, and sent those to two FutureTasks (i'm testing on a dual core). I ran -perf, with lots of other programs running (so the results might vary), on a modified version of BloomTest that only has the brightpass effect, on 2 images: the regular 512*256 one, and one at 1920*1200.
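
For readers who want to picture the hack, here is a rough sketch of a two-slice split with FutureTask. It is not the actual modified peer: the TwoSliceBrightPass class, the simplified bright-pass rule (keep a pixel if its luma reaches a threshold, otherwise write 0), and the integer luma weights are all assumptions made up for this example. It also runs the second slice on the calling thread, which is one possible choice and not necessarily what was done here.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

final class TwoSliceBrightPass {
    /** Approximate luma of an ARGB pixel (hypothetical integer weights). */
    static int luma(int argb) {
        int r = (argb >> 16) & 0xFF;
        int g = (argb >> 8) & 0xFF;
        int b = argb & 0xFF;
        return (299 * r + 587 * g + 114 * b) / 1000;
    }

    /** Simplified bright pass over rows [y0, y1): dark pixels become 0. */
    static void brightPassRows(int[] src, int[] dst, int w,
                               int y0, int y1, int threshold) {
        for (int i = y0 * w; i < y1 * w; i++) {
            dst[i] = luma(src[i]) >= threshold ? src[i] : 0;
        }
    }

    /** Splits the image into a top and a bottom half, one FutureTask each. */
    static void brightPassTwoSlices(int[] src, int[] dst, int w, int h,
                                    int threshold) {
        int mid = h / 2;
        FutureTask<Void> top = new FutureTask<Void>(
                () -> brightPassRows(src, dst, w, 0, mid, threshold), null);
        FutureTask<Void> bottom = new FutureTask<Void>(
                () -> brightPassRows(src, dst, w, mid, h, threshold), null);
        new Thread(top).start(); // first half on a second thread
        bottom.run();            // second half on the calling thread
        try {
            top.get();    // wait for the other thread, rethrow its failure
            bottom.get(); // already complete, surfaces any failure
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while filtering", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a slice failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        int w = 512, h = 256;
        int[] src = new int[w * h];
        java.util.Arrays.fill(src, 0xFF808080); // mid-grey test image
        int[] dst = new int[w * h];
        brightPassTwoSlices(src, dst, w, h, 100);
        System.out.println("first pixel: " + Integer.toHexString(dst[0]));
    }
}
```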

I also tried multiplying the -perf duration by 10, and took the scores after minimizing and restoring the window a lot (the scores are somehow different, so i took the lowest of those) in an attempt to warm up the soup even more...

The really informal results:

Image size   Single-threaded       Multithreaded
512*256      12.80ms - 12.85ms     6.67ms - 7.45ms
1920*1200    205ms                 127.7ms
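
To put those timings in perspective, dividing the single-threaded time by the multithreaded one gives a speedup of roughly 1.7x to 1.9x for 512*256 and about 1.6x for 1920*1200, against an ideal of 2x on two cores. The small sketch below just redoes that arithmetic. The SpeedupCheck class and its method names are made up for this example, and the numbers are the informal ones reported above.

```java
final class SpeedupCheck {
    /** Speedup = single-threaded time divided by multithreaded time. */
    static double speedup(double singleMs, double multiMs) {
        return singleMs / multiMs;
    }

    /** Parallel efficiency: fraction of the ideal N-times speedup achieved. */
    static double efficiency(double speedup, int threads) {
        return speedup / threads;
    }

    public static void main(String[] args) {
        int threads = 2; // dual core, two slices
        String[] labels = {
            "512*256 (worst case)", "512*256 (best case)", "1920*1200"
        };
        // {single-threaded ms, multithreaded ms}, taken from the post above.
        double[][] runs = {
            {12.80, 7.45}, // slowest multithreaded vs fastest single-threaded
            {12.85, 6.67}, // fastest multithreaded vs slowest single-threaded
            {205.0, 127.7}
        };
        for (int i = 0; i < runs.length; i++) {
            double s = speedup(runs[i][0], runs[i][1]);
            System.out.printf("%s: speedup %.2fx, efficiency %.0f%%%n",
                    labels[i], s, 100.0 * efficiency(s, threads));
        }
    }
}
```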

If i can do that in a couple of hours without understanding much of the architecture or the shader, i'm sure it would be a piece of cake for someone who does.

I really like decora, i must say, and it tickles my mind. I have a couple of ideas that might be worth exploring (deployment, sharing, visualizing); i'll try those another day. For those interested, they're all in my twitter stream (lqd).

Chris Campbell

On May 28, 2008, at 9:19 AM, scenario@javadesktop.org wrote:
> A little update after a couple hours of hacking...
>
> 1) What i saw by eyeballing got actually weird with -perf. The SSE
> backend is actually performing _worse_ here than the java one, so
> something must be wrong on my computer (Athlon64, etc.). I
> especially tested shadowtest, since in your commit logs the exec
> time is something like 50ms IIRC. On my machine, it's more like
> 273ms (and 223ms for java)
>

Which model of Athlon64 are you using, just for reference?

To be honest, we haven't (yet) done much performance analysis of the
Decora-SSE backend on different CPU architectures. The few of us
working on Decora have similar machines, for example I'm using a MBP
with 2.4 GHz Core 2 Duo, so we haven't ventured outside that circle yet.

> By switching the SSE2 arch flag (which the cpu should support), the
> perf is even worse... Something's up.
>
> I also tried weird combinations of o2 ox SSE SSE2, or nothing at
> all. And the winning combo by a ridiculous margin is o2/SSE for
> 272ms...
>
> I also noticed the perf is better after launching the same -perf
> demo multiple times, maybe windows XP does some weird things (IIRC
> it somehow 'optimizes' executables it launches so maybe it's that).
>
> Or maybe it's actually because of Visual C++ Express 2008, someone
> might have made it intentionally optimize real bad :)
>

We've only done performance testing on the Decora-SSE backend using
Visual C++ 2003. From past experience, we know there can be a lot of
variation in performance caused by using different compilers, so it's
quite possible that things are different with VC 2008.

> 2) I manually multithreaded the software peer of an effect in 10
> minutes or so, so it's really hacked up with an axe. The results
> are ... how do i say this ... promising. I just modified the
> brightpass filter, which was the simplest jsl i could find (so the
> software peer would not be that complicated).
>
> I basically cut the loop in two horizontal slices as we said
> earlier, and send those to two futuretasks (i'm testing on a dual
> core). I -perfed with lots of programs running (so the results might
> vary), a modified version of BloomTest, in order to only have the
> brightpass effect there, on 2 images, the regular 512*256, and one
> 1920*1200.
>
> I also tried multiplying the -perf duration by 10, and took the
> scores after minimizing and restoring the window a lot (the scores
> are somehow different, so i took the lowest of those) in an attempt
> to warm up the soup even more...
>
> The really informal results:
> Image size, 1) Single threaded, 2) Multithreaded
>
> 512*256 1) [12.80ms - 12.85ms] - 2) [6.67ms - 7.45ms]
> 1920*1200 1) 205ms - 2) 127.7ms
>
> If i can do that in a couple of hours without understanding much of
> the architecture or the shader, i'm sure it would be a piece of cake
> for someone who does.
>

This is all very cool! The results do look promising. I would
encourage you to keep plugging away at it and see if you can
generalize your code so that it works for all effect implementations,
scales dynamically to use N threads, and has minimal overhead
(compared to the current single-threaded case). If you can make
progress in that direction, we would be happy to incorporate your work
into Decora.

It's too bad that we can't easily apply this technique to the
Decora-SSE backend. The main limiting factor is the use of
Get/ReleasePrimitiveArrayCritical() calls at the native level. I
suppose it's possible to spawn native pthreads to handle the inner
loop inside those critical sections, but it sounds dangerous and you
certainly don't have the convenience and flexibility of using Java
threading and concurrency APIs at that level.

(This is all the more reason to focus your efforts on the Java backend
for now, since our goal is to get some more help from the HotSpot
gurus on improving performance of the Decora-Java backend. After all,
the HotSpot code generator, especially C2, is in a good position to be
doing this SSE optimization for us, instead of us diving down into
native and trying to do it ourselves. More on that later.)

> I really like decora i must say, and it tickles my mind, i have a
> couple ideas that might be worth exploring: deployment, sharing,
> visualizing, i'll try those another day. For those interested,
> they're all in my twitter stream (lqd)

Sounds great... If you have any interesting conclusions, feel free to
share with this list, for the benefit of those not following you on
twitter :)

Thanks,
Chris

liquid
Offline
Joined: 2005-06-16
Points: 0

[ Which model of Athlon64 are you using, just for reference? ]

Athlon64 X2 3800+, running 32-bit Windows XP ...

[ I would encourage you to keep plugging away at it and see if you can generalize your code so that it works for all effect implementations, scales dynamically to use N threads, and has minimal overhead (compared to the current single-threaded case) ]

Yeah that was my goal :) i first tried to check the 'theory' on a simple single effect before modifying the software peers' template. Interestingly enough, i first tried to do pretty things by having a delegate function, but the perf was best when i just did it like a pig and copy pasted the whole shader body (hum) ...

I also need to test on many effect instances at once, and see what would be the threshold where you would just do the filter in the current thread instead of splitting it up.

I'll keep focusing on the java backend some more and will then set up the source/binaries for anyone who wants to help, test, etc.