# Trig performance

I have a bunch of math-intensive code that runs slowly when the inner loops contain calls to Math.sin & Math.cos (much more slowly than without the trig calls and much more slowly than corresponding C++ code). Is there any way to speed up these calls significantly? I'm running Java version 1.5.0 (build 1.5.0_04-b05) on Windows XP. Thanks, - Conal

What's the current status of this matter?

Are there any plans to fix it, or is Java doomed to truly piss-poor trig performance forever?

If Sun wants to exclude Java as a *competitive* 3D gaming platform through blatant design decisions, that's really silly.

Since this thread was last active, Java has been open sourced. So you can now attempt a fix yourself. It wouldn't even be hard --- define a new FastMath class which is the same as the existing Math class except that HotSpot is modified so that the generated intrinsic code for the trig methods omits the argument reduction step.

It might be a good idea to locate the new class in a package like javax.math rather than java.lang. That way it would be less likely to be used inappropriately by newbies.

What I keep trying to say (and the signal apparently keeps getting lost) is that these "worthless" hardware routines have been in use for upwards of 10 years for display transformations of 2d and 3d data in production software in use all over the world. Clearly they have worth, and clearly they have a lot more worth than simply returning 1 or 0.

Ironically, while I will not argue that the java math definitions are far more accurate and controlled, they are also "pricing themselves" out of the market right now. Routines that take 3-5 times as long to render an image to screen are not going to get used on a system that is (according to the users) "never fast enough". The overheads of JNI calls for fine grained routines like this are prohibitive, and the only option left is to use any one of the other platforms that do allow fast and light access to the hardware routines.

So which has more worth? Routines that have proved they are good enough, and fast enough, for upwards of 10 years in a real world application, or routines that for this application are slow enough to force a developer to avoid them?

This would not be an issue if there were any way in Java for a developer to harness these routines. We don't look forward to arguing about it; we just want to get in and do it. But JNI has too many overheads, licensing prevents alteration of the JVM itself for redistribution, and lookup tables have been tried: in order to improve on the observed accuracy of the hardware trig routines they have proved to be no faster, and they waste a lot of memory, not to mention the setup time.

As a developer, and representing other developers that know far more about it than I do, what is the course of action to get trig as fast as the hardware routines, and as accurate as the hardware routines in java without lots of memory usage and setup time? So far, the only answer that fits is use another platform than Java.
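For context on the lookup-table point above, a typical table-with-interpolation approach looks something like the sketch below. The class name, table size, and reduction scheme are illustrative assumptions, not the poster's actual code; a 4096-entry table with linear interpolation is only accurate to roughly 3e-7 and costs about 32 KB plus setup time, which matches the memory/setup complaint.

```java
// Hypothetical sine lookup table with linear interpolation.
// The table size and interpolation scheme are illustrative assumptions.
public class TableSin {
    private static final int SIZE = 4096;
    private static final double STEP = 2 * Math.PI / SIZE;
    private static final double[] TABLE = new double[SIZE + 1];

    static { // the setup cost the poster mentions
        for (int i = 0; i <= SIZE; i++) {
            TABLE[i] = Math.sin(i * STEP);
        }
    }

    public static double sin(double x) {
        // Naive reduction into [0, 2*pi); loses accuracy for very large |x|.
        double r = x % (2 * Math.PI);
        if (r < 0) r += 2 * Math.PI;
        int i = (int) (r / STEP);
        if (i >= SIZE) i = SIZE - 1; // guard against rounding at the top edge
        double frac = (r - i * STEP) / STEP;
        // Linear interpolation between adjacent table entries.
        return TABLE[i] + frac * (TABLE[i + 1] - TABLE[i]);
    }

    public static void main(String[] args) {
        System.out.println(sin(1.0) + " vs " + Math.sin(1.0));
    }
}
```

With this table size the maximum interpolation error is about STEP²/8 ≈ 3e-7, i.e. roughly the 5-6 decimal places the thread attributes to the hardware routines, but without their speed.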

What do you have to do to get faster math?

Get an appropriate JSR through the Java community process. There seem to be two main options for the change:

a) Relax the specification on the existing Math methods.

b) Propose a new class, say PlatformMath, that has relaxed specifications.

I doubt that you could get approval for (a). Which leaves the option of creating a third set of math methods, which may not go down well either. Nor is it likely to be approved without some sort of specification, but that could be created during the process. It is worth noting that a number of other JSRs aimed at improving math performance have failed; in at least one case allegedly due to lack of interest.

> Honestly, from my point of view, I'd like to be able to say:
>
> Math instance = Math.getInstance(precision)
>
> where precision would specify how many digits of precision I am interested in. That would solve everyone's requirements ;) Would this be hard to implement?

I don't think such a method would be helpful in this case.

First, very few people are able to do the sort of numerical error analysis needed to use such a method responsibly. The error in a computation is often not a simple function of the errors in its component computations and the analysis can be extremely tedious.

Second, the cost of sin/cos for large values is dominated by the cost of argument reduction, not evaluating the sin/cos function per se. (In argument reduction, input values over a wide range are mapped to an equivalent value in a chosen range. Roughly this means mapping an input to a single period of sin/cos, but symmetries are used to make the actual target range smaller.) Therefore, the systematic error in fsin/fcos is not well-captured by the number of bits (or digits) of precision at a particular point. To sanction fsin/fcos the precision may need to be set to zero, which would license many undesirable alternatives.

Third, accuracy and performance are not the only figures of merit for a floating-point approximation; other consistency properties such as monotonicity can be paramount but difficult to find or specify. The benefit of using correctly rounded or nearly correctly rounded math functions is that whatever known (and unknown!) consistency properties of the underlying mathematical function a code might be relying on ( |sin(x)| <= 1, etc.) are most likely to be satisfied the more accurate an approximation is.
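The argument-reduction point is easy to illustrate. In the sketch below (class and method names are hypothetical), reduction is done naively with the double constant 2*Math.PI, which differs from the true period by about 2.4e-16; the reduced angle therefore drifts by roughly (x / 2pi) * 2.4e-16 radians, which is already several radians of error near x = 1e17.

```java
// Demonstrates the precision loss in naive argument reduction.
// Java's % on doubles is computed exactly, but 2*Math.PI is itself a
// rounded constant, so the error grows with the number of discarded periods.
public class NaiveReduction {
    public static double naiveSin(double x) {
        return Math.sin(x % (2 * Math.PI)); // accurate only for modest |x|
    }

    public static void main(String[] args) {
        // For small arguments no periods are discarded, so it agrees exactly.
        System.out.println(naiveSin(1.0) == Math.sin(1.0));

        // ~1.6e17 periods are discarded: the reduced angle is tens of
        // radians off, so the result is essentially unrelated to sin(1e18).
        System.out.println(naiveSin(1e18));
        System.out.println(Math.sin(1e18));
    }
}
```

This is why a compliant implementation needs a much wider approximation to pi (the ~1000 bits mentioned elsewhere in this thread) for large arguments.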

I just think it's funny that we developers just want a really, really fast sin/cos (probably for some simple but fast thing we need to do) and we all get stuck in a complicated discussion about this and that.

Can't someone just say Yes or No and Fix it or Not fix it?

It's always this thing where the current responsible developers have a lot of pride and start off on the defensive, explaining why this and that decision was correct. And it almost always was, but the world doesn't stand still, and old decisions get to be just that: old.

Less talk, more Do It; that's my motto at least. Now, less talk, more coffee (another motto).

Cheers,

Mikael Grev

Developers need first to understand what it is reasonable to request. Given their specification, the existing StrictMath and Math methods are now (Mustang) about as fast as they are going to get. If you want faster, you need to request new methods (either in an existing class or in a new 'PlatformMath' class) with a different specification that permits a direct implementation by instructions like the x86 fsin instruction.

To date there have been bug reports requesting that the Math.sin (etc) methods conform to the specification --- resolution of this produced the much slower methods in 1.4. Then there have been bug reports requesting that the trig functions be faster, and they have done that too.

Developers now need to contribute to deciding what would be an acceptable specification. Would it be OK to implement sin with a domain of only ±pi/4 if that was what the platform provided? If not, what should be the minimum guaranteed domain? What minimum precision do you require (and how will you specify it)?

I think I will butt out of these discussions because I am no longer a stakeholder ;) A while back one of my projects depended heavily on sin/cos performance, but I've since moved to a new project. I can't discuss what level of precision or portability was important for that project because I have no one to ask (I no longer work for this company).

It would be good to hear from real stakeholders with practical requirements, not ideological ones like I had ;)

Gili

> But if there is no longer any difference between Math

> and StrictMath's performance, what is the point of

> keeping both around? Shouldn't we deprecate one?

There is a difference between the specification of Math and StrictMath. The interesting StrictMath methods are required to use a particular algorithm, fdlibm, to ensure cross-platform reproducibility. Previously this was the definition for the Math class, but since developers needed higher-performance mathematical methods when full reproducibility wasn't required, the Math/StrictMath dichotomy was introduced in JDK 1.3. The fdlibm algorithms are pretty good in terms of speed and accuracy, but they are not new, and algorithms that are better in various ways have been developed in the intervening 10+ years since fdlibm was written.

The fdlibm StrictMath routines also meet all of the general quality of implementation criteria required by the corresponding Math method. In other words, it is always legal to implement Math.foo by calling StrictMath.foo, and the source code for the Math class reflects this. However, a JDK implementation is free to use a different algorithm for Math methods, as long as the quality of implementation criteria are met. (Sun's virtual machines have long optimized various Math methods as intrinsics, so the source in java.lang.Math may not fully represent how that code is executed at runtime.) Whether or not faster (or more accurate) Math algorithms are used is a quality of implementation issue for that JDK. In Tiger and Mustang, we at Sun have optimized more of the Math methods than in previous releases.

Since Math and StrictMath serve different purposes, neither one should be deprecated in favor of the other.

> Although Math.sin is less strict than StrictMath it

> still has a performance guarantee.

The above statement is true if "performance" is replaced by "correctness".

> Unfortunately the

> x86 fsin instruction can't meet even this relaxed

> requirement.

Correct; fsin and fcos cannot directly implement an accurate sin/cos function over the entire range of floating-point values. Intel's manuals discuss how to use fsin/fcos as a core to implement a full sin/cos method. One way is to use fsin/fcos in [-pi/4, pi/4] and do your own argument reduction outside this region; this is implicitly required by the current Math specification. Another way is to do argument reduction with the same approximation to pi as the fsin/fcos instructions use outside of the [-2^63, 2^63] region where these instructions produce useful results.

> The older implementations used fsin

> directly because the JVM implementors were not then

> aware of the defect.

Correct; several years ago we had a sin/cos implementation that used fsin/fcos in a way that violated the specification. We have since fixed that problem.

> I don't think it would now be

> appropriate to further relax the specification for

> Math.sin etc.

> Perhaps a new FastMath class could be considered, but

> what specification should be required (if any). How

> bad would FastMath.sin be allowed to be and what

> domain should it support for its argument?

Yes, if a "fast sin" method with further relaxed requirements were to be introduced, I think it should either be in a FastMath class or a separate function in Math/StrictMath. As you note, one difficulty is enumerating what implementation criteria should be used. Would "return 0" be okay for sin and "return 1" be okay for cos?

Honestly, from my point of view, I'd like to be able to say:

Math instance = Math.getInstance(precision)

where precision would specify how many digits of precision I am interested in. That would solve everyone's requirements ;) Would this be hard to implement?

Gili

> Math instance = Math.getInstance(precision)

>

> where precision would specify how many digits of

> precision I am interested in. That would solve

The trouble is that for some values the x86 instructions achieve almost no precision at all (if measured in the way used by the Java specification). The result of

Math.sin(Math.PI)

effectively computes the difference between Math.PI (which is pi rounded to the nearest double value) and the approximation of pi used by the argument reduction mechanism. So we are subtracting two almost identical values and comparing against the result that would be obtained by using infinite precision.

If instead we were to measure the error in terms of the argument supplied, then the precision obtained by fsin looks very much better (although there is still the issue of the ±2^63 limit on the argument value). So you need to consider not just the precision and how that might be specified but also the range of arguments accepted by each method.

Although Math.sin is less strict than StrictMath it still has a performance guarantee. Unfortunately the x86 fsin instruction can't meet even this relaxed requirement. The older implementations used fsin directly because the JVM implementors were not then aware of the defect. I don't think it would now be appropriate to further relax the specification for Math.sin etc.

Perhaps a new FastMath class could be considered, but what specification should be required (if any). How bad would FastMath.sin be allowed to be and what domain should it support for its argument?
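The Math.sin(Math.PI) example above is easy to check: Math.PI is pi rounded to the nearest double (which happens to lie just below pi), so a correctly rounded sine returns, to first order, exactly that rounding error rather than zero.

```java
// Math.PI is not pi; it is pi rounded to the nearest double, about
// 1.22e-16 below the true value. A correct sin therefore returns
// sin(pi - delta) ~= delta, the rounding error itself.
public class SinOfPi {
    public static void main(String[] args) {
        double r = Math.sin(Math.PI);
        System.out.println(r); // ~1.22e-16, not 0.0
    }
}
```

An implementation that reduced the argument with a 66-bit approximation to pi would return a value with almost no correct bits here, which is the "almost no precision" case described above.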

But if there is no longer any difference between Math and StrictMath's performance, what is the point of keeping both around? Shouldn't we deprecate one?

Gili

There is still a difference between Math and StrictMath. StrictMath will always produce exactly the same result on any machine --- bit identical. The results of Math may differ from the exact result by typically +-1ulp (see the specification for the actual value).

StrictMath: bit identical (but not necessarily exact) results produced by a specified algorithm. Usually slow.

Math: Differs from the exact result by no more than a specified tolerance. Still requires use of up to ~1000 bits of PI to reduce trig arguments. Intel x86 only uses 66 bits of PI.

PlatformMath: PI == 3. Very fast but may not provide the answer you expect.
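The practical difference between the first two can be probed directly. Since Math.sin must be within 1 ulp of the exactly rounded result, and StrictMath.sin (fdlibm) is also within 1 ulp, the two can never disagree by more than about 2 ulps; this is a small spot check of that bound, not an exhaustive test.

```java
// StrictMath.sin: bit-identical fdlibm result on every platform.
// Math.sin: free to use a faster algorithm, but must stay within
// 1 ulp of the exactly rounded result (see the Javadoc).
public class MathVsStrict {
    public static void main(String[] args) {
        double maxUlps = 0;
        for (double x = -10.0; x <= 10.0; x += 0.125) {
            double strict = StrictMath.sin(x);
            double fast = Math.sin(x);
            if (fast != strict) {
                maxUlps = Math.max(maxUlps,
                        Math.abs(fast - strict) / Math.ulp(strict));
            }
        }
        // On most platforms the two agree bit-for-bit; the spec only
        // guarantees agreement to ~2 ulps.
        System.out.println("max observed difference: " + maxUlps + " ulps");
    }
}
```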

jddarcy wrote:

> The underlying challenge is that the specification for the trig methods requires us to get more or less the right answer :)

I love Java's commitment to basing sin & cos on math rather than on popular hardware. Are there compelling reasons not to provide [i]additional[/i] static methods that agree with fsin/fcos in speed and value? They could be called "fsin" & "fcos" (or whatever), leaving "sin" & "cos" for functions that agree, within an ulp, with the classic mathematical functions.

I would be against adding new APIs because, from a consistency point of view, it is confusing. One question though: why does the non-strict Math library have to return the correct value? I believe prior to Java 1.4 it used to return the CPU result; why was this changed?

StrictMath should definitely remain accurate and portable, regardless of the CPU, but Math maybe is another story.

cowwoc wrote

> I would be against adding new APIs because from a consistency point of view it is confusing.

I'm for the principle of consistency, and I don't get how it applies here. Would you please explain?

Changing the [i]meaning[/i] of a function, or changing its interface, or having sin more stringently accurate than cos, would be a "consistency" break as I use the word. Adding a new function with a different meaning keeps consistency in the sense I understand and value.

Your recommendation about StrictMath vs Math seems a fine alternative to me personally. I'm guessing, though, that there are stronger objections to that solution than to a backward-compatible extension as I was suggesting. I'd like to hear what those objections are, but more than that, I'd really like to get fast access to fsin & fcos.

Cheers, - Conal

As far as I know, the Math vs StrictMath solution is the ideal approach because Java 1.3's Math provided fast fcos/fsin like you requested. If anything, Java 1.4 broke backwards compatibility and we'd simply be undoing that change. I believe the intent in Java 1.4 was to fix StrictMath (which wasn't returning accurate results) but in the process they mistakenly "broke" Math. This wasn't undone because no backwards compatibility was really broken (performance was worse, but method output was even more accurate than before) and a performance regression tends to get low priority in BugParade.

Anyway, I think it's about time we fix this performance regression. Unless I've missed something?

alanstange wrote:

[i]Give the mustang builds a try. Many of the Math.* routines have been "intrinsified" for extra performance.

Please try it and post the numbers comparing the 3 systems (1.5, C++ and 1.6)[/i]

Thanks much for this tip. I'm happy to say that the speedup for the Mustang server JVM is quite considerable. In one example with trig in the inner loop, the speed went up from 360K to 970K pixels/second. While I'm still far from the C++ speed, it's a big help. There's a bit more detail [url=http://conal.net/Pajama/timings]here[/url].

However, getting the speedup required using the [i]server[/i] JVM, and my application is definitely for running on clients in web pages. So I don't know if the server JVM improvements are of any help.

Hi again,

> However, getting the speedup required using the

> [i]server[/i] JVM, and my application is definitely

> for running on clients in web pages. So I don't know

> if the server JVM improvements are of any help.

Not really, since setting the server HotSpot as the default requires the user to 1) install the JDK and 2) change some settings in the applet control panel :(

The only thing I know is that for Dolphin (Java 7.0, sometime in 2007) a new compiler is planned which will merge the client and server JVMs, bringing the best of both together (and making the footprint larger, as always ;) ).

lg Clemens

Concerning trig, could someone give a clearer description of what Mustang has done to be faster and yet maintain backwards compatibility?

For requirements, really, what's wrong with simply saying "implement a pass-through to the hardware routines"? If these are to be exposed through a third library, that is all that is required. Mark them with warnings about platform dependency, inaccuracy, and as many surgeon-general health warnings as you want, but implement them with as little surrounding code as possible. The idea is to give a fast and light way to access the hardware routines; any kind of range checking, argument reduction, or anything else will just slow things down.

Java is the only popular platform which does not allow direct access to these routines in some form (not including JNI which also makes the whole operation too slow). Make the default math and strict math have the checks, regulations and limits, and just implement the fastmath library as a straight through call to the hardware routines. Specs and requirements for the calls will coincide with whatever the hardware routines can do.

And I do wish that comments like "sin always returns 0, cos always returns 1" would be left out. These are the worst kind of propaganda. I know many people, and have worked for several companies, for whom the accuracy of the hardware routines is just fine. They are not random; they are accurate to at least 5-6 decimal places in the worst cases I have seen (usually much better), and in the cases where they are used the differences are too small to be seen by the human eye (for 2D and 3D display transformations, for example). It's worth remembering that most real numbers can only be approximated in a finite mantissa anyway.

Please just make them call the hardware routines, skip any checking, and work exactly as the hardware ones do.

> For requirements, really what's wrong with simply

> saying "implement a pass-thru to the hardware

> routines"?

Because some hardware has a very limited implementation of trig functions. These instructions are only a partial implementation of a sensible trig method. The original x86 floating point processors only had an FPTAN instruction, which required an argument in the range ±pi/4. The more useful FSIN and FCOS instructions were first implemented by the 80387 coprocessor. This would only be of historical interest if it weren't for various game console processors, mobile phones, etc. which may also have relatively limited hardware floating point instructions.

> And I do wish that comments like Sin always returns

> 0, Cos always returns 1 would be left out. These are

> the worst kind of propaganda, I know many and have

> worked for several companies for whom the accuracy of

> the hardware routines is just fine.

While this may be true of the most popular general purpose processors, Java also gets used on more restricted hardware. Methods without a specification are every bit as worthless as sin returning 0 for all arguments. What most people are probably arguing for is something that allows the current Intel fsin/fcos instructions used without actually mentioning Intel in the specification. Why isn't there a big call for float versions of sin/cos? Largely I suspect because that makes no difference on x86.

Why not just use fsin/fcos (or whatever fast hardware version is available) on the platforms where it's good enough and revert to Math.XXX where there's a very bad implementation?

Because;

Then the math guys say "Hey, what exactly do you mean by 'good enough'?", followed by a sentence that shows off their math skills in good geek manner, and the original poster gets a face that says "Oh my God, can't he figure that out as he goes along, or does he need all the specs up front, or what?".

Seriously, it's easily done; heck, a freshly baked CS grad could do it in short order in a way that would satisfy 99% of the developers wanting this, and the other 1% can happily use Math.XXX.

Cheers.

Conal:

Sorry for the delay in responding...

I've asked one of our floating point gurus for guidance on the trig questions... I know that many trig optimizations went into Tiger.

As far as the other thread is concerned, the bounds check elimination should be happening... I'm expecting one of the other JVM engineers to clarify whether hoisting out the bounds checking is (or will be) in a 5.0 update release or only in Mustang. In any case you are definitely getting bounds check elimination in your Mustang test above.

I understand that requiring the server compiler is more heavyweight for desktop applications. However, yours seems to be an intensive app that really wants the server compiler (advanced optimizations). We have found, for example, that after the JVM is warmed up, big desktop apps like Looking Glass benefit from the server compiler.

Perhaps downloading the JDK would be easier if you launched the app with Java Web Start? Alternatively, you may need to have documentation (a wizard?) to help users set command line arguments for Java Plug-in.

HTH,

--Tom

> I've asked one of our floating point gurus for guidance on the trig questions... I know that many trig optimizations went into Tiger.

The underlying challenge is that the specification for the trig methods requires us to get more or less the right answer :-) As a consequence, especially for large arguments, we have to do more work than some other environments choose to do. Gosling blogged about this subject earlier this year:

http://blogs.sun.com/roller/page/jag?entry=transcendental_meditation

http://blogs.sun.com/roller/page/jag?entry=transcendental_two

More background can be found in the evaluation of bug 5005861 "native libraries for java.lang.Math should be non-strict."

For many releases we have used the x87 hardware fsin/fcos instructions where appropriate; over time we have been reducing the overhead to access this functionality via the math library.

Why can't we use the CPU result for those CPUs which return a correct value and the current implementation's safe result for those CPUs known to return bad values? Right now it seems that we're applying the slower algorithm to them all.

> Why can't we use the CPU result for those CPUs which

> return a correct value and the current

> implementation's safe result for those CPUs known to

> return bad values? Right now it seems that we're

> applying the slower algorithm to them all.

On what CPUs do you think the slow algorithm is being used unnecessarily?

According to http://blogs.sun.com/roller/page/jag?entry=transcendental_meditation the K5 and Itanium processors return correct values (and if I understand correctly, all newer Pentium CPUs as well).

As I understand it the newer pentium CPU have not been 'fixed'. So in the x86 arena only the K5 returns the value required by Java.

> As I understand it the newer pentium CPU have not

> been 'fixed'. So in the x86 arena only the K5 returns

> the value required by Java.

Correct.

Even on the K5 a "naked" fsin/fcos is not sufficient for a compliant Java sin/cos implementation, because those instructions only return a usable result in the range [-2^63, 2^63] while the range of double extends far beyond that. The sin/cos functions are perfectly well-defined mathematically on arbitrarily large finite values.

For the x87 fsin/fcos instructions generally, they can be used directly for accurate sin/cos in the range [-pi/4, pi/4] because in that region argument reduction is not needed. The unacceptable error in these instructions is introduced by suboptimal argument reduction. The [-pi/4, pi/4] range includes about half of all floating-point values, so in some sense it is a large region. In this range, Java should perform very nearly the same as other environments. Some sin/cos microbenchmarks sample much more frequently from values larger in magnitude, where a more expensive argument reduction process is needed to get an accurate answer. If that sampling doesn't match your application, the reported performance of Java sin/cos will be skewed.
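The full-range requirement is observable from Java code: unlike a bare fsin, which gives up outside [-2^63, 2^63] (it sets a status flag and leaves the operand unchanged), Math.sin must return a meaningful value in [-1, 1] for any finite double.

```java
// Java's Math.sin is defined over the entire finite double range,
// while x87 fsin only produces results for |x| < 2^63 (~9.2e18).
public class FullRange {
    public static void main(String[] args) {
        double huge = 1e300; // far beyond the fsin operand limit
        double r = Math.sin(huge);
        System.out.println(r); // a genuine sine value in [-1, 1], not NaN
    }
}
```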

Give the mustang builds a try. Many of the Math.* routines have been "intrinsified" for extra performance.

Please try it and post the numbers comparing the 3 systems (1.5, C++ and 1.6)

There's not a lot you can do in the 1.5 world. The issue is that the Java FP operations are defined to behave in a certain way. C/C++ programs can violate IEEE754 FP semantics in small ways which Java can't.

I ran similar trig tests myself. Aside from the obvious performance difference of trig functions under C++ and Java I noticed another really silly problem with Java: HotSpot does not extract constants from inner loops.

For example, if I had a loop:

[code]
int[][] myArray; /* pre-initialized */
int sum = 0;
for (int i = 0; i < 100000; ++i)
{
    sum += myArray[0][i];
}
[/code]

The above would always execute slower than:

[code]
int[][] myArray; /* pre-initialized */
int sum = 0;
int[] const1 = myArray[0]; /* row reference hoisted out of the loop */
for (int i = 0; i < 100000; ++i)
{
    sum += const1[i];
}
[/code]

Similarly, if you replace the array with a simple constant declaration, the code would always execute faster if the constant was declared outside the loop than inside, which seems silly to me. I'd expect HotSpot to be smart enough to identify constants inside loops and hoist them so they are only evaluated once. This caused major performance differences last time I checked.

Gili


Some hardware routines may have very poor implementations of trig functions, but if you provide functions that have a contract and improve on the hardware routines in some way, such as mapping the argument into the range from -pi/4 to +pi/4, then there is no way for a user to write a wrapper around them that gets the true platform number.

On the other hand, if you provide the raw numbers and they don't like it for whatever reason, they can easily write a wrapper that fixes the restricted domain of the platform functions, or fixes anything else about it.

On the gripping hand, if anyone doesn't care for these raw, native functions, you can always just point them to Math and StrictMath. It isn't as if anyone is proposing doing away with those.