Posted by

mkarg on January 3, 2010 at 6:29 AM PST

Several APIs demand that the user is implementing the .hashCode() method. There are lots of standard implementations on the web, so the question is, what performance impact the implemenation of .hashCode() will have. I did some tests...

Several APIs demand that the user is implementing the .hashCode() method. The reason is that these APIs are using hash based containers (like HashMap) to have a fast means of managing lots of objects (always comparing objects using .equals() would need endless time). There are lots of standard implementations on the web, so the question is, what performance impact the implemenation of .hashCode() will have. I did some tests and here are the results.

My test program uses a simple, nevertheless non-trivial, "PrimaryKey" class. This class has two integers (a and b). In a loop, I created one million of those "PrimaryKeys" and .put(PK, REF) them into a presized HashMap. Then, I tried to find them again using .get(PK). Both loops will access each of the one million objects exact once, and the time for .put() and .get() is measured separately.

Using this test mockup I tried several algorithms. Here is the interesting result:

`public final int hashCode() {`

// return this.a + this.b; // 32876, 33578

// return this.a ^ this.b; // 48845, 48205

// return this.a; // 11016, 14266

// return this.b; // 49658, 49626 (worst case)

// return this.a * 65536 + this.b; // 1672, 2437 (optimum, less readable)

// return this.a + this.b * 65536; // 2063, 4063

return this.a << 16 ^ this.b; // 1672, 2453 (optimum, well readable)

// return this.a ^ this.b << 16; // 2204, 4609

}

The left number is the time for .put(), the right for .get(). As you can see, the algorithm has huge impact on performance. While the best implementation needs less *than two seconds* to write and less than three seconds to read all one million instances (what, BTW is fascinating, since this is Java - the language that everybody calls "slow"), a typically used algorithm like return "this.a ^ this.b;" needs nealry fifty (!) seconds. What a difference! Not to mention the nearly endless time that a simple (and idiotic) "return 0;" will produce... (I cancelled that one after several minutes. Yes: MINUTES!).

Another interesting point is that there is no *general* optimum: Compare the last two algorithms and you will notice that both are essentially the same. While the first one (the optimum) is shifting a, the seconds one is shifting b. The answer is quite simple: Since our numbers range from 0 to 1000, shifting a leads to a strongly increased spread, while shifting b leads to weak spread increase (and the spread is what brings the performance). As the difference is 100% between those algorithms, this shows that the better one knows its data, the better the implementation can be optimized (and as a result, there is no general optimum).

Conclusion: There is no generally optimal solution, but "return this.a << 16 ^ this.b;" will be optimal in the most cases.

Another question is performance of Strings. Often the above algorithm cannot be used since the "PrimaryKey" class contains Strings. So what to do then? I changed my test to provide some more numbers.

The test program was changed only slightly: While the general mockup is exactly the same, just the implementation of the "PrimaryKey" class is changed to internally store not integers but Strings (and certainly, .hashCode() and .equals() are modified).

Using that changed mockup I tried several String algorithms. Here is the result:

`public final int hashCode() {`

// return this.a.hashCode() << 16 + this.b.hashCode(); // 36235, 35642 (worst case)

// return (this.a + this.b).hashCode(); // 1360, 250 (skew!)

// return (this.b + this.a).hashCode(); // 1515, 375 (skew!)

// return (this.a + '|' + this.b).hashCode(); // 1156, 485 (good)

// return (this.a + "|" + this.b).hashCode(); // 1156, 485 (good)

return (this.a + "|~" + this.b).hashCode(); // 1156, 266 (optimum?)

}

I started with the optimal solution of the integer test and was shocked to see that it is actually the worst case for Strings. The reason seems to be that a call to String.hashCode() (possibly any method of any class?) needs some fixed base time independent of the String's length. And that time seems to be very huge. Adding the Strings before calling .hashCode() works better, but actually is highly dependent of your data (since "1" + "11" will be the same than "11" + "1", what might lead to skew spread and (as a result) very indeterministic results (you shouldn't do that in reality). To prevent this problem, the typical solution is to add a "seldomly used character" (mostly the pipe symbol) between both, to get "1|11" and "11|1" (both having different hash codes). The test proofs that the result still is good (while there is no test to proof for the skew, since this was out of scope of my tests). But what really funny is (and what nobody can every explain): Adding *two* "seldomly used" characters further improves performance! I hardly couldn't believe it, but it is true as you can see above. But be warned: That is just by pure incidence I think. I would nobody give the tip to add another character without having a good explanation for it. So the well known String algorithm (adding a pipe symbol) works pretty well, but due to the actual data you are using it might not be the optimal one.

Conclusion: There is no generally optimal solution, but "return (this.a + '|' + this.b).hashCode();" should provide very good results in most cases.