| |||||||
Do you lock your front door at night? You're probably aware that a locked front door won't stop an accomplished thief for more than a few seconds. Add another lock and you slow him down another minute or two. Truth is, no matter how well you lock your doors, a determined thief is going to get in. So why do you lock your doors at all? The answer is (I think) that you've accepted the fact that a determined thief can get in--but the harder you can make it for him, the better. In fact, if it's enough of a pain, he may just go elsewhere. In addition, by locking your door you keep out the other 99 percent of the human race. From amateur thieves all the way down to your nosy Aunt Edna, every lock you add increases the time it takes for an unwanted guest to get in. For a thief that may just be a few more seconds; for Aunt Edna, it's probably billions of years. Regardless--door locks aren't about perfect protection, they're about raising the bar to entry.
Protecting software from reverse engineering works the same way. Java is wonderfully easy to effectively reverse engineer. (See the References section at the end of this article for tools that will let you do so literally at the click of a button.) You can decompile the code, crack the license, change the copyright to your name, or simply check out how the application hits the database. Reverse engineering based on compiled code is possible, but reverse engineering based on source is simply cake. The good news is that you can make the process harder--in fact, much harder. Protecting Java code raises the bar--it proverbially "adds a few locks on the door." If your code is worth protecting, there's no reason not to make the thief's job harder.
Obfuscation works to transform programs in such a way as to not affect what the program does, while at the same time impeding reverse engineering efforts. In a nutshell, obfuscating transforms have a goal of destroying relationships that exist between the compiled and source-code versions of the code. Although initially the idea of actually making program code harder to understand seemed counterintuitive, it is now seen as a valuable component in computer security.
The goals of obfuscation might not be what you think. The usual advertisement is the protection of program code. That's all well and good, but there are shades of meaning to that statement. Obfuscation is often compared to encryption, which is unfortunate, since the goals of each are rather different. Encryption is a lossless, two-way transformation that provides strong protection for hiding data. Obfuscation, on the other hand, is (usually) a lossy, one-way transformation of data (and in our context, data is primarily executable code). Obfuscation exists because encryption doesn't apply to the problem of protecting software code that will be executed. Broadening that statement, we need a methodology of protecting data that must retain its meaning while remaining in full view of all observers. Our hope is that only one observer (in our case, the Java virtual machine) will get the true message. Obfuscation is weaker in this sense because it's always circumventable (with enough persistence and know-how applied). However, it also must always be contended with. Most obfuscating transforms can never be undone--as we said, they're one-way, lossy transformations.
One goal is the prevention of any human understanding reverse-engineered code. The best you can do here is make understanding hard--it can't be foolproof. After all, a diligent human can simply disassemble the code and work from there (they can do this with Java or any other language, for that matter). Computer game programmers are experts in this area. They made huge attempts to thwart the abilities of crackers to circumvent the licensing schemes of their games. To this day they've mostly failed. However, they played on two primary factors to mitigate their losses. One is that the most important sales window for a game is typically the first two months after release. Assuming that the cracking of the game is inevitable, the goal focuses on delaying the crack at least that long.
Secondly, like any other security against humans, they play on the fact that humans are fickle and quick to tire. The most effective tactic here is to provide tiers of security. In other words, let the hacker think they have successfully cracked the game in short order. Only a day or two later (after bragging to friends) will they discover that they really haven't. Subsequent attempts will garner more and more frustration.
Another goal of code protection, at least in languages like Java, is obfuscation to
prevent recompilation. Here, obfuscation has had notable success. Simple renaming (see below)
can do a wonderful job of making a decompile-modify-recompile cycle nothing short of a
nightmare. An old trick
is to rename fields and methods to Java keywords like for, try, and catch. In compiled Java, these
keywords have no meaning. Of course, when restored back to Java source code, compilers
have little pity on variable names that collide with keywords. Advanced
obfuscation techniques such as control-flow obfuscation take the problem even further.
Finally, it sure would be nice if an obfuscator's output could simply crash reverse-engineering tools. Although this sounds formidable, it's actually surprisingly easy. This probably comes from the fact that decompilers, like most software, are relatively easy to write for the general case and often depend on the fragile and predictable relationships Java compilers create between Java source and bytecode. Throw in some special cases (which we'll see shortly) and those relationships fall apart.
Interestingly, obfuscators have evolved as full static analyzers of Java applications (as far as those applications can be statically analyzed, anyway). Because of this, additional features such as code pruning, size reduction, optimization, and watermarking have become natural extensions to obfuscation systems.
Obfuscation is certainly not perfect protection--it can't be, since some type of executable will always be visible. However, as a cheap and fast extra layer of protection, it has little downside.
The primary methodology of simple obfuscators is to rename
program identifiers according to a one-to-one mapping. That is,
if the identifier getPayroll existed within a
program, all instances of that name anywhere in the program
could be renamed to xyz. In some ways, this is still
the simplest and strongest form of pure obfuscation available.
Renaming types, methods, and fields represents a lossy
transformation of information. Of course, a computer (or
runtime) couldn't care less what you name your identifiers; it
uses them for pattern matching and nothing more.
Renaming a method from getPayroll
to xyz obviously represents a loss
of information (for humans). In one instance, it's clear what a
given function does, in another, no clues are given. Obviously,
the original programmer still has the original
getPayroll version, which is an important point;
almost exclusively, obfuscators work on compiled code, not source
code. From this vantage point, they can leave the original
authors with untouched versions of their code and provide a
protected distributable for customers. In addition, obfuscators
often provide tools or mapping files so that debugging of such
garbled method names is just as easy as it was before renaming.
An advanced type of renaming is called overload induction (OI). Instead of a simple one-to-one mapping for method renames, OI works by analyzing hierarchical relationships between methods and looks for opportunities to rename methods to equivalent names. In other words, using an OI renamer, sometimes 50 percent of all methods in an application may be renamed to exact same name. Overload induction takes advantage of the fact that methods are distinguished not only by name but also by parameter list (among other factors). Therefore, it's possible to rename two methods with differing parameter lists to the same name--effectively inducing overloading (hence the name).
The effect is dramatic. For one thing, many obfuscators are also application-packaging systems that remove dead code and prune hierarchy trees to create streamlined output. OI adds redundancy to the set of method names, reducing their total number. This saves additional space in the final version.
OI represents a loss-of-information transform, much like simpler renaming schemes. However, OI also destroys extra information involved in method relationships that is lossy (and irreversible). For example, let's say we have the following method name transformations:
public long getPayroll(int x) { } → abc
public long getPayroll() { } → xyz
public long saveState(int x) { } → xyz
public long getDate() { } → abc
The right-hand sides of the arrows represent a possible OI
renaming scheme. The interesting point is that where
getPayroll had two overloaded versions, those methods
are now renamed to different names. That is, the original
overloading relationship is gone. Also, methods that were
previously unrelated now have overloading relationships. Assuming
that method overloading is a tool that conveys information (i.e.
the two original getPayroll methods were different
implementations of the same task), that information is now
gone.
A smart decompiler can undo OI to some degree; however, it can never recreate original overload relationships. That information is lost. Regardless of claims, it's also rare that a decompiler can decompile OI code enough for it to be re-compiled. In order for a decompiler to fully undo OI renaming, it would need to run OI itself. If it doesn't, illegal method relationships tend to crop up throughout a program's hierarchy.
A colloquial term for obfuscators is "code manglers," although in truth, most obfuscators don't actually do anything to program code (much less mangle it). Obfuscators that do "mangle code" provide another level of protection. For one thing, it's usually not possible to rename identifiers in reusable class libraries. Such libraries depend upon their identifiers' names for interfacing with external programs. However, it's still possible to obfuscate at a code level by actually transforming the code itself. It's obvious that there is more than one way to do some task--it probably isn't hard to imagine that there even more ways to do it in some lower-level, intermediate representation. Much work in advanced obfuscation was pioneered by Christian Collberg currently at the University of Arizona.
Consider a bytecode transformation called transient variable caching (TVC). This transformation looks for opportunities to remove references to transient variables by utilizing the stack that the Java runtime uses. The effect is the removal of inevitably unneeded bytecode instructions. The canonical example is a simple swap of two integers:
int temp = a;
a = b;
b = temp;
This compiles into bytecode into something similar to:
push a
pop temp
push b
pop a
push temp
pop b
This example is not actual bytecode, but is representative and hopefully more understandable. Simply, each assignment is a pair of instructions that push the value of the local variable onto the stack and pop it off, storing in another.
After TVC, this sequence would appear as the much more logical:
push a
push b
pop a
pop b
Effectively, instead of the temp variable, the stack is used as
transient storage. Sounds good. Now what will a decompiler make of this? There is no
way in Java to "save the value of a on the stack for use later." Amusingly,
many decompilers decompile this code as:
a = b;
b = a;
JUnit is going to throw a hissy over that one. A decompiler writer is forced to solve a large set of problems, and adding curveballs like this only makes his job harder. In fact, some simple transforms would require full data-flow analysis to undo--that's a feature most decompilers don't have, and it isn't terribly compelling to add to counter one feature of one obfuscator.
Sometimes, even classic optimizing transforms do a fine job of
obfuscation. For another
example, consider the following for loop:
for (int g=0;g<x;++g) {
// loopbody
}
A common compiler optimization is to perform loop-unrolling.
By unrolling the loop, you can remove the loop overhead
(incrementing and comparing g, in the above case). If
the loop iterates to a small known value, then you can simply
unroll the loop that many times and remove the loop altogether.
For example, if we have 5 instead of x
in the above example, the loop could become:
// loopbody[1]
// loopbody[2]
// loopbody[3]
// loopbody[4]
// loopbody[5]
This isn't a loop at all, but we save the loop overhead.
Note we have labeled each loop body with a number, but the each
loop body is an identical piece of code. In the original example
however, we did not have a loop limit of 5; we had x,
an apparent unknown value. For this type of loop unrolling, we
can implement something called Duff's Device, which would look like:
int y = x%5;
x = x/5;
int g=0;
if (y>0) goto loopbody[6-y];
for (;g<x;++g) {
// loopbody[1]
// loopbody[2]
// loopbody[3]
// loopbody[4]
// loopbody[5]
}
By following the logic you'll see that y
represents the remainder number of iterations mod 5. Those
iterations are handled first and x
represents x*5 more iterations. So,
if before this code x=36, then y=1
and x=7 as the loop begins. The goto jumps into the
for loop at loopbody[5] and then the main loop
iterates seven more times (representing 35 more loop bodies).
The most intriguing part of this transformation is not the
performance improvement you obtain, but the fact that this code isn't
possible in a language like
Java. The problem, of course, is that Java has no
goto. Conversely, the compiled representations of
Java (bytecode) fully support actual goto instructions. Since
obfuscators work at this level, they can introduce gotos as
needed.
Because of the lack of goto in some high-level languages,
there is no way for a decompiler to succinctly transform this
code (in a general case) back to source code. If a generalized
algorithm introduced similar constructs into compiled Java code,
most decompilers wouldn't have a chance. Specifically, only one
decompiler is known to be able to decompile the above obfuscated
code correctly. It is a Java decompiler named Sourceagain, from
Ahpah software. Even then, once decompiled, the above code
sequence is decompiled into about 250 lines of source code to
represent an algorithmically correct (albeit rather dense)
decompilation.
It's actually rather cathartic writing a control-flow obfuscator. The reason is that when you create a brand new control-flow (or data-flow) transformation, no decompiler will be ready for it. As you test your new transformation through a set of decompilers, they often crash wildly. It's the obfuscators that set the stage for what is done to code. Decompilers must wait to see what comes along and play catch-up.
Another common feature in today's obfuscators is string encryption. That is, an obfuscator will encrypt all string literals within a program and then decrypt them during runtime, immediately before their use. String encryption represents one of the few popular, lossless transformations within a obfuscator.
There is a classic problem that rears its head in the act of delivering encrypted software. On one hand, you provide your customers a program so they can run it, but on the other hand, you don't want them peering into its internals. If you encrypt it, you have bought yourself little, since in order for them to run it, it will at some point need to be decrypted. In current computing systems, this problem is insurmountable. If you allow a program to be decrypted to be run, it can also be intercepted and examined. Not only must you provide the encryption key, you're providing some code to actually do the encryption. Even the strongest encryption algorithm known to man is completely useless if you provide the enemy the key.
This then begs the question, "What good is encrypting string literals within a program?" This question actually backs into a philosophical question of code protection as a whole. It is of course possible to completely reverse engineer a C program. It is far harder than reverse engineering a Java program, but it's still possible. Now, obfuscation never became a real interest for C developers. Why? The probable answer is that reverse engineering C is so difficult that that difficulty provided its own inherent protection. Only a scant few people would go to the trouble of reverse engineering such a program.
By using a simple and fast encryption algorithm to encrypt all of a program's string literals, you force a reverse engineer to take the extra step of undoing the encryption. Keep in mind that a weak and a strong encryption are arguably equal to each other in this context. Since the strings much be decrypted before runtime, the keys for decryption must exist somewhere in the program. A hacker isn't going to decrypt the strings via brute force; they'll simply examine the code until they find the key. With that in mind, obfuscators generally opt for weaker and faster encryption. String encryption is one of the few techniques that may hurt performance, so it's important that it be fast.
The biggest argument against obfuscation I've seen is that someone proclaims their software doesn't need it. That's a pretty solid argument. Certainly open source, free-to-use, or sell-with-source applications don't need to protect their source. However, it shouldn't be hard to imagine other scenarios where government or businesses need as much protection as they can get. Government regulations have dictated that certain types of encryption code be protected--that's why the JCE API in Sun's JDK is obfuscated.
As we've said, obfuscators have a good technical and business reason to stay ahead of decompilers. Given the additional features such as code-pruning, optimization, packaging, and code size-reduction, there is little reason not to obfuscate Java code. Just like makefiles, source control, and optimizers, obfuscation is destined to be a regular step in your future build processes. You can be confident that anyone trying to reverse engineer your code is going have a bad day--and without a doubt, that's the last we'll see of good old Aunt Edna.
COLLBERG: Christian Collberg, Clark Thomborson, Watermarking, Tamper-Proofing, and Obfuscation--Tools for Software Protection, IEEE Transactions on Software Engineering 28:8, 735-746, August 2002
JODE Decompiler [8]
Duff's Device [10]: Tom Duff
Game Cracks [11]
For more information on obfuscators, see preemptive.com/obfuscator.html [12].
Links:
[1] http://www.java.net/author/paul-tyma
[2] http://www.java.net/article/2004/10/18/new-obfuscation#Goals_of_Software_Protection
[3] http://www.java.net/article/2004/10/18/new-obfuscation#Renaming
[4] http://www.java.net/article/2004/10/18/new-obfuscation#Code_Manglers
[5] http://www.java.net/article/2004/10/18/new-obfuscation#Encrypted
[6] http://www.java.net/article/2004/10/18/new-obfuscation#A_Step_Ahead
[7] http://www.java.net/article/2004/10/18/new-obfuscation#References
[8] http://jode.sourceforge.net
[9] http://www.ahpah.com
[10] http://www.lysator.liu.se/c/duffs-device.html
[11] http://www.gamecopyworld.com
[12] http://www.preemptive.com/obfuscator.html