Do you lock your front door at night? You're probably aware that a locked
front door won't stop an accomplished thief for more than a few seconds. Add
another lock and you slow him down another minute or two. Truth is, no matter
how well you lock your doors, a determined thief is going to get in. So why
do you lock your doors at all? The answer is (I think) that you've accepted
the fact that a determined thief can get in--but the harder you can make it
for him, the better. In fact, if it's enough of a pain, he may just go elsewhere.
In addition, by locking your door you keep out the other 99 percent of the human race. From amateur
thieves all the way down to your nosy Aunt Edna, every lock
you add increases the time it takes for an unwanted guest to get in. For a
thief that may just be a few more seconds; for Aunt Edna, it's probably billions
of years. Regardless--door locks aren't about perfect protection, they're about
raising the bar to entry.
Protecting software from reverse engineering works the same way. Java is wonderfully
easy to effectively reverse engineer. (See the References section at the end of this article for tools that will let
you do so literally at the click of a button.) You can decompile the code, crack the license,
change the copyright to your name, or simply check out how the application hits the database.
Reverse engineering based on compiled code is possible, but reverse engineering based
on source is simply cake. The good news is that you can
make the process harder--in fact, much harder. Protecting Java code raises the
bar--it proverbially "adds a few locks on the door." If your code is worth
protecting, there's no reason not to make the thief's job harder.
Obfuscation works to transform programs in such a way as to
not affect what the program does, while at the same time impeding
reverse engineering efforts. In a nutshell, obfuscating
transforms have a goal of destroying relationships that exist
between the compiled and source-code versions of the code.
Although initially the idea of actually making program code
harder to understand seemed counterintuitive, it is now
seen as a valuable component in computer security.
The Goals of Software Protection
The goals of obfuscation might not be what you think. The usual advertisement
is the protection of program code. That's all well and good, but there are
shades of meaning to that statement. Obfuscation is often compared to encryption,
which is unfortunate, since the goals of each are rather different.
Encryption is a lossless, two-way transformation
that provides strong protection for hiding data. Obfuscation, on the other hand,
is (usually) a lossy, one-way transformation of data (and in our context, data
is primarily executable code). Obfuscation
exists because encryption doesn't apply to the problem of protecting software
code that will be executed. Broadening that statement, we need a methodology
of protecting data that must retain its meaning while remaining in full
view of all observers. Our hope is that only one observer (in our case,
the Java virtual machine) will get the true message.
Obfuscation is weaker in this sense because it's
always circumventable (with enough persistence and know-how applied).
However, it also must always be contended with. Most obfuscating transforms
can never be undone--as we said, they're one-way, lossy transformations.
One goal is the prevention of any human understanding reverse-engineered
code. The best you can do here is make understanding hard--it can't be foolproof.
After all, a diligent human can simply disassemble the code and work from there
(they can do this with Java or any other language, for that matter). Computer game
programmers are experts in this area. They made huge attempts to thwart the abilities
of crackers to circumvent the licensing schemes of their games. To this day they've
mostly failed. However, they played on two primary factors to mitigate their losses.
One is that the most important sales window for a game is typically the first two months after release.
Assuming that the cracking of the game is inevitable, the goal focuses on delaying
the crack at least that long.
Secondly, like any other security against humans, they play on the fact that
humans are fickle and quick to tire. The most effective tactic here
is to provide tiers of security. In other words, let the hacker think they have
successfully cracked the game in short order. Only a day or two later (after bragging
to friends) will they discover that they really haven't. Subsequent attempts will
garner more and more frustration.
Another goal of code protection, at least in languages like Java, is obfuscation to
prevent recompilation. Here, obfuscation has had notable success. Simple renaming (see below)
can do a wonderful job of making a decompile-modify-recompile cycle nothing short of a
nightmare. An old trick
is to rename fields and methods to Java keywords like for, try, and catch. In compiled Java, these
keywords have no meaning. Of course, when restored back to Java source code, compilers
have little pity on variable names that collide with keywords. Advanced
obfuscation techniques such as control-flow obfuscation take the problem even further.
Finally, it sure would be nice if an obfuscator's output could simply crash reverse-engineering tools. Although this sounds formidable, it's actually surprisingly easy.
This probably comes from the fact that decompilers, like most software, are
relatively easy to write for the general case and often depend on the fragile and
predictable relationships Java compilers create between Java source and bytecode.
Throw in some special cases (which we'll see shortly) and those relationships fall apart.
Interestingly, obfuscators have evolved as full static analyzers of Java applications (as
far as those applications can be statically analyzed, anyway). Because of this, additional features
such as code pruning, size reduction, optimization, and watermarking have become natural
extensions to obfuscation systems.
Obfuscation is certainly not perfect protection--it can't be, since some type of
executable will always be visible. However, as a cheap and fast extra layer of protection,
it has little downside.
Renaming is Old School
The primary methodology of simple obfuscators is to rename
program identifiers according to a one-to-one mapping. That is,
if the identifier getPayroll existed within a
program, all instances of that name anywhere in the program
could be renamed to xyz. In some ways, this is still
the simplest and strongest form of pure obfuscation available.
Renaming types, methods, and fields represents a lossy
transformation of information. Of course, a computer (or
runtime) couldn't care less what you name your identifiers; it
uses them for pattern matching and nothing more.
Renaming a method from getPayroll
to xyz obviously represents a loss
of information (for humans). In one instance, it's clear what a
given function does, in another, no clues are given. Obviously,
the original programmer still has the original
getPayroll version, which is an important point;
almost exclusively, obfuscators work on compiled code, not source
code. From this vantage point, they can leave the original
authors with untouched versions of their code and provide a
protected distributable for customers. In addition, obfuscators
often provide tools or mapping files so that debugging of such
garbled method names is just as easy as it was before renaming.
An advanced type of renaming is called overload
induction (OI). Instead of
a simple one-to-one mapping for method renames, OI works by
analyzing hierarchical relationships between methods and looks
for opportunities to rename methods to equivalent names. In other
words, using an OI renamer, sometimes 50 percent of all methods in an
application may be renamed to exact same name.
Overload induction takes advantage of the fact that
methods are distinguished not only by name but also by parameter
list (among other factors). Therefore, it's possible to rename
two methods with differing parameter lists to the same name--effectively inducing overloading (hence the name).
The effect is dramatic. For one thing, many obfuscators are
also application-packaging systems that remove dead code and
prune hierarchy trees to create streamlined output. OI adds
redundancy to the set of method names, reducing their total number.
This saves additional space in the final version.
OI represents a loss-of-information transform, much like
simpler renaming schemes. However, OI also destroys extra
information involved in method relationships that is lossy
(and irreversible). For example, let's say we have the following
method name transformations:
public long getPayroll(int x) { } → abc
public long getPayroll() { } → xyz
public long saveState(int x) { } → xyz
public long getDate() { } → abc
The right-hand sides of the arrows represent a possible OI
renaming scheme. The interesting point is that where
getPayroll had two overloaded versions, those methods
are now renamed to different names. That is, the original
overloading relationship is gone. Also, methods that were
previously unrelated now have overloading relationships. Assuming
that method overloading is a tool that conveys information (i.e.
the two original getPayroll methods were different
implementations of the same task), that information is now
gone.
A smart decompiler can undo OI to some degree; however, it can
never recreate original overload relationships. That information
is lost. Regardless of claims, it's also rare that a decompiler
can decompile OI code enough for it to be re-compiled. In order
for a decompiler to fully undo OI renaming, it would need to run
OI itself. If it doesn't, illegal method relationships tend to
crop up throughout a program's hierarchy.
Code Manglers
A colloquial term for obfuscators is "code manglers," although
in truth, most obfuscators don't actually do anything to program
code (much less mangle it). Obfuscators that do "mangle code"
provide another level of protection. For one thing, it's usually
not possible to rename identifiers in reusable class libraries.
Such libraries depend upon their identifiers' names for
interfacing with external programs. However, it's still possible
to obfuscate at a code level by actually transforming the code
itself. It's obvious that there is more than one way to do some
task--it probably isn't hard to imagine that there even more
ways to do it in some lower-level, intermediate representation.
Much work in advanced obfuscation was pioneered by Christian Collberg
currently at the University of Arizona.
Consider a bytecode transformation called transient variable caching (TVC).
This transformation looks for opportunities
to remove references to transient variables by utilizing the stack that the
Java runtime uses. The effect is the removal of inevitably unneeded
bytecode instructions. The canonical example is a simple swap of two
integers:
int temp = a;
a = b;
b = temp;
This compiles into bytecode into something similar to:
push a
pop temp
push b
pop a
push temp
pop b
This example is not actual bytecode, but is representative and hopefully
more understandable. Simply, each assignment
is a pair of instructions that push the value of the local variable onto the
stack and pop it off, storing in another.
After TVC, this sequence would appear as the much more logical:
push a
push b
pop a
pop b
Effectively, instead of the temp variable, the stack is used as
transient storage. Sounds good. Now what will a decompiler make of this? There is no
way in Java to "save the value of a on the stack for use later." Amusingly,
many decompilers decompile this code as:
a = b;
b = a;
JUnit is going to throw a hissy over that one. A decompiler writer is forced to solve a
large set of problems, and adding curveballs like
this only makes his job harder. In fact, some simple transforms would require full data-flow
analysis to undo--that's a feature most decompilers don't have, and it isn't terribly compelling
to add to counter one feature of one obfuscator.
Sometimes, even classic optimizing transforms do a fine job of
obfuscation. For another
example, consider the following for loop:
for (int g=0;g<x;++g) {
// loopbody
}
A common compiler optimization is to perform loop-unrolling.
By unrolling the loop, you can remove the loop overhead
(incrementing and comparing g, in the above case). If
the loop iterates to a small known value, then you can simply
unroll the loop that many times and remove the loop altogether.
For example, if we have 5 instead of x
in the above example, the loop could become:
// loopbody[1]
// loopbody[2]
// loopbody[3]
// loopbody[4]
// loopbody[5]
This isn't a loop at all, but we save the loop overhead.
Note we have labeled each loop body with a number, but the each
loop body is an identical piece of code. In the original example
however, we did not have a loop limit of 5; we had x,
an apparent unknown value. For this type of loop unrolling, we
can implement something called Duff's Device, which would look like:
int y = x%5;
x = x/5;
int g=0;
if (y>0) goto loopbody[6-y];
for (;g<x;++g) {
// loopbody[1]
// loopbody[2]
// loopbody[3]
// loopbody[4]
// loopbody[5]
}
By following the logic you'll see that y
represents the remainder number of iterations mod 5. Those
iterations are handled first and x
represents x*5 more iterations. So,
if before this code x=36, then y=1
and x=7 as the loop begins. The goto jumps into the
for loop at loopbody[5] and then the main loop
iterates seven more times (representing 35 more loop bodies).
The most intriguing part of this transformation is not the
performance improvement you obtain, but the fact that this code isn't
possible in a language like
Java. The problem, of course, is that Java has no
goto. Conversely, the compiled representations of
Java (bytecode) fully support actual goto instructions. Since
obfuscators work at this level, they can introduce gotos as
needed.
Because of the lack of goto in some high-level languages,
there is no way for a decompiler to succinctly transform this
code (in a general case) back to source code. If a generalized
algorithm introduced similar constructs into compiled Java code,
most decompilers wouldn't have a chance. Specifically, only one
decompiler is known to be able to decompile the above obfuscated
code correctly. It is a Java decompiler named Sourceagain, from
Ahpah software. Even then, once decompiled, the above code
sequence is decompiled into about 250 lines of source code to
represent an algorithmically correct (albeit rather dense)
decompilation.
It's actually rather cathartic writing a control-flow
obfuscator. The reason is that when you create a brand new
control-flow (or data-flow) transformation, no decompiler will be
ready for it. As you test your new transformation through a set
of decompilers, they often crash wildly. It's the obfuscators
that set the stage for what is done to code. Decompilers must
wait to see what comes along and play catch-up.
If Strings are Encrypted, Only Outlaws Will e5we$#%Gg
Another common feature in today's obfuscators is string
encryption. That is, an obfuscator will encrypt all string
literals within a program and then decrypt them during runtime,
immediately before their use. String encryption represents one of the few
popular, lossless transformations within a obfuscator.
There is a classic problem that rears its head in the act of
delivering encrypted software. On one hand, you provide your
customers a program so they can run it, but on the other hand, you
don't want them peering into its internals. If you encrypt it,
you have bought yourself little, since in order for them to run
it, it will at some point need to be decrypted. In current
computing systems, this problem is insurmountable. If you allow a
program to be decrypted to be run, it can also be intercepted and
examined. Not only must you provide the encryption key, you're providing
some code to actually do the encryption. Even the
strongest encryption algorithm known to man is completely useless
if you provide the enemy the key.
This then begs the question, "What good is encrypting string
literals within a program?" This question actually backs into a
philosophical question of code protection as a whole. It is of
course possible to completely reverse engineer a C program. It is
far harder than reverse engineering a Java program, but it's still
possible. Now, obfuscation never became a real interest for
C developers. Why? The probable answer is that reverse
engineering C is so difficult that that difficulty provided its
own inherent protection. Only a scant few people would go to the
trouble of reverse engineering such a program.
By using
a simple and fast encryption algorithm to encrypt all of a
program's string literals, you force a reverse engineer to take
the extra step of undoing the encryption. Keep in mind that a
weak and a strong encryption are arguably equal to each other in
this context. Since the strings much be decrypted before runtime,
the keys for decryption must exist somewhere in the program. A
hacker isn't going to decrypt the strings via brute force;
they'll simply examine the code until they find the key. With
that in mind, obfuscators generally opt for weaker and
faster encryption. String encryption is one of the few techniques
that may hurt performance, so it's important that it be fast.
A Step Ahead
The biggest argument against obfuscation I've seen is that someone
proclaims their software doesn't need it. That's a pretty solid argument.
Certainly open source, free-to-use, or sell-with-source applications
don't need to protect their source. However, it shouldn't be hard to
imagine other scenarios where government or businesses need as much
protection as they can get. Government regulations have dictated that
certain types of encryption code be protected--that's why the JCE
API in Sun's JDK is obfuscated.
As we've said, obfuscators have a good technical and
business reason to stay ahead of decompilers. Given the
additional features such as code-pruning, optimization,
packaging, and code size-reduction, there is little reason not to
obfuscate Java code. Just like makefiles, source
control, and optimizers, obfuscation is destined to be a regular
step in your future build processes. You can be confident that anyone
trying to reverse engineer your code is going have a bad day--and without a doubt, that's the
last we'll see of good old Aunt Edna.
References
Paul Tyma is vice president and chief scientist of PreEmptive Solutions, Inc., a company specializing in software code protection. He received his Ph.D. in computer engineering from Syracuse University.