Posted by skelvin
on January 13, 2006 at 6:53 AM PST
Make string comparisons a little more clever by correctly sorting contained numbers. Hey, Windows XP finally sorts files intelligently, e.g. picture 9.jpg < picture 10.jpg < picture 11.jpg.
So your Java software should be able to avoid stupidities like picture 1.jpg < picture 10.jpg < picture 2.jpg
Treats contained numbers sensible when comparing strings
(e.g. implemented in Windows XP's file explorer)
Java's default Order
(Kudos to Martin Pool for the example.)
When I needed a natural sort feature in my Java app I started to google and to look in a few other places
(like Apache Commons).
There should be plenty of code out there that does this, right? Nope.
I found a page with a description of the problem and links to implementations in a couple of programming languages:
Natural Order String Comparison
There's a single Java implementation, but that one cannot handle texts that contain non-ascii characters.
So it would fail in my German application, where texts contain funny characters such as these:
Im Ãœbrigen macht's mehr SpaÃŸ, es selbst zu schreiben.
Consists of a couple of static methods in class Strings: Strings.java
Here is the Javadoc .
Some junit tests: TestStrings.java
Basically there are some variants of methods
compareNatural...(String, String) that directly compare
strings and convenience function, if you need a string
The different variants either use either the current default locale's comparison rules, allow to specify a
to use or work on 7-bit ascii characters only.
Java 1.5: There is a little bit of generics used (e.g.
Comparator<String>). It's easy to change
if you require the code to work on older Java versions.
Basics and Intricacies
Basically natural comparison works by breaking each string into parts where each part is either a number or text-only.
Number parts are compared according to their corresponding numeric value while text parts are compared lexicographically.
However simple this seems, there are quite some subtleties to handle.
A nice, detailed explanation is also given in this spec from the OpenOffice project:
Natural Sort Option in Sort Dialog
Exactly how to determine which parts of a text are numbers can be surprisingly difficult. The two complicating issues
are non-integer numbers and internationalization.
My implementation handles only integer numbers, so "a 1.3" is considered less than "a 1.11" because each is broken
into four parts ("a ", "1", "." and "3" resp. "11") and 3 is less than 11.
The need to handle fractional numbers should arise very rarely and trying to handle these opens a can of worms
(different decimal point representations in different locales, scientific formats, ...).
Digits are defined differently in different locales. Currently my implementation starts a new token whenever the
value of Character.isDigit(c) changes.
My minimum requirement was to compare the text parts of the strings using locale specific rules (which means you can
The code should work fine for any language that uses the number characters '0'...'9'.
Other number systems are not tested (they will probably not work, because I do not really convert the
number to a numeric value, but compare each digit separately).
Once upon a time it seemed to me that Java's native unicode support was the end of all character set problems.
No more, no more: The complete Unicode definition now contains character that cannot be represented by a single (16 bit)
char variable, see this article from sun
These supplementary characters are simply not handled correctly in my natural comparison implementation.
Numbers with leading zeros
How should the strings "p0002", "p2", "p02" be sorted?
Strictly comparing the numeric values of the number parts would leave the sort order undefined, as each of these strings
will be considered equal.
I have chosen to sort by increasing numbers of leading zeros: "p2", "p02", "p0002". That way
String.equals(String) are still compatible:
compare() will return 0 exactly if
How do "p4" and "p 3" compare?
I have choosen not to ignore whitespace, so "p4" is less than "p 3". I'd like to be alerted to difference in
the naming scheme, so I think this is reasonable. Maybe later I'll add an option to control the handling of whitespace.
First of all I don't think performance really matters in the typical application of natural comparison. You probably
don't have hundreds of thousands of files in the same directory whose names differ only in a number, do you?
Still, the very simple approach of breaking the string into parts (effectively creating two list of strings) and then
comparing each string separately might be too costly.
So I took care to avoid creating new instances where possible and do all comparisons in place.
For many scenarios the natural comparison is about as fast as Java's String.compareTo() method. If there are many
strings that have a common prefix and contain a number that differs in trailing digits only the natural comparison
is about four times slower.