Java NIO performance

25 replies [Last post]
mas7871
Offline
Joined: 2005-11-07

I'm new to Java and have been using Ivor Horton's Beginning Java 2 JDK 5 Edition to help me learn the language and APIs. The chapter on file reading was a bit confusing in that it uses the NIO library to read files. It explains that this is the preferred method because it improves performance.

I want to process a pipe-delimited file that has two fields per line. First, I wrote a program using a BufferedReader and its readLine() method and measured the elapsed time with start and end GregorianCalendar objects. Second, I wrote another program using a ByteBuffer and channels to read the file, and timed it the same way.

To my surprise the performance is the same (~6.5 seconds)! Could someone please help me understand why there wasn't any performance difference? (Also, I wrote a program in C# and it processed the file in 1.5 seconds.)

Here is the code for both of my Java efforts:
ByteBuffer:

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.GregorianCalendar;

public class Main
{
    public static void main(String[] args)
    {
        int lineCount = 0;
        GregorianCalendar start = null;
        GregorianCalendar end = null;
        String[] fields = null;
        String field1 = null;
        String field2 = null;
        ByteBuffer buf = ByteBuffer.allocateDirect(65536);
        FileInputStream inStream = null;
        FileChannel inChannel = null;
        StringBuffer strBuf = new StringBuffer();

        try
        {
            inStream = new FileInputStream("myData.dat");
            inChannel = inStream.getChannel();

            start = new GregorianCalendar();

            char c;
            while(inChannel.read(buf) != -1)
            {
                buf.flip();
                while(buf.position() < buf.limit())
                {
                    c = (char)buf.get();
                    if (c == '\n')
                    { // have a line from file
                        fields = strBuf.toString().split("[|]");
                        lineCount++;
                        field1 = fields[0].substring(0,8);
                        field2 = fields[1];
                        field1 += field2;
                        strBuf.delete(0, strBuf.length()); // clear string buffer
                    } // finished processing file line
                    else
                    {
                        strBuf.append(c); // another character from the file
                    }
                }
                // ByteBuffer exhausted, read more data
                buf.clear();
            } // end of file reading loop
            end = new GregorianCalendar();
            inStream.close();
        }
        catch (IOException e)
        {
            e.printStackTrace(System.err);
            System.exit(1);
        }

        double elapsed = (end.getTimeInMillis() - start.getTimeInMillis())/1000.0;
        System.out.println("Elapsed Time: " + elapsed);
    }
}

BufferedReader:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.GregorianCalendar;

public class Main
{
    public static void main(String[] args)
    {
        int lineCount = 0;
        GregorianCalendar start = null;
        GregorianCalendar end = null;
        String[] fields = null;

        String field1 = null;
        String field2 = null;

        try
        {
            BufferedReader input = new BufferedReader(
                new FileReader("myData.dat"));

            start = new GregorianCalendar();
            String line;
            // readLine() returns null at end of file; ready() is not a reliable EOF test
            while((line = input.readLine()) != null)
            {
                fields = line.split("[|]");
                lineCount++;
                field1 = fields[0].substring(0,8);
                field2 = fields[1];
                field1 += field2;

            }
            end = new GregorianCalendar();
            input.close();
        }
        catch (IOException e)
        {
            e.printStackTrace(System.err);
            System.exit(1);
        }

        double elapsed = (end.getTimeInMillis() - start.getTimeInMillis())/1000.0;
        System.out.println("Elapsed Time: " + elapsed);
    }
}

Sorry for such a huge post but I thought that including the code would help find any errors I made.

Please help a newbie understand. Thanks.

rickcarson
Offline
Joined: 2004-03-04

> rickcarson,
>
> Running your code several times yielded the following
> results:
>
> 5959
> 1783
> 1873
> 1883
>
> So the results are much improved. I assume the first
> run was an anomaly. A JIT thing?
>
> So, it just goes to show you that there's more than
> one way to get something done. Where is the
> performance improvement coming from? The
> LineNumberReader or not using the split method?

Well, I'll turn on smug mode anyway :D

The line number reader is an old friend of mine from way back when. In this case I was aiming for simplicity, correctness and *readability*. Which is similar to the well known saying 'do the simplest thing that could possibly work'.

However, if I were going to nitpick my own code, the very first thing I'd get upset about is the assumption that the pipe symbol always appears as the twelfth character.

In fact, to check this, I'd usually write a script similar to the one I gave you, but at each line I'd simply check to make sure that the twelfth character was actually a pipe, and if it wasn't then I'd toss my toys out of the cot. It's a couple of minutes' work, but it is good to check your assumptions (or another way of putting it: 'leave nothing to fate').

I know that it is probably coming out of a database, and that the result of whatever op it was probably 'guarantees' that the first part is exactly eleven characters, but if there was even one that wasn't... you could spend hours or days trying to debug the darn thing.

olsonje
Offline
Joined: 2005-08-10

Some questions, as I can't accept that the times should be that different. Why would you use GregorianCalendar to get the current time rather than System.currentTimeMillis()? The cost of creating the two Calendar objects is minimal, on the order of nanoseconds, but I don't see the reason to use them instead of currentTimeMillis().

Also, can you time the split you're using and find the average time for it? I don't know its impact, but I would hope it's not too much.

So weird, and depressing regarding .NET :(

Btw, I'm just a normal user, not a performance guy, nor do I work for Sun... Just a normal Joe :)

olsonje
Offline
Joined: 2005-08-10

Darn work and its interfering. Sorry for posting after you, Scott! I was in the middle of typing this when people started asking me things, and I got delayed a bit before I could finish it. Interesting about the two-byte thing.

rickcarson
Offline
Joined: 2004-03-04

mas7871, could you try the following on your test data, and let me know how it compares? I am curious :D

import java.io.*;
public class LineTest {
    public static void main (String[] args) {
        try {
            readFile("Text.txt");
        } catch (Exception e) {
            System.err.println("oops: " + e);
            e.printStackTrace();
        }
    }
    static void readFile(String filename) throws Exception {
        long l = System.currentTimeMillis();
        LineNumberReader lnr = new LineNumberReader(new FileReader(new File(filename)));
        String s = lnr.readLine();
        while (s != null) {
            String left = s.substring(0,11);
            String right = s.substring(12);
            // insert your test op here - don't use system.out
            // because it will be incredibly slow :D
            // System.out.println(s);
            // System.out.println(left);
            // System.out.println(right);
            s = lnr.readLine();
        }
        lnr.close();
        System.out.println("" + (System.currentTimeMillis() - l));
    }
}

mas7871
Offline
Joined: 2005-11-07

rickcarson,

Running your code several times yielded the following results:

5959
1783
1873
1883

So the results are much improved. I assume the first run was an anomaly. A JIT thing?

So, it just goes to show you that there's more than one way to get something done. Where is the performance improvement coming from? The LineNumberReader or not using the split method?

swpalmer
Offline
Joined: 2003-06-10

When running code that reads files from disk multiple times you have to be very careful. Chances are that after the first run the OS had the file cached in RAM. Of course, that could be a good thing for this test, it helps eliminate the speed of the disk itself from the equation.
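
One way to see the caching effect described here is to repeat the same read several times inside a single JVM and print each pass separately. The following is only a minimal sketch: the class name RepeatTiming and the pass count are made up for illustration, and the default file name is simply the one used in the original post.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper (not from the thread): repeats one measurement in a single JVM.
class RepeatTiming {
    // The work being timed: read every line and count them.
    static int countLines(Reader source) throws IOException {
        BufferedReader in = new BufferedReader(source);
        int lines = 0;
        while (in.readLine() != null) {
            lines++;
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // File name is an assumption; pass your own as the first argument.
        String fileName = args.length > 0 ? args[0] : "myData.dat";
        for (int pass = 1; pass <= 5; pass++) {
            long start = System.currentTimeMillis();
            FileReader reader = new FileReader(fileName);
            int lines;
            try {
                lines = countLines(reader);
            } finally {
                reader.close();
            }
            long elapsed = System.currentTimeMillis() - start;
            // Pass 1 may include cold-cache and JIT warm-up effects; later passes usually do not.
            System.out.println("pass " + pass + ": " + lines + " lines in " + elapsed + " ms");
        }
    }
}
```

If the first pass is much slower than the rest, the difference is probably disk or warm-up cost rather than parsing cost.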

rickcarson
Offline
Joined: 2004-03-04

> When running code that reads files from disk multiple
> times you have to be very careful. Chances are that
> after the first run the OS had the file cached in
> RAM. Of course, that could be a good thing for this
> test, it helps eliminate the speed of the disk itself
> from the equation.

Interesting point - in that case if the java test was run first, and then the c# test immediately afterwards, that would help explain the large discrepancy in times...?

denka
Offline
Joined: 2003-07-06

Since you actually have two delimiters, the "|" and the "\n", you might reorganize your code to use a simple state machine. This eliminates the need to split the string.

With regard to NIO, its use is also justified in situations that would otherwise become heavily multithreaded. Multiple socket requests get multiplexed and processed by a smaller number of threads; that's one area where there is a gain from using NIO.

Message was edited by: denka

Darn! Espen seems to have typed the solution while I was editing my suggestion!

kolstae
Offline
Joined: 2004-11-16

Hi!

Roll your own split with this code:

int idx = 0;
boolean pipeFound = false;
while (inChannel.read(buf) != -1) {
    buf.flip();
    while (buf.position() < buf.limit()) {
        final char c = (char) buf.get();
        if (c == '\n') {
            idx = 0;
            pipeFound = false;
            lineCount++;
            field1 = strBuf.toString();
            strBuf.setLength(0);
        } else if (idx++ < 8) {
            strBuf.append(c);
        } else if (c == '|') {
            pipeFound = true;
        } else if (pipeFound) {
            strBuf.append(c);
        }
    }
    buf.clear();
}

On 1,355,256 lines (29.7MB), this took 1.0... seconds on my machine (AMD 3200+),
while your original code took 6.6 seconds on the same file.

It's fun though ;-)

Espen

ahmetaa
Offline
Joined: 2003-06-21

Some more notes:
- I think NIO is better suited to very large file operations, such as copying or reading binary data. It is not really suited to this kind of text file and operation. Normal I/O nowadays is as fast as NIO for many kinds of operations and relatively small files. So I say use normal I/O.
- I strongly suggest using FileInputStream with InputStreamReader instead of FileReader. With InputStreamReader you can define the encoding. In my test, when I used ISO-8859-1, I got better performance.

BufferedReader input = new BufferedReader(
new InputStreamReader(new FileInputStream("data.dat"), "ISO-8859-1"));

- compiling the splitter pattern before using definitely makes it faster.

public static final Pattern SPLITTER = Pattern.compile("[|]");

and for usage:

String[] fields = SPLITTER.split(line); //or (line,2)

- not using the splitter at all is even faster.

Now the numbers:

I have a huge text file (128MB) with |-delimited lines. Each line actually has 10 parameters but I use only the first two. The total number of lines is 2.18 million.

If I only read the text from the file, it takes 2.4 seconds (I have a slow drive).
If I also do the string manipulation, it takes 14 seconds. So we can assume that most of the time (about 80%) is spent in string manipulation. As you can see, using NIO is unnecessary here.

If I do not define the encoding as "ISO-8859-1", the time for only reading the file goes from 2.4 seconds to 4.8 seconds. So, if your file contains only ASCII characters, defining the encoding makes file access twice as fast.

If I do not compile the splitter pattern, the operation time goes from 14 seconds to 17 seconds.

If I do not use the splitter and instead do what another poster suggested (find the delimiter location and apply a substring operation there), I get the fastest result: 4 seconds. The string-operation part of the program is now 5 TIMES faster.

So here are the results:
- Java I/O is not slow. Using NIO here is UNNECESSARY.
- The culprit in this code is the splitter. You should still use it in general; however, when extreme performance is needed, write your own splitter. You can get 3-4 times better performance.
- Use stream readers, and define a proper encoding if necessary.
- If you use regular expressions extensively, compile the patterns before using them.
- If memory issues occur in this kind of big loop, use

field1 = new String(fields[0].substring(0, 8));

instead of

field1 = fields[0].substring(0, 8);

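to avoid keeping the whole source string in memory (on this JDK, substring shares the original string's character array; that is sharing, not interning).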

- Giving a big initial Heap memory with such as -Xms32M helps the performance.
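
To illustrate the first note in this list, bulk operations such as copying a file are the kind of job where a channel-based approach is a natural fit. The following is a rough sketch only: the class name ChannelCopy and its copy method are invented for this example, and no timing claims are made for it.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

// Hypothetical example (not from the thread): copy a file with FileChannel.transferTo.
class ChannelCopy {
    // Copies the file at 'from' to 'to' and returns the number of bytes copied.
    static long copy(String from, String to) throws IOException {
        FileInputStream in = new FileInputStream(from);
        try {
            FileOutputStream out = new FileOutputStream(to);
            try {
                FileChannel src = in.getChannel();
                FileChannel dst = out.getChannel();
                long size = src.size();
                long position = 0;
                // transferTo may move fewer bytes than requested, so loop until done.
                while (position < size) {
                    position += src.transferTo(position, size - position, dst);
                }
                return position;
            } finally {
                out.close();
            }
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Both file names are expected on the command line.
        System.out.println(copy(args[0], args[1]) + " bytes copied");
    }
}
```

No character decoding or string handling happens here, which is why this case differs so much from the line-splitting test in this thread.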

ahmetaa
Offline
Joined: 2003-06-21

A note: I used Java 5 update 5. I heard update 5 has some performance enhancements, so I suggest using it.

ahmetaa
Offline
Joined: 2003-06-21

And here is the code:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Pattern;

public class Main
{
    public static final Pattern SPLITTER = Pattern.compile("[|]");
    public static void main(String[] args)
    {
        int lineCount = 0;
        long start=0;
        String field1 = null;
        String field2 = null;

        try
        {
            BufferedReader input = new BufferedReader(
                new InputStreamReader(new FileInputStream("flat_data/big.txt"), "ISO-8859-1"));

            start = System.currentTimeMillis();
            String line;

            while ((line=input.readLine())!=null)
            {
                String[] fields = SPLITTER.split(line,2);
                lineCount++;
                field1 = new String(fields[0].substring(0, 5));
                field2 = fields[1];
                field1 += field2;
            }
            input.close();
        }
        catch (IOException e)
        {
            e.printStackTrace(System.err);
            System.exit(1);
        }

        double elapsed = (System.currentTimeMillis() - start) / 1000.0;
        System.out.println("Elapsed Time: " + elapsed + " lines:"+lineCount);
    }
}

krishnan_mms
Offline
Joined: 2005-11-10

As somebody pointed out earlier, split uses regex. Changing the BufferedReader version to use indexOf instead of split makes it three times as fast (on my old machine).

// Replace the code inside the inner loop with this.
String line= input.readLine();
int index = line.indexOf('|');
lineCount++;
field1 = line.substring(0, index).substring(0, 8);
field2 = line.substring(index + 1);
field1 += field2;

Also check whether you are doing exactly the same thing in C#.

For instance, removing the line

field1 += field2;

makes it a further 30% or so faster.

ahmetaa
Offline
Joined: 2003-06-21

Instead of using NIO, I suggest trying normal I/O operations, such as "String line = reader.readLine();" etc.
Some more things:
- Although the split method is the recommended one, you can use StringTokenizer when performance matters; it does not use regular expressions. If you want to use split, compile the pattern in the class as
static final Pattern SPLITTER = Pattern.compile("[|]"); and use it.
- The performance difference probably comes from the way Strings are encoded, I assume. Try the same thing on internationalized text.
- substring and + operations are costly. Beware of how Strings are retained ( http://mindprod.com/jgloss/interned.html ). When you use "fields[0].substring(0,8);", the character data behind fields[0] is not garbage collected as long as the substring is referenced. That is generally a good thing, but it may cause memory issues while processing huge files in a loop.
- Your test case is a little unfortunate since you have only two parameters per line. Results might be different if you split more parameters, or use a different structure without much String manipulation. String manipulation is expensive. C# is possibly using tricks and operating on everything in ASCII internally. As I suggested, test with a UTF-8 encoded file.
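
The StringTokenizer suggestion in the list above could look roughly like this for a two-field line. It is a sketch under assumptions: the class TokenizerSplit and the method splitTwo are invented names, and the sample line is taken from the example data posted in this thread.

```java
import java.util.StringTokenizer;

// Hypothetical example (not from the thread): split "left|right" without regular expressions.
class TokenizerSplit {
    // Returns the first two '|'-separated tokens of the line.
    // Note: StringTokenizer skips empty tokens, so "|abc" yields "abc" as the first token,
    // and a line with no tokens at all makes nextToken() throw NoSuchElementException.
    static String[] splitTwo(String line) {
        StringTokenizer st = new StringTokenizer(line, "|");
        String first = st.nextToken();
        String second = st.hasMoreTokens() ? st.nextToken() : "";
        return new String[] { first, second };
    }

    public static void main(String[] args) {
        String[] fields = splitTwo("ABCDEFGHIJK|123456790ABCEFG");
        System.out.println(fields[0]); // prints ABCDEFGHIJK
        System.out.println(fields[1]); // prints 123456790ABCEFG
    }
}
```

Because empty fields are silently skipped, this only behaves like split when every field is known to be non-empty.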

krausest1
Offline
Joined: 2003-06-12

Some things I've noticed:
1. String concatenation like field1 += field2; is known to be problematic memory- and performance-wise.
2. Using a regular expression where a simple indexOf("|") would (maybe) do the same might be slower too.
Apart from that, I'd expect to see similar performance between C# and Java (whether NIO or not).

mas7871
Offline
Joined: 2005-11-07

Some updates to my testing... I forgot to remove a substring call. It was a specific function from the program I was writing and left it in by mistake. I tried to simplify the two programs for posting. I removed some more complex code and just decided to put the field1 += field2 code in just so that something was happening in the program.

For further testing I commented out the += code and replaced StringBuffer with StringBuilder, and came up with ~6.0 seconds for the BufferedReader code and ~5.7 for the ByteBuffer code. Performance did therefore improve for the NIO code. Oh, the file I'm reading is 1,355,256 lines long.

I guess since the book hyped the NIO stuff I was expecting some significant improvement in performance. The book doesn't even cover the BufferedReader readLine method in any detail. It's shown once on page 1358 in a JDBC example, where it is explained as returning a string from the file being read. I had to hunt for an example of an easier way to read strings from a file.

As for the C# stuff I was surprised. I made the same mods to the C# test (removed += code) and it only took ~1.25 seconds to process. I used .NET 2.0 beta 2 for the test. I couldn't believe the difference. I know it's only 5 or so seconds but I thought they would have been a little closer.

sdo
Offline
Joined: 2005-05-23

A few points on the last few posts:

s += "abc" is, with this compiler, essentially equivalent to:
s = new StringBuilder().append(s).append("abc").toString();

Therefore, += isn't necessarily more or less performant than using string builder directly. However, using += in a loop or other construct where you only want the final string is the use case to look out for:

StringBuilder s = new StringBuilder();
for (int i = 0; i < 100; i++) s.append(i);
String t = s.toString();

is going to perform far better than
String t = "";
for (int i = 0; i < 100; i++) t += "" + i;

But a += inline of linear code isn't something to worry about.

Second, one area where Java has performance differences from other languages can be string processing, because Java does all its processing in two-byte characters rather than single bytes. For those of us using ASCII locales, that includes translating between byte and char when doing I/O. I confess to not knowing anything about how C# works, but if it is byte-oriented like C/C++, that will explain a lot of the performance difference as well. [On the other hand, for the rest of the world, this trade-off works better.]
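
The byte-to-char translation mentioned here can be made explicit with a Charset. The sketch below is illustrative only (DecodeSketch and decodeLatin1 are invented names); it also shows that casting a byte straight to char, as the ByteBuffer program at the top of the thread does, only matches the decoder for bytes below 0x80.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// Hypothetical example (not from the thread): explicit byte-to-char decoding.
class DecodeSketch {
    // Decodes raw bytes as ISO-8859-1, one byte per character.
    static String decodeLatin1(byte[] raw) {
        Charset latin1 = Charset.forName("ISO-8859-1");
        CharBuffer chars = latin1.decode(ByteBuffer.wrap(raw));
        return chars.toString();
    }

    public static void main(String[] args) {
        byte[] raw = { 'A', '|', (byte) 0xE9 };
        String decoded = decodeLatin1(raw);
        System.out.println(decoded.length());          // prints 3
        System.out.println((int) decoded.charAt(2));   // prints 233 (U+00E9 via the decoder)
        System.out.println((int) (char) raw[2]);       // prints 65513 (sign-extended by the cast)
    }
}
```

For pure ASCII data both approaches give the same characters, so the cast in the original program is harmless there.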

Scott

mas7871
Offline
Joined: 2005-11-07

Scott,

In my post dated Nov 9, 2005 10:58 AM, I stated that I removed the += and swapped StringBuffer out for StringBuilder. All I was trying to say was that I commented out the line with "field1 += field2;" so there wasn't any string concatenation going on. Someone had also suggested trying StringBuilder instead of StringBuffer, as it might be more performant.

The latest results I posted were for both programs reading the file, where each line is "split" and the data put into field1 and field2 respectively; no other processing occurs.

As for C# and strings... From what I read online, C# stores all strings as Unicode. I assume that since it's a Windows-only environment, it converts the Unicode characters to pure ASCII when writing to Windows files. (I have tried writing files in C# and they are written in ASCII.)

olsonje,

I only used GregorianCalendar because I had originally used Date but wanted to subtract end time from start time. I was unaware of the System.currentTimeMillis() method at the time I was doing this. I'm new to Java so I'm not up on all the good stuff. I also agree about the depressing C# performance stats versus Java.

olsonje
Offline
Joined: 2005-08-10

Can you give a few example lines from the file you're reading so I can come up with a file like it to try?

As for the time thing, in 1.5 there is a method in System, nanoTime(), that lets you get the time in nanoseconds, which yields some interesting findings! :)
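
A rough way to use System.nanoTime() for the "average time of split" question raised earlier in the thread is sketched below. NanoTimer and averageNanos are invented names, the sample line comes from the example data in this thread, and the numbers it prints are only indicative since the JIT can distort small benchmarks like this.

```java
// Hypothetical helper (not from the thread): average cost of one call via System.nanoTime().
class NanoTimer {
    // Runs the task the given number of times and returns the average nanoseconds per run.
    static long averageNanos(Runnable task, int runs) {
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / runs;
    }

    public static void main(String[] args) {
        final String line = "ABCDEFGHIJK|123456790ABCEFG";
        Runnable split = new Runnable() {
            public void run() {
                line.split("[|]");
            }
        };
        averageNanos(split, 100000); // warm-up pass, result discarded
        System.out.println("split: ~" + averageNanos(split, 100000) + " ns per call");
    }
}
```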

mas7871
Offline
Joined: 2005-11-07

The file contains data from work but it has two fields. The first is fixed at 11 characters and the second is variable length ranging from 13 to 22 characters. Something like:

ABCDEFGHIJK|123456790ABCEFG
LMNOPQRSTUV|FJEIBHLAKLDI
D209DNBNS3C|BJW903VNALJE0V
ALJ3OVNE2AL|SLASDLEP09AKLAP0
...

Of course I just put gibberish in, but you get the idea.
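
For anyone wanting to reproduce the test, a file in this shape (an 11-character field, a pipe, then a 13 to 22 character field) can be generated along these lines. This is a sketch with invented names (SampleData, sample.dat) and an arbitrary alphabet; only the field lengths come from the description above.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Random;

// Hypothetical generator (not from the thread) for test data shaped like the sample lines.
class SampleData {
    static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    // Builds a random field of the given length from the alphabet above.
    static String randomField(Random rnd, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(rnd.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    // One line: fixed 11-character field, '|', then a field of 13 to 22 characters.
    static String randomLine(Random rnd) {
        int secondLength = 13 + rnd.nextInt(10);
        return randomField(rnd, 11) + "|" + randomField(rnd, secondLength);
    }

    public static void main(String[] args) throws IOException {
        // Line count defaults to 1000; the output file name is an assumption.
        int lineCount = args.length > 0 ? Integer.parseInt(args[0]) : 1000;
        Random rnd = new Random(42);
        BufferedWriter out = new BufferedWriter(new FileWriter("sample.dat"));
        try {
            for (int i = 0; i < lineCount; i++) {
                out.write(randomLine(rnd));
                out.write('\n');
            }
        } finally {
            out.close();
        }
    }
}
```

Passing 1355256 as the argument would give a file with the same number of lines as the one discussed in this thread.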

sdo
Offline
Joined: 2005-05-23

It is true that NIO can be faster than traditional I/O, but in your case you're not likely to see a difference. In a program where the I/O operations constitute a significant amount of the processing, you could expect to see some difference. But in your case, most of the time is spent manipulating the strings. Once the strings have already been read, there's no advantage to using traditional or NIO.

In my mind, the complexities of NIO mean that you really have to know that your application is I/O bound (or at least a significant user of I/O) before using NIO over traditional I/O. On the other hand, that complexity can often be worth it: at JavaOne, we showed benchmarks of our NIO HTTP connector, which performs better than corresponding C-based HTTP connectors and far, far better than an HTTP engine based on traditional I/O. But an application like that, where I/O is particularly important, is one where you'd expect NIO to shine.

This is why, as a general rule, we urge developers not to optimize code prematurely. Write modular code, find out your bottlenecks, and focus on those rather than spending time up front writing more complex code.

Scott

mas7871
Offline
Joined: 2005-11-07

Scott,

Thanks for the quick response. I did think it was a lot of work to process a delimited text file using NIO.

The file that I read was 36 Meg. I used it as a small test to check performance of regular IO and NIO. Unfortunately it is just a small sample of this type of file. I will probably be reading 350 to 400 Meg files with 15 - 20 fields per line.

The sad part about my test was the C# result. The same file that Java read in 6 seconds only took C# 1.5 seconds.

sdo
Offline
Joined: 2005-05-23

In this case, I'd say that the size of the file doesn't matter; the ratio of processing compared to reading will remain constant.

For small batch programs, C# will probably be faster, but Java should be much more performant for your bigger test. One caveat is that Java on Windows platforms is designed to minimize footprint and startup time, but the performance of longer running programs isn't as optimal as it should be. Are you using any command-line arguments when you start java? You might try running with
java -server -Xmx1500m ...
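
To confirm which VM and heap limit such options actually select, a small check like the following can be run with the same flags. This is a sketch: the class name VmInfo is invented, and the exact text of the java.vm.name property varies between JVM vendors and versions.

```java
// Hypothetical check (not from the thread): report the running VM and its maximum heap.
class VmInfo {
    static String describeVm() {
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        return System.getProperty("java.vm.name") + " "
                + System.getProperty("java.vm.version")
                + ", max heap " + maxHeapMb + " MB";
    }

    public static void main(String[] args) {
        // Run as, for example: java -server -Xmx1500m VmInfo
        System.out.println(describeVm());
    }
}
```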

-Scott

mas7871
Offline
Joined: 2005-11-07

Thanks for the suggestion. I did have a problem however. My PC has a 1.4.2 JRE which I need for web apps for work. When I installed the JDK 1.5 (5.0) version, I let it install the 1.5 JRE. When I ran your suggested -server -Xmx1500m options I got an error regarding no server JVM.

Error: no `server' JVM at `C:\Program Files\Java\jre1.5.0_05\bin\server\jvm.dll'

I looked in the directory shown and there isn't even a server directory for the JRE. When I look in the directory structure of the JDK's JRE, there is a server directory, and when I run your suggested options with that JRE the performance does increase: the file takes about 4.8 seconds.

My next question is why didn't a server JVM get installed with the 1.5 JDK install? Oh, and there isn't a server JVM for my 1.4.2 install either. Is there a special option that I needed to select on the install? Is there a way to add the server JVM without an uninstall/re-install cycle?

Also, thanks again for all your help.

Mike

felipegaucho
Offline
Joined: 2003-06-13

Just set your environment variables so that the JDK's bin directory comes first on the PATH, and then try again:

SET JAVA_HOME=C:\Program Files\Java\jdk1.5.0_05
SET PATH=%JAVA_HOME%\bin;%PATH%
SET CLASSPATH=%JAVA_HOME%\jre\lib;.

java -server ......