Skip to main content

speed up matrix multiplication

9 replies [Last post]
rodes
Offline
Joined: 2008-06-04
Points: 0

Dear all,

I'm currently comparing the speed of a matrix multiplication written in Java with its counterpart in C++.
Here is the Java version:
[code]
public class MatrixDouble {
private final int numRows,numCols ;
private final double[][] data ;

public void times(MatrixDouble other,MatrixDouble result) {
... // some dimension tests here
double[] other_col_j=new double[this.numCols] ;
for(int j=0;j

Reply viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
mthornton
Offline
Joined: 2003-06-10
Points: 0

There are some examples here:

http://artisans-serverintellect-com.si-eioswww6.com/default.asp?W32

These are mostly intended for much larger matrices (e.g. 1000*1000) and multicore machines, yet some of the principles will still apply to smaller tasks.

I wrote this bit:
http://artisans-serverintellect-com.si-eioswww6.com/default.asp?W61

rodes
Offline
Joined: 2008-06-04
Points: 0

> There are some examples here:

Thanks, no doubt that recursive and parallel multiplication algorithms are faster than the naive algorithm. However, I'm more interested in the difference in speed between Java and C++. Have you compared your recursive multiplication implementation with a C++ implementation of the same algorithm?

By the way: it just tried my implementations with larger matrixes (size 1000). Interestingly, the difference between Java and C++ is much smaller in that case. I suppose that the memory bandwidth becomes critical for such matrixes (larger than the l1 and l2 caches), not the code generated by the compiler.

mthornton
Offline
Joined: 2003-06-10
Points: 0

No I have not done any comparison with C++. You could try my local optimised algorithm which adds a bit of loop unrolling to keep the CPU busy.

The recursive algorithm and variants is important when the matrix size exceeds the cache size --- the whole idea is to keep the subproblems within the cache. The parallel version achieves close to linear speed up on these large matrices, rather useful in these days when multicores are everywhere.

rodes
Offline
Joined: 2008-06-04
Points: 0

> You could try my local optimised algorithm which adds a
> bit of loop unrolling to keep the CPU busy.

Very good. Unrolling significantly improves the performance of the Java and C++ implementations. Here are my results for 50x50 matrices:

without (manual) unrolling:
C++: --- (SSE2 deactivated)
Java: +36% (runtime relative to equivalent C++ version)
with unrolling:
C++: -25% (runtime relative to C++ version without unrolling)
Java: +25% (runtime relative to equivalent C++ version)

For larger matrices (size 1000) the difference between no-unrolling and unrolling and the difference between C++ and Java become smaller. I suppose the programs are approaching the max. memory bandwidth. As you already wrote, it would be wiser to switch to more local algorithms in that case. For the sake of completeness, here are the results for 1000x1000 matrices:

without manual unrolling:
C++: --- (SSE2 deactivated)
Java: +18% (runtime relative to equivalent C++ version)
with unrolling:
C++: -4% (runtime relative to C++ version without unrolling)
Java: +7% (runtime relative to equivalent C++ version)

sprizor
Offline
Joined: 2008-07-14
Points: 0

Once you start doing larger matrices (1000 would qualify, 500 would show maybe 3-5% improvement) you're going to want to extract methods from your loops.

The JIT compiler will compile hot methods to native before cold methods, and right now most of your methods are cold. If you extract your loops into methods:

for(int i=0; i doSomething(...);
}

you will see improvement because the JVM will compile the majority of the work into native quickly, whereas leaving the hot work in the cold method will keep it interpreted for a longer period of time (perhaps indefinitely).

Again, you'll only start to see improvement when the loop length gets to be large. 500 showed me about 5% improvement on average, but I've done other math tests with around 10,000 iterations and gained 200% improvement.

ijuma
Offline
Joined: 2005-10-19
Points: 0

> Other "optimizations", such as using a one-dimensional
> array in the Java version, or using ugly pointer
> arithmetics in the C++ version, yield [i]slower[/i]
> code.

This is interesting. I am assuming that you have runnable code for both cases (one-dimensional array and two-dimensional array). It would be useful if you could post both versions so that others can inspect the code and validate the results (it's very easy to make mistakes when writing micro-benchmarks for Java code). Runnable code is best because then we can be sure that there are no differences.

Ismael

rodes
Offline
Joined: 2008-06-04
Points: 0

Here it is.
(is there a way to attach a whole file to a posting?)
The output should look like:
Time for 2D: 4297
8232600.0
Time for 1D: 4344
8232600.0
Note that I'm using the server JIT.

[code]
public class MatrixBenchmark {
public static void main(String args[]) {
run(50,20000) ;
}

private static void run(int size,int numIterations) {
MatrixDouble m1=createMatrix(size,size) ;
MatrixDouble m2=createMatrix(size,size) ;
MatrixDouble m3=new MatrixDouble(size,size) ;

long start=System.currentTimeMillis() ;
for(int i=0;i m1.times(m2,m3) ;
}
System.out.println("Time for 2D: "+(System.currentTimeMillis()-start)) ;
System.out.println(m3.get(2,3)) ;

MatrixDouble1D m11D=createMatrix1D(size,size) ;
MatrixDouble1D m21D=createMatrix1D(size,size) ;
MatrixDouble1D m31D=new MatrixDouble1D(size,size) ;

start=System.currentTimeMillis() ;
for(int i=0;i m11D.times(m21D,m31D) ;
}
System.out.println("Time for 1D: "+(System.currentTimeMillis()-start)) ;
System.out.println(m31D.get(2,3)) ;
}

private static MatrixDouble createMatrix(int rows,int cols) {
MatrixDouble m=new MatrixDouble(rows,cols) ;
int count=1 ;
for(int i=0;i for(int j=0;j m.set(i,j,count++) ;
}
}
return m ;
}

private static MatrixDouble1D createMatrix1D(int rows,int cols) {
MatrixDouble1D m=new MatrixDouble1D(rows,cols) ;
int count=1 ;
for(int i=0;i for(int j=0;j m.set(i,j,count++) ;
}
}
return m ;
}
}

class MatrixDouble {
private final int numRows,numCols ;
private final double[][] data ;

public MatrixDouble(int numRows,int numCols) {
this.numRows=numRows ;
this.numCols=numCols ;
data=new double[numRows][numCols] ;
}

public double get(int row,int col) {
return data[row][col] ;
}

public void set(int row,int col,double value) {
data[row][col]=value ;
}

public void times(MatrixDouble other,MatrixDouble result) {
//if(other.numRows!=this.numCols || result.numRows!=this.numRows || result.numCols!=other.numCols) {
// throw new IllegalArgumentException("matrix sizes not correct") ;
//}

double[] other_col_j=new double[this.numCols] ;

for(int j=0;j for(int i=0;i other_col_j[i]=other.data[i][j] ;
}
for(int i=0;i double[] this_row_i=data[i] ;
double sum=0.0 ;
for(int k=0;k sum+=this_row_i[k]*other_col_j[k] ;
}
result.data[i][j]=sum ;
}
}
}

public MatrixDouble times(MatrixDouble other) {
MatrixDouble result=new MatrixDouble(this.numRows,other.numCols) ;
times(other,result) ;
return result ;
}
}

class MatrixDouble1D {
private final int numRows,numCols ;
private final double[] data ;

public MatrixDouble1D(int numRows,int numCols) {
this.numRows=numRows ;
this.numCols=numCols ;
data=new double[numRows*numCols] ;
}

public double get(int row,int col) {
return data[this.numCols*row+col] ;
}

public void set(int row,int col,double value) {
data[this.numCols*row+col]=value ;
}

public void times(MatrixDouble1D other,MatrixDouble1D result) {
//if(other.numRows!=this.numCols || result.numRows!=this.numRows || result.numCols!=other.numCols) {
// throw new IllegalArgumentException("matrix sizes not correct") ;
//}

double[] other_col_j=new double[this.numCols] ;

for(int j=0;j for(int i=0;i other_col_j[i]=other.data[other.numCols*i+j] ;
}
for(int i=0;i double sum=0.0 ;
int ii=this.numCols*i ;
for(int k=0;k sum+=data[ii+k]*other_col_j[k] ;
}
result.data[other.numCols*i+j]=sum ;
}
}
}

public MatrixDouble1D times(MatrixDouble1D other) {
MatrixDouble1D result=new MatrixDouble1D(this.numRows,other.numCols) ;
times(other,result) ;
return result ;
}
}
[/code]

spdenne
Offline
Joined: 2006-02-15
Points: 0

Eclipse's compiler produces different results than Sun's compiler, so if you're looking for the fastest implementation then try both.

Repeat the timing of your 20000 iterations a few times.

Create other_col_j once per object instead of once per call to times().

Replace some of your loop bounds checks with the array.length. (Your 1D method doesn't seem to gain anything from that, but the 2D one does)

Original:

C:\workspaces\Testing>"\Program Files\Java\jdk1.6.0_06\bin\javac.exe" -d bin src\MatrixBenchmark.java

C:\workspaces\Testing>"\Program Files\Java\jdk1.6.0_06\bin\java.exe" -server -classpath bin MatrixBenchmark
Time for 2D: 6499
8232600.0
Time for 1D: 6590
8232600.0

New:

Time for 2D: 6159
8232600.0
Time for 1D: 6289
8232600.0
Time for 2D: 6149
8232600.0
Time for 1D: 6239
8232600.0
Time for 2D: 6119
8232600.0
Time for 1D: 6229
8232600.0

[code]
public void times(MatrixDouble other,MatrixDouble result) {
for(int j=0;j for(int i=0;i other_col_j[i]=other.data[i][j] ;
}
for(int i=0;i double[] this_row_i=data[i] ;
double sum=0.0 ;
for(int k=0;k sum+=this_row_i[k]*other_col_j[k] ;
}
result.data[i][j]=sum ;
}
}
}
[/code]

spdenne
Offline
Joined: 2006-02-15
Points: 0

You could also set sum to a better initial value than 0:

[code]
public void times(MatrixDouble other,MatrixDouble result) {
for(int j=0;j for(int i=0;i other_col_j[i]=other.data[i][j] ;
}
for(int i=0;i double sum=data[i][0]*other_col_j[0] ;
for(int k=1;k sum+=data[i][k]*other_col_j[k] ;
}
result.data[i][j]=sum ;
}
}
}
[/code]
Time for 2D: 6149
8232600.0
Time for 1D: 6279
8232600.0
Time for 2D: 6069
8232600.0
Time for 1D: 6218
8232600.0
Time for 2D: 6099
8232600.0
Time for 1D: 6249
8232600.0