True multidimensional arrays

15 replies
opinali
Offline
Joined: 2003-06-17
Points: 0

My #1 wish for Mustang is simple: "int[2,3] matrix". I was once in favor of JSR-83 (the Multiarray package), but now that the Java language is more open to syntax improvements, we can just drop that JSR. We don't need a full-blown Fortran clone (BLAS libraries, etc.) bundled with every JRE; less than 1% of Java apps need that kind of functionality. We need only the most fundamental feature: multidimensional arrays in the core type system.

Sophisticated numerical APIs like JSR-83's, Jakarta Commons Math's and others could then be implemented in pure Java with top performance, without any further JVM support, for those needing such APIs. But the core array feature is useful for ALL non-trivial Java apps, even those not doing any numerical work -- MD arrays are better than square arrays-of-arrays, even for non-numerical content.

Now, a more concrete tentative specification.

Java's current arrays have this layout:

[header, size, elem0, elem1, .., elemN], N=size-1.

I propose this layout for multidimensional arrays:

[header, size, elem0, elem1, .., elemN, dims, dimsize0, dimsize1, .., dimsizeM], M=dims-1

Where "dims" is the number of dimensions for the array. For example, the array "int[2,3]" would be stored as:

[header, 6, elem0, elem1, .., elem5, 2, 2, 3]

Where elem0 = [0,0], elem1 = [0,1], elem2 = [0,2], elem3 = [1,0], elem4 = [1,1], elem5 = [1,2]. An index expression such as array[a,b] would be mapped by the compiler to *(&elem0 + 3*a + b), as usual in languages with true multidimensional arrays.
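
To make the mapping concrete, here is the same arithmetic written in today's Java with a hand-flattened array (just a sketch: the flat array and the 3*a + b expression stand in for what the compiler would generate):

int[] flat = new int[2 * 3]; // the six elements of a conceptual int[2,3], row-major
int a = 1, b = 2;
int elem = flat[3 * a + b]; // what "array[a,b]" would compile down to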

The advantage of storing the new fields for the number of dimensions and the size of each dimension at the end of the array, instead of the beginning, is compatibility: it requires fewer changes to existing JVMs, reflection APIs, etc. Even the traditional "full size" field can be kept, even if redundant. The object header may need a 1-bit flag to specify whether the array is traditional or multidimensional, i.e., whether the new dimension/dimension-size fields are present, so we don't need to add these fields to one-dimensional arrays.

We could even support compatibility between MD arrays and traditional arrays:

int[2,3] mdArray = new int[2,3];
int[] oldArray = (int[])mdArray;
oldArray[5] = 7; // changes mdArray[1,2]

The code above is easy and safe to implement. But:

int[] oldArray = new int[6];
int[2,3] mdArray = (int[2,3])oldArray; // not allowed!

This case is not allowed because it creates problems with my next request: conformant arrays.

"Conformant" (or "open") arrays have dimensions that may be dynamically inspected. Java's current arrays are conformant: aray.length gives the size. Multidimensional arrays must be conformant too:

int[3,4] mdArray;
mdArray.dims -> 2 // reads field: dims
mdArray.length -> 3 // reads field: dimsize0
mdArray[].length -> 4 // reads field: dimsize1

The above is just one possible suggestion for how to access the number of dimensions and the length of each. There are other possibilities, like:

mdArray.lengths[0] -> 3 // reads field: dimsize0
mdArray.lengths[1] -> 4 // reads field: dimsize1
mdArray.lengths.length -> 2 // reads field: dims

In this second version, I add only one meta-field, lengths -- less new stuff than the first proposal (which requires a meta-field 'dims' plus new syntax like mdArray[].length to fetch the length of the second dimension). Do whatever is easier. Maybe we could keep "mdArray.length" returning the "full size" (like 3*4=12 in this example), to make some single/multi-dimensional programming tricks easier, more compatible, or more efficient, like:

int[] data = new int[mdArray.length]; // new 12-elem array
System.arraycopy(mdArray, 0, data, 0, data.length);

Finally, support for array sizes in the static type system is another very important improvement. Example:

double[4,4] matmult (double[4,4] a, double[4,4] b);

The method above implements the matrix multiplication algorithm, optimized for 4x4 matrices, which are wildly popular in 2D & 3D graphics code, for example. With the specific dimensions given in the parameter and return types, the JVM doesn't need to do ANY bounds checking for constant indexes (the matmult algorithm above can be implemented using only constant indexes, or if you prefer, with two 'for' loops ranging over 0..3, so the compiler can still prove statically that all indexes are valid). The index check would be required only at a method call or at assignment of the return value, and this check could usually be performed at source compilation time. The single scenario that requires dynamic checks is when using the next feature:

double[,] matmult (double[,] a, double[,] b);

The method above performs multiplication of matrices of any size. The code will use the dimension-specific length fields, so it will do loops like

for (int x = 0; x < a.lengths[0]; ++x) {
    // operate on a[x], etc.
}

...so the optimizer can remove bounds checks easily, as [0, a.lengths[0]) is precisely the valid range for a[N], and so on. It's only slightly less efficient than a version tuned for a specific size, because we need loops, but we don't need any bounds checks, nor expensive optimizations to remove bounds checks (loop unrolling, moving the checks to the method header, loop versioning so that an eventual exception happens exactly when required by the JLS, etc.).
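
For comparison, this is the pattern we have to write today against a hand-flattened array (a sketch; rows and cols are hypothetical parameters, and a JIT can only drop the checks if it can see that a.length == rows * cols):

// Sum all elements of a conceptual rows x cols matrix stored flat, row-major.
static double sum(double[] a, int rows, int cols) {
    double s = 0.0;
    for (int x = 0; x < rows; ++x) {
        for (int y = 0; y < cols; ++y) {
            s += a[x * cols + y]; // in [0, rows*cols), given a.length == rows * cols
        }
    }
    return s;
}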

This feature can get even better:

double[A,C] matmult (double[A,B] a, double[B,C] b);

In this version, the identifiers A/B/C are constraints similar to type variables in generics. The method is constrained to multiply only pairs of arrays such that a's second-dim length is equal to b's first-dim length. The resulting array's dimensions are also constrained by the dimensions of the arguments. The implementation of this method can use the identifiers A, B, C (which map to 'int' values) in loop bounds, array creation expressions, etc., so we don't need any dynamic lookup of the array's dimension-length fields... and the *single* overhead of this implementation, compared to one optimized for a fixed size, is the looping structures.
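
To see what the A/B/C contract buys us, here is the closest equivalent in today's Java (a sketch with made-up parameter names; one up-front conformance check stands in for the per-element bounds checks that the constrained version would eliminate statically):

// Multiply a (aRows x aCols) by b (bRows x bCols), both stored flat, row-major.
static double[] matmult(double[] a, int aRows, int aCols,
                        double[] b, int bRows, int bCols) {
    if (aCols != bRows) {
        throw new IllegalArgumentException("inner dimensions differ");
    }
    double[] c = new double[aRows * bCols];
    for (int i = 0; i < aRows; ++i) {
        for (int j = 0; j < bCols; ++j) {
            double s = 0.0;
            for (int k = 0; k < aCols; ++k) {
                s += a[i * aCols + k] * b[k * bCols + j];
            }
            c[i * bCols + j] = s;
        }
    }
    return c;
}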

The language would allow all sensible combinations of conversions, like:

int[3,4] a;
int[,] b, c;
b = a; // Ok, no checks required
c = new int[9,1];
a = (int[3,4])b; // Ok (with dynamic bounds-check)
a = b; // compile-time error, requires cast
a = (int[3,4])c; // Compiles, but throws index exception

***

I think this proposal is hugely useful. Multidimensional arrays save space and time with their much simpler memory layout and initialization (no need for nested for loops just to initialize all dimensions). JIT compilers can generate optimal code with less effort (better loading time). The "flat" memory layout also produces much better performance due to CPU caches. Less overhead goes to GC too, because arrays-of-arrays contain a tree of pointers to the sub-arrays and are split into a large number of small objects; in comparison, a multidimensional array of any size doesn't contain any references and is made of one single big object, so in the GC, both the mark and copy steps are more efficient. Multidimensional arrays avoid virtually all the bounds checks, and all the null-pointer checks (for nested arrays), that currently make it difficult for JIT compilers to generate optimal code for arrays. Finally, full support for conformant arrays, including syntaxes for arrays with specified sizes (a[3,4]), unspecified sizes (a[,]) and constrained sizes (a[A,B]), will do for arrays what generics do for other code, with additional performance benefits.

The proposal can be implemented with a very backwards-compatible memory layout for MD arrays; no new keywords; very modest syntax extensions (in the same ballpark as most features from JSR-201); and the new arrays can be made hugely compatible with existing APIs (those expecting traditional arrays), even though some new APIs would be desirable to provide specific support for MD arrays.

And nothing here is rocket science; programming languages have been implementing multidimensional arrays for ages, even with features that go much further than this proposal (e.g. triangular or sparse arrays; group operations on entire arrays, rows or columns; and so on). In fact, the current top-notch JIT compilers, like HotSpot, have array-handling code that's more complex than everything I'm requesting here, because they implement optimizations like bounds-check removal and loop transformations that are very difficult to do with arrays-of-arrays, but much easier with MD arrays, and trivial with constrained MD arrays.

lucretius2
Offline
Joined: 2004-12-19
Points: 0

I'm strongly in favour of 'access overloading' as a controlled form of operator overloading - unlike C++'s. It's controlled because the operator methods are defined in a special interface (which would be in java.lang.*). So the interface's contract would be designed to prevent the abuses seen in C++ operator overloading: for instance, calling a.get(i) or (equivalently) a[i] would have to return the value n last passed to a.set(i,n) or (equivalently) a[i] = n.
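
Something along these lines, say (just a sketch; the interface name and shape are made up, and nothing like it exists in java.lang today):

// Hypothetical contract interface for access overloading.
interface IndexAccess<T> {
    T get(int i); // contract: returns the value most recently passed to set(i, value)
    void set(int i, T value);
}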

However Access/Operator overloading won't give the speedups that scientific computing needs, and that true rectangular arrays would (apparently) give.

So these 4 are independent but work well together:
* multidimensional arrays
* limited operator overloading
* lightweight value types
* arrays implement the List interface

laurapiersol
Offline
Joined: 2003-06-13
Points: 0

I'm not sure you can state as fact that 'access overloading' wouldn't give the speedup scientific computing desires. From what I can tell, 'access overloading' has the ability to give superior performance on multiprocessor systems where synchronization is an issue. Data could be stored as a single-dimensional array, indexed as [row * colCount + col]. How the math is handled most efficiently on the data would be up to the implementation. Sun could implement a native BLAS library if desired and have it as an add-on package like 3D is.
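
For instance, a wrapper along these lines (a sketch; DenseMatrix is a made-up class name):

// Stores a 2-D matrix in one flat array, indexed as [row * colCount + col].
class DenseMatrix {
    private final double[] data;
    private final int colCount;
    DenseMatrix(int rowCount, int colCount) {
        this.colCount = colCount;
        this.data = new double[rowCount * colCount];
    }
    double get(int row, int col) { return data[row * colCount + col]; }
    void set(int row, int col, double v) { data[row * colCount + col] = v; }
}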

In my humble opinion 'access overloading' is not the same as 'operator overloading'. There is no 'operation' (in the numeric sense) being performed during access overloading, and no side effects. 'Operator overloading' has many side effects, and I'm against it in practice even though in theory I think it would be nice.

Arrays shouldn't implement the 'List' interface; they should implement an 'Array' or 'GetSet' interface. 'List' should implement the 'Array' interface. The whole 'List' interface hierarchy seems a little off to me, but it is what it is. I do have great respect for the people who created it, but if I were to make an access hierarchy it would be more like:

interface Array<T> -> T get(int index); int size();
interface MutableArray<T> extends Array<T> -> void set(int index, T value);
interface List<T> extends MutableArray<T> -> void add(T value); T remove(int index);

Lightweight value objects would be great from a performance point of view (today), but they don't make Java clearer or easier to program. They just add complexity that everyone will have to deal with. It would be great if the JVM could handle the problem internally. Java bytecode used to be 100 times slower than native code 5 years ago, but now it's on par with C/C++ according to some sites. Don't we have hope that one day soon the 'new' operator will be as quick and efficient as creating something on the stack? Maybe not, but I'm sure hoping so. This is one case where I believe Sun is going to opt for simplicity and consistency over performance.

opinali
Offline
Joined: 2003-06-17
Points: 0

The wrapping solution works, of course, and JSR-83 was based on this idea: a large library of matrix classes with every feature you may need. But there are problems.

First, code bloat. You need a new array class for every basic type (all primitives + Object). With generics you can avoid a new array class for every element class (e.g. MultiArray2<T>), but under the current generics spec there is a performance hit, because element reads will be compiled with typecasts. And of course, you need a new array class for each number of dimensions: MultiArray2, MultiArray3, etc. If we multiply both factors, even supporting only up to 4 dimensions, that's a huge number of classes, even if they're all simple.
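
To show what I mean by the typecast hit, here is roughly what such a generic class looks like (a sketch; this MultiArray2 is a made-up illustration, not the JSR-83 class):

// Under erasure the storage is an Object[], so every get() compiles to a checkcast.
class MultiArray2<T> {
    private final Object[] data;
    private final int cols;
    MultiArray2(int rows, int cols) {
        this.cols = cols;
        this.data = new Object[rows * cols];
    }
    @SuppressWarnings("unchecked")
    T get(int r, int c) { return (T) data[r * cols + c]; }
    void set(int r, int c, T v) { data[r * cols + c] = v; }
}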

Second, standardization. If this library is not included in the JRE, different apps and software components will use different multidimensional array libraries (and most will just use jagged arrays), so it's not a very seamless programming environment. Example: Java3D contains matrix classes, in the package javax.vecmath. But does ANYBODY else use the same classes for matrices? Nope - not even JOGL, which is also a 3D graphics API. Not to mention numerical APIs and others having no relation to 3D graphics. Of course javax.vecmath's matrices are not general-purpose matrices, but you get the point.

Third, ease of use. It's already bad enough that I cannot write "a + b" for BigDecimal, BigInteger and other user-defined types -- whether this should be fixed is a completely different RFE, and I'm also a survivor of C++ and afraid of the complexity that operator overloading can add to the language... but at the very least, don't make it worse, like having to write "a.set(2,3, b.get(4,1))" instead of "a[2,3] = b[4,1]" or even instead of "a[2][3] = b[4][1]". We don't need this kind of step backwards.

Fourth, performance. With "native" matrices, the semantics are cast in stone and well known by the JVM, so additional optimizations are possible. OTOH, this benefit could also be applied to a library of matrix classes made core J2SE APIs (IIRC IBM implemented such optimizations in their research JVM implementing Ninja/JSR-83). But since you need to enhance the JVM to obtain top performance anyway, why not do a decent job and implement true multidimensional matrices?

mgrev
Offline
Joined: 2003-08-12
Points: 0

Exactly my thought as well,

arr[1, 2]
is same as
arr[1 + 2 * width]

However, since adding it would only be syntactic sugar, it would be easy to add and wouldn't break anything.

Whatever.. :)

Cheers,
Mikael

sionep
Offline
Joined: 2004-03-04
Points: 0

In terms of scientific computing, yes, we do need them.

opinali
Offline
Joined: 2003-06-17
Points: 0

> I feel completely sure that less than 1% of Java
> applications need multi-dimensional array support
> beyond the int[][]..[] support that already exists.

My point is that 99% of nontrivial Java apps use square arrays-of-arrays, and we seem to agree up to this point; now ALL these apps (if updated to use MD arrays) would become simpler, run faster and use less memory. And as others noticed, even if you don't use arrays of any kind, or use only a few small arrays so the benefits wouldn't appear on your radar, you may depend on libraries that use large/complex arrays, and you would benefit indirectly.

laurapiersol
Offline
Joined: 2003-06-13
Points: 0

I'd like to suggest the easiest solution, which I've been recommending since JDK 1.0 and which has NO (that's the big ZERO) side effects: ACCESS OVERLOADING. Let [i, j, k, ...] map to get(i, j, k, ...) and set(i, j, k, ..., value). No change to the JVM, and much more useful for scientific computing.

Matrix and Vector classes come in many different forms (dense, sparse, identity, etc.), so there would be no way to embed all the different forms in the language or bytecode. Just add ACCESS OVERLOADING and, if REALLY ambitious, add some interfaces like:

interface Matrix<T> {
    int size(int dimension);
    T get(int... coordinate);
    void set(T value, int... coordinate);
}

The List interface could even extend this interface. And maybe, for efficiency, something like:

interface DMatrix {
    int size(int dimension);
    double get(int... coordinate);
    void set(double value, int... coordinate);
}

No, I'm not advocating that particular interface, just saying that a simple interface that ONLY does access overloading could be created. It shouldn't be a huge interface that tries to do everything.

Now we could do this:

DMatrix a = new DenseDMatrix(2, 2, 2); // 3-dimensional
double value = a[1, 1, 1];
a[0, 0, 0] = value;
int dim2 = a.size(1);
....
for (int i = 0; i < a.size(0); ++i) {
    for (int j = 0; j < a.size(1); ++j) {
        for (int k = 0; k < a.size(2); ++k) {
            a[i, j, k] = 1.0;
        }
    }
}

Very simple, and it solves most of the problem. Lightweight VALUE OBJECTS should really be handled separately.

markf
Offline
Joined: 2005-01-20
Points: 0

Put me down for some of those multidimensional arrays, and lightweight objects too. Sounds good to me.

lucretius2
Offline
Joined: 2004-12-19
Points: 0

I'm all in favour of multidimensional arrays, because adding a new type of array should enable other improvements which can't quite be done with existing arrays:

1. Make arrays fully-fledged Collections (and integrate them with generics):
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4654312

2. New arrays could support one of the various proposals for lightweight objects. If you have a huge array of Points, say, it's much more efficient for each Point to be stored 'inline' in the array instead of as a separate object referred to by the array. This fits well with schemes which disallow null references to lightweight objects.

IMHO a new type of array would have enough new benefits that existing arrays could be semi-deprecated, though obviously they should easily interoperate with the new ones.

lucretius2
Offline
Joined: 2004-12-19
Points: 0

BTW, the classic example showing the need for multidimensional arrays and lightweight objects in combination is a matrix of complex numbers. If these were stored as a block of memory containing the Complex fields directly (not references to separate objects), it would be very efficient.

Java at the moment would handle these very inefficiently:

* Existing arrays allow the possibility of aliasing: in an array-of-arrays, two subarrays might be the same object, and the compiler must assume they could be. This creates CPU bottlenecks.

* Wasted memory, because you need both an array of references and a separate object (with its own header) for each reference to refer to.

* Wasted time because you need to create, and later garbage collect, each separate Complex object.
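
Today the closest you can get is to interleave the fields by hand in one flat array (a sketch; ComplexMatrix is a made-up class name):

// Stores complex numbers inline as [re0, im0, re1, im1, ...], row-major.
class ComplexMatrix {
    private final double[] data;
    private final int cols;
    ComplexMatrix(int rows, int cols) {
        this.cols = cols;
        this.data = new double[2 * rows * cols];
    }
    double re(int row, int col) { return data[2 * (row * cols + col)]; }
    double im(int row, int col) { return data[2 * (row * cols + col) + 1]; }
}

This avoids the per-element objects, headers and GC work, but only by giving up the Complex type entirely -- which is exactly what lightweight objects would fix.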

mayhem
Offline
Joined: 2003-06-11
Points: 0

Why can't you just wrap a one-dimensional array in a class and provide methods to make it look like it's multidimensional?

afreije
Offline
Joined: 2004-10-14
Points: 0

I second this wrapping idea. That way it is very easy to do lightweight slicing and sectioning.
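
For example, a row slice can just be a view over the flat array (a sketch; RowView is a made-up class, and no elements are copied):

// A lightweight slice: shares the backing array instead of copying elements.
class RowView {
    private final double[] data;
    private final int offset, length;
    RowView(double[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }
    double get(int i) { return data[offset + i]; }
    int length() { return length; }
}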

keithkml
Offline
Joined: 2003-06-10
Points: 0

You say we don't need a full-blown FORTRAN clone because "less than 1% of Java apps need this kind of functionality."

I feel completely sure that less than 1% of Java applications need multi-dimensional array support beyond the int[][]..[] support that already exists. I know I've never written a Java application that needed this.

jarouch
Offline
Joined: 2004-03-04
Points: 0

Maybe you don't use multidimensional arrays directly, but things like graphics are full of them. So if this feature brings faster access and smaller matrices, it would lead to applications with a smaller memory footprint and better GUIs for everyone, for example.