Skip to main content

How about String, StringBuilder, etc implements Iterable

12 replies [Last post]
hnakamur
Offline
Joined: 2003-06-21

How about String, StringBuilder and StringBuffer implements Iterable to iterate over code points not chars?

To iterate over chars in a String, we used to write:

String str = "some string";
for (int i = 0; i < str.length(); i++) {
char c = str.charAt(i);
// Do something with c.
}

However these days we must iterator over Unicode 4.0 code points
to support surrogate pairs.

for (int i = 0; i < str.length(); i = str.offsetByCodePoints(i, 1)) {
int codePoint = str.codePointAt(i);
// Do something with codePoint.
}

But I think it is far more better if we can write like this:

for (int codePoint : str) {
// Do something with codePoint.
}

Calling remove() on a iterator returned by String will throw
UnsupportedOperationException. For StringBuilder and StringBuffer you can
write something like this:

@Test
public void testStringBuilderCodePointIteratorRemove() {
StringBuilder builder = new StringBuilder("\u3042\uD840\uDC0B\u3042");

int i = 0;
for (Iterator it = builder.iterator(); it.hasNext(); ) {
it.next();
if (i == 1) {
it.remove();
}
i++;
}

List codePointList = new ArrayList();

for (int codePoint : builder) {
codePointList.add(codePoint);
}

List expected = Arrays.asList(0x3042, 0x3042);
assertEquals(expected, codePointList);
}

If this style becomes popular, it means codes supporting surrogate pairs
becomes popular, which is good. Without this addition, most people tend
to write old style codes that does not support surrogate pairs,
which is not good. So I think this addition is important.
What do you think?

Reply viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
tackline
Offline
Joined: 2003-06-19

That's not a big win over:

[code]for (int cp : str.toCodePointArray()) {
/* bog standard loop body here */
}[/code]

or

[code]for (int cp : toCodePointArray(charSeq)) {
/* bog standard loop body here */
}[/code]

I think it is a poor idea, in general, to see closures (or concise anonymous inner classes) as a replacement for collection views and iterators. Iterators provide a handy implementation technique for algorithms. Collection views prevent an explosion of strangely named algorithm methods.

hnakamur
Offline
Joined: 2003-06-21

With toCodePointArray(), my example code will be:

StringBuilder sb = new StringBuilder();
for (int cp : str.toCodePointArray()) {
if (is_to_delete(cp)) {
continue;
}
sb.appendCodePoint(cp);
}
String result = sb.toString();

I think the functionality is the same here.

However the downside is allocating an array of code points of the original
String instance. If the original String length is long, I think
"for eachCodePoint" solution is better regarding to the memory consumption.

I think it might be natural to have toCodePointArray() alongside toCharArray(),
but we should keep APIs of standard classes minimum. The eachCodePoint closure
can be user defined, so I think it will do.

tackline
Offline
Joined: 2003-06-19

For a long String there would be a long int[], however:
* Your StringBuilder is likely to allocate about the same (in addition to StringBuilder.toString), and it will do that with a number of allocations and require copying.
* Strings aren't really suited to long pieces of text anyway.
* As well as the closure object, closures introduce a new class and hence add allocations to the permanent generation (which is expensive to maintain).
* If heavily used, a [i]sufficiently smart[/i] JVM would put the array on the stack.
* Closures need to be warm to be efficient. If you have lots of them, you will find cold code (in total) compromising performance (as in Swing, apparently).
* Most Strings are not long.

hnakamur
Offline
Joined: 2003-06-21

I agree with all points you made. So, the conclusion is adding toCodePointArray() to String, StringBuilder and StringBuffer, right?

What do we do now to add this method to JDK?

I read the page "Collaborating with Sun on JDK 7", and I realized I need to agree with JAVA RESEARCH LICENSE. But I am a application programmer, neither a university student nor a researcher. Can someone here add this method instead of me?

tackline
Offline
Joined: 2003-06-19

"Researcher" isn't in this case a job title. It's just something that describes what you do. Pretty much anyone can agree to the license (I guess subject to contracts you may be under (with your employer for instance) and local laws).

I am not a lawyer.

First step is to put a bug on the Bug Parade. This already exists.

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5063163

It has no votes, but a related bug currently has three.

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5003547

I have a registered agreement, so I can post a patch if you like.

hnakamur
Offline
Joined: 2003-06-21

Oh, I hadn't check the Bug Parade. I should've done it. Thanks for the links.

After reading, I had a sad feeling because I noticed those "bugs" are submitted on 2004, and states are still "In progress, request for enhancement" now. I don't think I can come up with any more enhancement than that.

As to CodePoint wrapper class, adding String.codePointCount() and changing String.length() to String.charCount(), I think it is too aggressive and also too late now (especially the last one). Adding toCodePointArray() does not break other APIs, so it will not bother anyone. I think this is a way to go.

Yes, please post a patch. Thank you very much for your time and kindness.

tackline
Offline
Joined: 2003-06-19

I have posted a patch. If you have signed and returned the relevant agreement, the relevant jdk-collaboration thread is:

https://jdk-collaboration.dev.java.net/servlets/ProjectForumMessageView?...

hnakamur
Offline
Joined: 2003-06-21

I have read the thread. Thank you for posting the patch. I think it is a good patch. I am impressed.

Also I have added my vote to the bug in Bug Parades. I don't know it matters much though.

Is there anything we can do for the patch to be integrated? It is OK just waiting?

tackline
Offline
Joined: 2003-06-19

Why use direct interface inheritance? Would it not be less confusing to have a method that returned a read-only List of code points?

In any case, the boxing overheads could be high. Might as well just add a toCodePointArray method to go with toCharArray.

hnakamur
Offline
Joined: 2003-06-21

If you returned a read-only List of code points, you could not call it.remove() to remove a code point from a StringBuilder or StringBuffer.

I understand the boxing overheads could be high. For performance critical code, you can write using codePointAt() and offsetByCodePoints(). The code below illustrates removing a surrogate code point from a StringBuilder.

@Test
public void testJavaLangStringBuilderSupportRemoveSurrogate() {
StringBuilder builder = new StringBuilder("\u3042\uD840\uDC0B\u3042");

for (int i = 0; i < builder.length(); i = builder.offsetByCodePoints(i, 1)) {
int codePoint = builder.codePointAt(i);
if (codePoint == 0x2000B) {
// remove the current code point from StringBuilder
builder.delete(i, builder.offsetByCodePoints(i, 1));
}
}

String expected = "\u3042\u3042";
assertEquals(expected, builder.toString());
}

I think it is easier if I could write like this:

@Test
public void testStringBuilderCodePointIteratorRemove() {
StringBuilder builder = new StringBuilder("\u3042\uD840\uDC0B\u3042");

for (Iterator it = builder.iterator(); it.hasNext(); ) {
int codePoint = it.next();
if (codePoint == 0x2000B) {
it.remove();
}
}

String expected = "\u3042\u3042";
assertEquals(expected, builder.toString());
}

The point here is whether it is easy for novice programmers. Yes, the latter code has overheads of creating an iterator instance and boxing. A veteran programmer can write the former code to avoid this.

Maybe someday, performance penalty will be negligible by javac and JIT optimization.

I think adding toCodePointArray might not help much in the above case because it will return a copied array so you cannot remove a code point from the original StringBuilder instance.

gafter
Offline
Joined: 2003-08-29

Wouldn't you rather have closures in the language and an API method that allows you to conveniently iterate over the code points in any CharSequence:

for eachCodePoint(int cp : charSeq) {
// loop body here, using the variable cp
}

hnakamur
Offline
Joined: 2003-06-21

I think your example is fine to iterate, but you cannot remove a code point from the CharSequence during iteration.

But maybe it is not a problem after all. You can create a StringBuilder instance for the new result.

StringBuilder sb = new StringBuilder();
for eachCodePoint(int cp : charSeq) {
if (is_to_delete(cp)) {
continue;
}
sb.appendCodePoint(cp);
}
String result = sb.toString();

Now I am convinced to withdraw my proposal. Thanks for your replies.