A lot of warning messages along with high CPU utilization on RdvRelay node

taoenator
Joined: 2007-10-25

On a Rendezvous/Relay node running JXSE, I see a lot of these warnings and stack traces in the console log. The problem is not the messages per se, but that whenever they appear, CPU usage spikes. Can someone explain why I am getting these warnings and whether there is anything I can do to avoid them? Also, is there a way to get rid of bad entries in the index?

net.jxta.impl.cm.SrdiIndex readRecord
WARNING: Exception while reading Entry
java.io.UTFDataFormatException: malformed input around byte 32
at java.io.DataInputStream.readUTF(Unknown Source)
at java.io.DataInputStream.readUTF(Unknown Source)
at net.jxta.impl.cm.SrdiIndex.readRecord(SrdiIndex.java:659)
at net.jxta.impl.cm.SrdiIndex$GcCallback.indexInfo(SrdiIndex.java:533)
at net.jxta.impl.xindice.core.filer.BTree$BTreeNode.query(BTree.java:1043)
at net.jxta.impl.xindice.core.filer.BTree$BTreeNode.query(BTree.java:1037)
at net.jxta.impl.xindice.core.filer.BTree$BTreeNode.query(BTree.java:1037)
at net.jxta.impl.xindice.core.filer.BTree.query(BTree.java:289)
at net.jxta.impl.cm.SrdiIndex.garbageCollect(SrdiIndex.java:722)
at net.jxta.impl.cm.SrdiIndex.run(SrdiIndex.java:824)
at java.lang.Thread.run(Unknown Source)

hamada
Joined: 2003-06-12

I would be interested in a consistency test. Can you run the consistency check on the database? Look for test/test/CheckSrdi.java under the source tree.

taoenator
Joined: 2007-10-25

I searched and cannot find CheckSrdi.java. Are you sure it's part of the standard distribution or svn trunk?

hamada
Joined: 2003-06-12

Regarding ephemeral pipe leaks in JxtaSockets and JxtaBidiPipes: I committed a patch this morning, after our discussion last night, which should address the pipe entry leaks.

Please try out the patch and let us know how it goes.

taoenator
Joined: 2007-10-25

I tried the patch. Unfortunately, it doesn't fix our particular problem - we are still seeing stale SRDI entries for each attempted JxtaSocket connection that ends in a SocketTimeoutException. Before the SocketTimeoutException is thrown, localEphemeralPipeIn needs to be unregistered from the pipe registrar so the corresponding SRDI entries are removed (i.e., localEphemeralPipeIn.close()).

taoenator
Joined: 2007-10-25

I think I found the root cause but I don't have a quick solution. Please advise if you know what I should do.

The root of the problem is that every time a JxtaSocket is created, a new Ephemeral Pipe Advertisement is created:

public void connect(PeerGroup group, PeerID peerid, PipeAdvertisement pipeAdv, int timeout) {
    ...
    this.localEphemeralPipeAdv = newEphemeralPipeAdv(pipeAdv);
    ...
}

The ephemeral pipe advertisement - though the field name says "local" - is somehow pushed to the Rdv node. Because an ephemeral pipe advertisement has a random ID, a new SRDI entry is created on the Rdv node every time we establish a new JxtaSocket, even though the pipeAdv we pass in is constant. Since we create a lot of short-lived JxtaSockets for quick transactions, we end up with a ton of SRDI entries. With 10-15 clients, the pipeResolverSrdi.tbl in the srdi directory grows to 40MB+.

Questions:
* Are we doing anything wrong? Are we using JxtaSockets in a way that's not intended?
* What can we do to alleviate this situation without a lot of code change?
* Should these Ephemeral Pipe Advertisements be pushed onto the Rdv node? What can we do to minimize that?
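Until a fix lands, one application-level mitigation for the build-up is to reuse a small number of long-lived connections per peer instead of opening a short-lived JxtaSocket per transaction, since each connect creates a fresh ephemeral pipe advertisement. Below is a minimal sketch of such a keyed connection cache in plain Java; the JXTA-specific wiring (the JxtaSocket constructor call, the peer ID as the key) is abstracted behind a factory and is an assumption left to the caller:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Reuses one long-lived connection per destination instead of opening a
 * short-lived one per transaction. With JxtaSocket, fewer connect() calls
 * would mean fewer ephemeral pipe advertisements pushed to the rendezvous.
 */
public class ConnectionCache<C extends Closeable> {

    /** Creates a real connection for a destination key (e.g. a peer ID string). */
    public interface Factory<T> {
        T open(String key) throws IOException;
    }

    private final Map<String, C> cache = new ConcurrentHashMap<>();
    private final Factory<C> factory;

    public ConnectionCache(Factory<C> factory) {
        this.factory = factory;
    }

    /** Returns the cached connection for the key, opening it only once. */
    public synchronized C get(String key) throws IOException {
        C conn = cache.get(key);
        if (conn == null) {
            conn = factory.open(key); // one ephemeral pipe, reused thereafter
            cache.put(key, conn);
        }
        return conn;
    }

    /** Closes and forgets every cached connection. */
    public synchronized void closeAll() throws IOException {
        for (C conn : cache.values()) {
            conn.close();
        }
        cache.clear();
    }
}
```

With JxtaSocket, the factory's open() would perform the usual connect, keyed by the target peer ID; whether reuse is viable of course depends on the application's transaction model.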

thenetworker
Joined: 2003-06-13

This is a known issue. I published the problems and the solutions here https://jxta.dev.java.net/servlets/ReadMsg?listName=dev&msgNo=360

The solutions are tracked as issue 103. You can ask JXTA product management about the timeline for the targeted milestone.

To answer your questions directly, I don't think you are doing anything wrong based on your description.

You cannot alleviate the situation at the application level without compromising your objective of using numerous sockets, and thus numerous advs. Fortunately, you should be able to use the (unofficial beta) code from issue 103, most likely next week.

Maybe it is not a good idea to push the ephemeral advs to the RDV, but the problem has been solved in another way: the new design in issue 103 will allow you to do RDV traffic shaping, so a massive adv push from a large number of peers will never drive your CPU consumption through the roof.

bbwarm
Joined: 2008-03-21

Thank you for the very helpful reply. I'll look into using a build that includes a fix for issue 103 in the near future.

We have made more progress in finding the root cause of this problem, and we believe it is a bug in the JXTA code. The observation is that we get residual SRDI entries for the ephemeral pipe advs in both the local pipeResolverSrdi.tbl and the remote (i.e., RDV) pipeResolverSrdi.tbl. These entries do not go away as we expect them to. Because we tend to have long-running clients, we get a huge build-up of these ephemeral pipe advs on both the local and RDV nodes.

It turns out that when a "Connection (resolution) timeout" happens, a SocketTimeoutException is thrown but previously established resources, such as those set up in bind(), aren't cleaned up. In JxtaSocket.java:

public void connect(PeerGroup group, PeerID peerid, PipeAdvertisement pipeAdv, int timeout) throws IOException {
    ...
    bind();
    ...
    if (connectOutpipe == null) {
        throw new SocketTimeoutException("Connection (resolution) timeout");
    }
    ...
}

bind() actually registers the ephemeral pipe and pushes the SRDI record to the RDV node. In the above case, it appears that an unbind() should happen right before throwing the exception. Of course, unbind() makes assumptions about established resources, so you'll get NullPointerExceptions if you use it as is.

At the least, the following should be done:

if (connectOutpipe == null) {
    if (null != localEphemeralPipeIn) {
        localEphemeralPipeIn.close();
    }
    throw new SocketTimeoutException("Connection (resolution) timeout");
}

From looking at the code, it seems that slightly later in the same block, a cleanup is also needed when a "Connection timeout (connect)" happens:

if (!isConnected()) {
    if (null != localEphemeralPipeIn) {
        localEphemeralPipeIn.close();
    }
    throw new SocketTimeoutException("Connection timeout (connect)");
}

Does this make sense? If so, are the above changes sufficient as a temporary fix for us until the proper fix is put in?

If this is not a known bug, I will report it. Would someone please send me a pointer?

Theoretically, the local SRDI record shouldn't even be pushed to the RDV if a connection timeout happens.
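As an aside, both guards above repeat the same null-checked close pattern. A tiny helper (a generic sketch in plain Java, not existing JXTA code) would make that cleanup uniform and keep a failing close() from masking the SocketTimeoutException:

```java
import java.io.Closeable;
import java.io.IOException;

/** Null-safe close for cleanup on an already-failing code path. */
public final class CloseUtil {

    private CloseUtil() {
    }

    /**
     * Closes the resource if it was ever created, swallowing any close()
     * failure so the caller's original exception stays the visible error.
     */
    public static void closeQuietly(Closeable resource) {
        if (resource == null) {
            return; // resource was never bound, nothing to release
        }
        try {
            resource.close();
        } catch (IOException ignored) {
            // best-effort cleanup; the timeout is the error worth reporting
        }
    }
}
```

Each guard would then collapse to a single closeQuietly(localEphemeralPipeIn) call before the throw, assuming the pipe type implements (or can be adapted to) Closeable.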

thenetworker
Joined: 2003-06-13

"...we believe it is bug in the Jxta code. The observation is that we get residual srdi entries ..."

Actually, it is a bug. I published it as issue 104 ( https://jxta-jxse.dev.java.net/issues/show_bug.cgi?id=104 ). The garbage collection thread is not running at all. That is on the RDV.

In fact, there is another related garbage collection bug on the EDGE.

So, in short, the entire garbage collection in JXTA is not working right now. It should be fixed in the next release. Most users won't notice the bugs because whenever the RDV restarts, all the data is destroyed and rebuilt, and searches on the EDGE simply skip the garbage data even if it is still in the cache.

Your application appears to use more advs than others. As a result, the ugliness of the bugs manifests itself.

taoenator
Joined: 2007-10-25

Thanks again for the answer. So, does this mean we should restart the RDV node once in a while, just to keep the srdi cache lean and efficient?

thenetworker
Joined: 2003-06-13

No, a restart won't help you. In fact, it will do the opposite. As I described here https://jxta.dev.java.net/servlets/ReadMsg?listName=dev&msgNo=360, the current RDV implementation triggers an SRDI push from all reachable edge peers on restart, thus inviting a self-inflicted DoS attack. That is an architectural design flaw.

As far as I can see, you have two options for your specific situation:

- Avoid the RDV for now, if you can. Do you have to use the RDV for your application?

- Start using the fix in issue 103 for future development.

hamada
Joined: 2003-06-12

Sounds like you have a corrupted database. Remove the cm directory and run file system checks. This could also be caused by running multiple instances within the same configuration (JXTA_HOME).
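If clearing the cache becomes a recurring chore, it can be done programmatically before the node starts. A minimal sketch follows; the exact path of the cache directory under JXTA_HOME is an assumption based on the directory names mentioned in this thread:

```java
import java.io.File;
import java.io.IOException;

/** Deletes a JXTA cache directory (e.g. JXTA_HOME/cm) before startup. */
public class CacheCleaner {

    /** Recursively removes the directory and everything under it. */
    public static void deleteRecursively(File dir) throws IOException {
        if (!dir.exists()) {
            return; // nothing to clean
        }
        File[] children = dir.listFiles();
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child); // depth-first: files before parents
            }
        }
        if (!dir.delete()) {
            throw new IOException("Could not delete " + dir);
        }
    }
}
```

A caller would invoke deleteRecursively(new File(jxtaHome, "cm")) before the NetworkManager starts, where jxtaHome matches the home directory the node was configured with.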

taoenator
Joined: 2007-10-25

I am still unable to resolve this problem. We are not running multiple instances; we are running a single instance which acts as both a Rdv and a Relay on this server. There are about 20-30 Edge nodes that connect to this RdvRelay node. The Edge nodes do move around across different networks.

Regardless of what I do, these warning messages keep coming back. Here are some more observations: whenever I clean the cm directory and restart the RdvRelay node, the pipeResolverSrdi.tbl file (and the associated offset file) quickly grows to 20-30MB and keeps growing. At one point, I found them at 150MB each.

Is this normal? Are we possibly misconfiguring settings on the RdvRelay node or the Edge nodes?

bondolo
Joined: 2003-06-11

> Regardless what I do, these warning messages keep
> coming back. Here are more observations. Whenever I
> cleaned the cm directory and restart the RdvRelay
> node, the pipeResolverSrdi.tbl file ( and the
> associated offset file) quickly get up to 20-30MBs in
> size and keep growing. At one point, I found them at
> 150MB each.
>
> Is this normal? Are we possibly mis-configuring
> settings on the RdvRelay node or the Edge nodes?

It is not normal and not likely caused by misconfiguration. I have only ever seen the files grow above 20MB when they are corrupted. One possibility: are you making sure that the Java runtime has sufficient file handles, using ulimit (on Solaris or Linux)? We have seen unusual failures when the JVM runs out of file handles.
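On Sun/OpenJDK Unix runtimes, you can also watch descriptor usage from inside the JVM via the com.sun.management extension bean. A small sketch (availability of the Unix bean depends on the JVM and platform):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/** Reports how close the JVM is to its file-descriptor limit (Unix JVMs only). */
public class FdCheck {

    /**
     * Returns {open, max} descriptor counts, or {-1, -1} when the runtime
     * does not expose them (non-Unix platforms, restricted JVMs).
     */
    public static long[] descriptorCounts() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] { unix.getOpenFileDescriptorCount(),
                                unix.getMaxFileDescriptorCount() };
        }
        return new long[] { -1, -1 };
    }
}
```

Logging descriptorCounts() periodically on the RDV would show whether the process is creeping toward its ulimit before the corruption appears.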

taoenator
Joined: 2007-10-25

We tried setting ulimit to a very high value and tuning other OS parameters, but none of it seems to help. The ulimit the RDV process is actually running with appears to be relatively small (about 600).

This particular server has a public IP address that is different from its private IP address, and we have to configure JXTA to use the public one. Could that be a possible cause?

Anyway, here's what our RDV program distills down to. Do you see any issues?

==================
NetworkManager manager = new NetworkManager(NetworkManager.ConfigMode.RENDEZVOUS_RELAY, NAME);
NetworkConfigurator config = manager.getConfigurator();
config.setHome(DIRECTORY);

if (config.exists()) {
    config.load();
} else {
    config.setName(NAME);
    config.setPrincipal(PRINCIPAL);
    config.setPassword(PASSWORD);
    config.setDescription(CONFIG_DESC);
    config.setTcpEnabled(true);
    config.setTcpIncoming(true);
    config.setTcpOutgoing(true);
    config.setUseMulticast(false);
}

config.clearRendezvousSeeds();
config.clearRendezvousSeedingURIs();
config.clearRelaySeeds();
config.clearRelaySeedingURIs();
config.addRdvSeedingURI(RDV_SEEDING_URI);
config.addRelaySeedingURI(RELAY_SEEDING_URI);

// We're using NAT
config.setTcpPublicAddress(PUBLIC_IP_ADDR + ":" + 9701, true);

config.setTcpStartPort(-1);
config.setTcpEndPort(-1);

config.setHttpPublicAddress(PUBLIC_IP_ADDR + ":" + 80, true);

config.setHttpPort(80);

config.setHttpIncoming(true);
config.setHttpOutgoing(true);

config.save();

...
...

final RendezVousService rendezvousService = netPeerGroup.getRendezVousService();
rendezvousService.startRendezVous();
======================

This issue is blocking our progress, and we are at a loss as to how to deal with it. Your help and guidance are greatly appreciated.