
reliability, scalability framework

24 replies [Last post]
jfkavaka
Offline
Joined: 2005-01-13

Hi all,

I am trying to get a first grasp on the SLEE spec and the mobicents implementation. (By the way I would like to congratulate all the development team for this achievement!)

A few questions regarding reliability and scalability:
Are there any plans to develop a fault tolerance and scalability framework for mobicents, such as OpenCloud's Savanna?
Can JBoss clustering be used with mobicents?
Can SBBs in different SLEEs (in different JVMs/machines) interact in an EJB-like fashion? How?

Thank you,
Placido

buzzheavyyear
Offline
Joined: 2005-06-18

Just to add a request: please don't hard-wire the transport used for session/state/fault-tolerance/HA traffic between SLEE nodes (i.e. to something like jgroups). I've come across quite a number of instances where we were unable to implement fault tolerance between servers (jboss etc.) because the transports available weren't standard IP networks.

The kind of interface I have in mind is something similar to the hedera library, which is used by the clustered jdbc sequoia project (https://forge.continuent.org/projects/sequoia/). The default sequoia transport is jgroups.

Nick

ivelin
Offline
Joined: 2003-07-13

Nick,

Please expand on your idea to introduce a caching abstraction layer to allow multiple implementations to be plugged in as needed.

Ivelin

buzzheavyyear
Offline
Joined: 2005-06-18

Ivelin,

Best thing to do is to have a quick look at a short description of hedera:

http://hedera.continuent.org/HomePage

The hedera lib consists of 28 classes (including the jgroups implementation). It essentially provides a messaging framework between nodes, with factories for plugin transports. If a new transport needs to be introduced, then only four new classes need to be written (mostly boilerplate, cut and paste). I've already done three different transports for various clients and each one took less than a day to implement and test.
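
To make the "factories for plugin transports" idea concrete, here is a minimal sketch in Java. It is not the hedera API: `GroupTransport`, `InMemoryTransport`, `TransportRegistry` and `TransportDemo` are all hypothetical names, and the in-memory transport only stands in for a real one (jgroups or anything else).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical abstraction (NOT the hedera API): what the upper layers see.
interface GroupTransport {
    void join(String nodeId, Consumer<byte[]> onMessage);
    void broadcast(String fromNodeId, byte[] payload);
}

// In-memory stand-in for a real transport implementation.
final class InMemoryTransport implements GroupTransport {
    private final Map<String, Consumer<byte[]>> members = new ConcurrentHashMap<>();

    @Override
    public void join(String nodeId, Consumer<byte[]> onMessage) {
        members.put(nodeId, onMessage);
    }

    @Override
    public void broadcast(String fromNodeId, byte[] payload) {
        // Deliver to every member except the sender.
        members.forEach((id, handler) -> {
            if (!id.equals(fromNodeId)) {
                handler.accept(payload);
            }
        });
    }
}

// Hypothetical registry: maps a configured name to a transport factory.
final class TransportRegistry {
    private static final Map<String, Supplier<GroupTransport>> FACTORIES =
            new ConcurrentHashMap<>();

    static void register(String name, Supplier<GroupTransport> factory) {
        FACTORIES.put(name, factory);
    }

    static GroupTransport create(String name) {
        Supplier<GroupTransport> factory = FACTORIES.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("No transport registered as: " + name);
        }
        return factory.get();
    }
}

final class TransportDemo {
    public static void main(String[] args) {
        TransportRegistry.register("in-memory", InMemoryTransport::new);
        GroupTransport transport = TransportRegistry.create("in-memory");

        List<String> received = new ArrayList<>();
        transport.join("node-a", bytes -> { });
        transport.join("node-b",
                bytes -> received.add(new String(bytes, StandardCharsets.UTF_8)));

        // node-a distributes a piece of session state; node-b receives it.
        transport.broadcast("node-a", "session-42=ACTIVE".getBytes(StandardCharsets.UTF_8));
        System.out.println(received);
    }
}
```

With this shape, adding a transport means writing one more `GroupTransport` implementation and registering it under a name; nothing above the interface has to change.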

So, sessionid's/state_info etc can be distributed across nodes very quickly.

The primary reason for proposing this is that I can see mobicents being integrated with a lot of frameworks now, and in the future, most of which already have either commercial transports, opensource transports or homebrew transports for shifting state/session/data between nodes.

Incidentally, this is one of the main drawbacks of using JBoss Cache at the moment, as JGroups is hardwired into it. I'll probably have a go at refactoring part of JBoss Cache (if they let me, or better still, do it themselves!) so that either hedera or something similar can be used in the ReplicationInterceptor (http://jira.jboss.com/jira/browse/JBCACHE-311). If you have any influence over this, perhaps you might be able to persuade the JBoss Cache guys to get this in for the JBossCache 1.3 release at the end of February?

So, in relation to fault tolerance, when a node fails and then recovers, the sessionid's and state are restored via a neutral messaging framework. The 'fault detector' and 'discovery service' are there to help any tiered client locate a functioning node.
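
As a rough illustration of how a fault detector and a discovery lookup can fit together, here is a minimal Java sketch based on heartbeat timeouts. `HeartbeatFailureDetector` is a hypothetical class, not something from hedera, JGroups or Mobicents, and the timeout rule is an assumption chosen for simplicity.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of hedera, JGroups or Mobicents.
final class HeartbeatFailureDetector {
    // Insertion-ordered, so nodes are tried in the order they first appeared.
    private final Map<String, Long> lastSeen = new LinkedHashMap<>();
    private final long timeoutMillis;

    HeartbeatFailureDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat message arrives from a node.
    synchronized void heartbeat(String nodeId, long nowMillis) {
        lastSeen.put(nodeId, nowMillis);
    }

    // A node is suspected to have failed once its last heartbeat is older than the timeout.
    synchronized boolean isAlive(String nodeId, long nowMillis) {
        Long seen = lastSeen.get(nodeId);
        return seen != null && nowMillis - seen <= timeoutMillis;
    }

    // "Discovery": the first known node that still looks alive, if any.
    synchronized Optional<String> firstAliveNode(long nowMillis) {
        for (String nodeId : lastSeen.keySet()) {
            if (isAlive(nodeId, nowMillis)) {
                return Optional.of(nodeId);
            }
        }
        return Optional.empty();
    }
}
```

The clock is passed in as a parameter so the behaviour is easy to test; a real deployment would feed it from the transport's own heartbeat messages.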

Hope this helps

Nick

ivelin
Offline
Joined: 2003-07-13

Hi Nick,

Your idea makes sense to me. All (or almost all) caching functionality within Mobicents is isolated in the CacheManager and TransactionManager interfaces. Eventually all cache access should be consolidated in the CacheManager.

If you are interested in looking into that and providing a pluggable mechanism for the lower-level cache or transport, feel free to submit a patch.

The preferred way to propose a similar change to JBoss Cache is to start a discussion on its development forum.
http://jboss.com/index.html?module=bb&op=viewforum&f=207

Ivelin

buzzheavyyear
Offline
Joined: 2005-06-18

OK - I'll look into it and do this.

Cheers
Nick

buzzheavyyear
Offline
Joined: 2005-06-18

Forgot to ask - are you happy if I use the hedera library, or would you prefer something specific to mobicents? The advantage of hedera is that it's maintained by a team.

Cheers
Nick

ivelin
Offline
Joined: 2003-07-13

Nick,

JBoss Cache ensures transactional state replication. This is very important for correct implementation of SLEE semantics.

Hedera seems to be a lower-level protocol, which does not take the transactional context into account. If this is the case, then we need a different abstraction layer that would allow JBoss Cache to be replaced with a different transactional replicated cache.
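
To show what "transactional" adds on top of a plain messaging layer, here is a minimal Java sketch. It is not the JBoss Cache API: `Replicator` and `TxReplicatedStore` are hypothetical names, and the sketch assumes a single transaction at a time on one thread.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical hook to whatever moves data between nodes (NOT the JBoss Cache API).
interface Replicator {
    void replicate(Map<String, String> committedChanges);
}

// Simplified store: one transaction at a time, single-threaded.
final class TxReplicatedStore {
    private final Map<String, String> committed = new HashMap<>();
    private final Map<String, String> pending = new LinkedHashMap<>();
    private final Replicator replicator;

    TxReplicatedStore(Replicator replicator) {
        this.replicator = replicator;
    }

    // Writes are buffered locally until the transaction outcome is known.
    void put(String key, String value) {
        pending.put(key, value);
    }

    // Reads see the transaction's own uncommitted writes first.
    String get(String key) {
        return pending.containsKey(key) ? pending.get(key) : committed.get(key);
    }

    // On commit, apply the buffered writes and ship them to the other nodes as one unit.
    void commit() {
        if (pending.isEmpty()) {
            return;
        }
        Map<String, String> changes = new LinkedHashMap<>(pending);
        committed.putAll(changes);
        pending.clear();
        replicator.replicate(changes);
    }

    // On rollback, nothing leaves this node.
    void rollback() {
        pending.clear();
    }
}
```

The point of the sketch is the boundary: the transport only ever sees whole committed change sets, so a rolled-back transaction never reaches the other nodes. Any replacement for JBoss Cache would have to keep that property.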

Ivelin

buzzheavyyear
Offline
Joined: 2005-06-18

OK. In that case, to keep things simple, can I propose that I focus on making JBossCache able to support different transports (if they let me), so that mobicents slee can simply continue to use JBossCache? How does this sound?

Nick

mranga
Offline
Joined: 2003-06-06

Hi Buzzheavyyear ( great nom de plume :-) ):

I worked on the original failover stuff. Why do you want to support different transports ( forgive me for jumping into the middle of this thread )? TCP/IP for jboss cache is just fine. Remember your group size is going to be small ( maybe 16 nodes max ).

Jboss cache takes care of some things for you, such as leader election and so on. It does not seem to be very reliable, however. Sometimes the runtime restore code never gets called. That's what I ran into.

The main thing currently missing from failover is you need to restart several factories on failover recovery. I did some of this ( see RuntimeRestoreTask ) but there's a lot missing here. Yes Ivelin is right on the money - keep using jboss cache. It has the transactional semantics you need.

One important question - take a close hard look at the delivered set in the event router. Does this need to be cached for reliability?

Regards,

Ranga.

buzzheavyyear
Offline
Joined: 2005-06-18

Hi Ranga (interesting profile of you on absoluteanime.com ;) )

I've worked on projects where entities already have transports (between nodes) in place, and using JBossCache with its built-in JGroups creates a big headache for integration.

As for JBossCache, I mentioned above that I pulled it apart a couple of years ago. The built-in failure detection and leader election come from JGroups, so whether your problems were related to JGroups itself or to the way JBossCache sets it up is anyone's guess. After refactoring, we incorporated another transport and had no problems with node recovery.

Thanks for the input on areas that need to be focussed on. I'm certainly in favour of keeping JBossCache as, as you say, it has a very good feature set.

Cheers
Nick

ivelin
Offline
Joined: 2003-07-13

Hi Nick,

In your past project, which transport did you replace jgroups with?

Ivelin

buzzheavyyear
Offline
Joined: 2005-06-18

Hi Ivelin

We wrote a protocol to run over (a slightly modified form of) gnunet (via jni).

Nick

fram
Offline
Joined: 2004-05-13

The SLEE 1.1 spec is about to introduce a Marshaler interface to pass events and activities between different SLEE nodes in a clustered environment to allow load balancing.
The SLEE 1.1 draft does not define a framework to do that.
I hope this answers your question.
Regards,
Francesco

> Does anyone know if the next version of the spec will
> specify any kind of inter-SLEE event routing? If not,
> I would say it would make sense to go for a Mobicents
> specific solution. This will be IMO a key feature
> which would increase support by the industry
> heavyweights.
>
> Placido

tangoc
Offline
Joined: 2006-01-03

I have been following the discussion forum on fault tolerance. I gathered from the discussion that JBoss Cache and clustering are used for fault tolerance.
My question is:
During failover, what is the sequence of steps that actually takes place? [Meaning, is it JBoss that identifies that a SLEE has failed? Is that when the replicated state information gets copied in the clustered environment?] Is it the state information [SBB entity, SBB activity etc.] that helps in continuing the service in another SLEE?

Would highly appreciate any inputs on this.
Thanks, Chandrika

ivelin
Offline
Joined: 2003-07-13

High availability and fault tolerance for SLEE in the general case can be interpreted quite vaguely.

It makes more sense to look at fault tolerance for more specific problem domains such as SIP call failover. This is incidentally the scenario that has been the focus of our most recent work (led by Ranga and Fran). Take a look at the white board, which covers more details:
http://wiki.java.net/bin/view/Communications/MobicentsHADemo

Ivelin

tangoc
Offline
Joined: 2006-01-03

Thanks for the inputs Ivelin- Chandrika

makeeasy
Offline
Joined: 2005-05-12

Hi guys,
Warning: I am new to these subjects so please bear with my ignorance :D

1) If one uses a front-end load balancer, doesn't it need to operate at a higher level (like a Layer 7 switch or a proxy) to make sure events related to the same call are handled by the same SBB in the same JSLEE?
2) How about failover? The state needs to be replicated by means of group communication (jgroups or something), so how is this done without SBBs communicating?

mranga
Offline
Joined: 2003-06-06

Hi Makeeasy,

You are asking excellent questions. Hope you are interested in taking an active role.

1. Yes, you are right: there needs to be a higher-level load balancer. In the case of SIP, one can just hash on callid and route the call to the appropriate cluster -- thus ensuring that all calls for a given call id go to the same cluster.

2. State is replicated by placing sbb entities, activity contexts and deployment state in jboss cache. Underneath, jboss cache uses jgroups, but it is tied to the jboss transaction framework, so we replicate only on tx commit.
Ranga
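
The "hash on callid" routing in point 1 can be sketched in a few lines of Java. `CallIdRouter` and the cluster names are hypothetical, and CRC32 is just one convenient stable hash; this is not the code of any existing Mobicents load balancer.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical stateless front-end router: same Call-ID always maps to the same cluster.
final class CallIdRouter {
    private final List<String> clusters;

    CallIdRouter(List<String> clusters) {
        if (clusters.isEmpty()) {
            throw new IllegalArgumentException("at least one cluster is required");
        }
        this.clusters = new ArrayList<>(clusters); // defensive copy
    }

    String route(String callId) {
        // CRC32 yields a non-negative value that does not depend on the JVM instance.
        CRC32 crc = new CRC32();
        crc.update(callId.getBytes(StandardCharsets.UTF_8));
        int index = (int) (crc.getValue() % clusters.size());
        return clusters.get(index);
    }
}
```

Because the router keeps no per-call state, any number of front-end instances configured with the same cluster list will make the same decision. Note that plain modulo hashing remaps most Call-IDs when the cluster list changes, so in-progress calls would need separate handling in that case.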

nijie8
Offline
Joined: 2005-06-09

In fact, I don't understand clustering technology clearly.
But clustering already provides load balancing and fault tolerance, so why do we need a director for load balancing? And even for SIP, it is not easy to route every message to the right node, because an SBB entity may deal with several dialogs. Some activities are created by the SBB and attached to it.

If Mobicents now provides fault tolerance, can someone tell us how to test it step by step?

If we use a SIP director to load balance the calls, what about other signalling? That's too complicated.

fram
Offline
Joined: 2004-05-13

Right now we have implemented an early attempt at fault tolerance inside the SLEE, but it has not been tested yet and my feeling is that there is a long way to go before achieving fault tolerance. The approach is based on JBoss Cache and JGroups. Essentially, all the replicated data is stored inside JBoss Cache. Regarding the scalability issue, we are thinking of using load balancing and having different SLEEs process the events.
Right now there is no plan to have SBBs from different SLEEs in different JVMs interact with each other.
Any help to test and debug fault tolerance would be really appreciated.
Regards,
Francesco

> Hi all,
>
> I am trying to get a first grasp on the SLEE spec and
> the mobicents implementation. (By the way I would
> like to congratulate all the development team for
> this achievement!)
>
> A few questions regarding reliability and
> scalability:
> Are there any plans to develop a fault tolerance and
> scalability framework for mobicents, such as
> OpenCloud's Savanna?
> Can JBoss clustering be used with mobicents?
> Can SBBs in different SLEEs (in different
> JVMs/machines) interact in an EJB-like fashion? How?
>
> Thank you,
> Placido

jfkavaka
Offline
Joined: 2005-01-13

> Regarding the scalability issue, we are thinking of using
> load balancing and having different SLEEs process the
> events.

How would you do load balancing? Does this require a SLEE event router capable of distributing events over different JVMs (over the network)? I don't think the current version of the spec covers this.

For instance, in order to have a scalable platform, one could have at least two approaches:
- put all service logic in each SLEE and then do some kind of load balancing based on resource adaptor events (e.g. SIP load balancing), but this would require an additional load balancing component and does not cover the general case of distributing SLEE event handling.
- distribute SBBs over different machines, but then we need some kind of (ideally standard) way of routing events over different SLEEs.

best regards,
Placido

fram
Offline
Joined: 2004-05-13

The idea right now is to adopt your first solution, with an external load balancer, but our next goal is fault tolerance, not load balancing. The SLEE architecture allows the second solution that you propose, but it is implementation specific and Mobicents right now does not support it.

Message was edited by: fram

jfkavaka
Offline
Joined: 2005-01-13

Does anyone know if the next version of the spec will specify any kind of inter-SLEE event routing? If not, I would say it would make sense to go for a Mobicents specific solution. This will be IMO a key feature which would increase support by the industry heavyweights.

Placido

mranga
Offline
Joined: 2003-06-06

Inter SLEE event routing (not sure what you mean actually but perhaps you mean a distributed slee architecture) is not required for load balancing. You need something like a front end stateless proxy server to direct requests to a cluster.

Incidentally, if you are interested in contributing, failover is indeed a good area to work in.

Ranga

Message was edited by: mranga