Posted by cos
on March 29, 2007 at 4:53 PM PDT
I hope this article will help scratch the surface of software reliability issues. It is a brief discussion of established reliability practices and what needs to change.
It seems the process of bringing Java under an open source license has
raised the question of Java platform quality even higher. In
particular, the question of reliability has been discussed more and
more widely among my peers over the last few months. Thus, I decided to
share a couple of thoughts on the topic. Hopefully, you'll like what
you're about to see.
What reliability means for us.
According to the IEEE definition, reliability is "The ability of a system
or component to perform its required functions under stated conditions
for a specified period of time."
First of all, I'd like to emphasize the words "required functions",
"stated conditions", and "specified period". I also want to add
"repetitively". I believe it will become clear later why I've focused
on these.
The majority of software reliability studies pay a good deal of
attention to the amount of time a system or component can perform
without a failure. Most of them deal with different kinds of
fault/time distributions, estimations of failure intensity, failure
likelihood probabilities, failure intervals, and such. I beg your
pardon for a rather long citation: "...There is a lot of lore about
system testing, but it all boils down to guesswork. That is, it is
guesswork unless you can structure the problem and perform the testing
so that you can apply mathematical statistics. If you can do this,
you can say something like "No, we cannot be absolutely certain that
the software will never fail, but relative to a theoretically sound
and experimentally validated statistical model, we have done
sufficient testing to say with 95-percent confidence that the
probability of 1,000 CPU hours of failure-free operation in a
probabilistically defined environment is at least 0.995." When you do
this, you are applying software-reliability measurement."
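To make the quoted claim a bit more tangible, here is a small sketch of the math behind it, assuming the simplest constant-failure-rate (exponential) model; the class and method names are mine, purely for illustration:

```java
// Sketch of the statistical reasoning behind the quoted claim, under a
// simple exponential (constant failure rate) model. Real reliability
// models are more elaborate; this is just an illustration.
public class ReliabilityBound {
    // Upper confidence bound on the failure rate (failures/hour) after
    // t failure-free CPU hours: lambda <= -ln(1 - confidence) / t.
    static double lambdaUpperBound(double failureFreeHours, double confidence) {
        return -Math.log(1.0 - confidence) / failureFreeHours;
    }

    // Probability of surviving `missionHours` at failure rate lambda:
    // R(t) = exp(-lambda * t).
    static double reliability(double lambda, double missionHours) {
        return Math.exp(-lambda * missionHours);
    }

    public static void main(String[] args) {
        // How many failure-free test hours are needed before we can claim,
        // with 95% confidence, R(1000 h) >= 0.995?
        double targetLambda = -Math.log(0.995) / 1000.0;     // ~5.0e-6 per hour
        double hoursNeeded = -Math.log(0.05) / targetLambda; // ~6e5 hours
        System.out.printf("Need ~%.0f failure-free CPU hours%n", hoursNeeded);

        double lambda = lambdaUpperBound(hoursNeeded, 0.95);
        System.out.printf("R(1000 h) >= %.4f%n", reliability(lambda, 1000.0));
    }
}
```

Under that assumption, claiming R(1000 h) >= 0.995 with 95% confidence requires roughly 600,000 failure-free CPU hours of testing, which illustrates why statistical reliability claims are expensive to earn.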
With no attempt to underestimate or undermine such studies, and
being in full agreement with the absolute necessity of statistical modeling
and verification of software testing, I want to talk about a wider
approach. It isn't perhaps a brand new one, but it might be slightly
different from what you've seen so far.
Different takes on quality.
I love to talk about quality, mostly because it is a very vast topic
and one can sell some nonsense :-)
As I see it, there are two main quality approaches. I'll call them the
hardware and software types. The main differences between them
come from the production cycle specifics of devices and
applications. Namely:
- hardware production has much higher costs because of complex
  factory processes, the complicated and costly equipment involved, et
  cetera. Thus, you'd better be careful with how a device's
  components are designed, produced, assembled, and
  tested. It might cost a fortune to make changes to a silicon chip,
  a motherboard design, or a car once it's out the door.
  Eventually, hardware development is treated with more "respect"
  and precise planning because of the high up-front investment.
- software, on the other hand, usually has a more flexible life span:
  targets are sometimes easily moved during the development
  process, requirements change, design documents might be
  somewhat informal, spec changes might not be well tracked down to
  real application defects, the quality process falls behind,
  and on, and on... At the end of the day a software application
  reaches its customers and they start finding bugs in it. Then an
  escalation arises, and the product's sustaining team has
  to spend time mirroring the customer's setup, repeating all the steps to
  reproduce a defect, etc. Consider yourself lucky if all of
  this can be done in just one interaction. However, if the defect
  report wasn't detailed enough or the setup was way too
  sophisticated, you might spend months nailing down a particular
  problem. We've all seen this many times, right?
Our take on the problem is a mix of the two above. I wanted to
take the best parts of hardware reliability
and bring them over to the software side wherever possible. Here's what I
see as the necessary steps:
1) Design and architectural reviews (many teams are doing this already).
1a) Tracking correlations between architectural decisions, changes, and resulting defects.
2) Mean-Time-To-Failure (MTTF) testing. A quality department can run
some, preferably standardized, applications for a prolonged period
of time to demonstrate the stability of the software
platform. However, a better approach would be to run scenario-based
MTTF tests.
3) Employing statistical analysis of quality trends.
4) Enforcing static analysis evaluations on a periodic basis.
5a) Scenario-based MTTF testing. Normally, one can gather a few (maybe
a hundred or so) typical usage scenarios for a software
application. The number is likely to be much higher for a software
platform like Java. These scenarios might be simulated or
replicated with a test harness of choice and a specific set of
existing or newly developed tests. Of course, you might not be
able to simulate all of these real-life scenarios with 100%
accuracy, but that's not always necessary. These scenarios should
then be executed repeatedly and their pass/fail rate
tracked over time.
5b) Scenario completeness. Using a list of features utilized
during a scenario execution, and static analysis results, one can
tell which parts of a software application will be touched during
a particular scenario's run. Using code coverage methods you can
find out which parts of the scenario's functionality are covered
or not. With something similar to BSP
you can leverage the improvement efforts, but this is another
story and it's been covered already.
6) Quality trends monitoring. The results of #5a should be tracked and analyzed over time.
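To illustrate steps 5a and 6, here is a minimal sketch of what a scenario runner that tracks pass/fail rates might look like. This is my own toy code, not any real harness, and the scenario names are made up:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;
import java.util.function.BooleanSupplier;

// Toy scenario-based MTTF harness: run each scenario repeatedly and
// track its pass/fail rate. All names here are hypothetical.
public class ScenarioHarness {
    // scenario name -> {passes, total runs}
    private final Map<String, int[]> stats = new LinkedHashMap<>();

    void run(String name, BooleanSupplier scenario) {
        int[] s = stats.computeIfAbsent(name, k -> new int[2]);
        s[1]++; // total runs
        try {
            if (scenario.getAsBoolean()) s[0]++; // passes
        } catch (RuntimeException crash) {
            // a crash counts as a failure; a real harness would log it
        }
    }

    double passRate(String name) {
        int[] s = stats.get(name);
        return (s == null || s[1] == 0) ? 0.0 : (double) s[0] / s[1];
    }

    public static void main(String[] args) {
        ScenarioHarness h = new ScenarioHarness();
        Random rnd = new Random(42);
        for (int i = 0; i < 1000; i++) {
            h.run("send-email", () -> true);                   // always passes
            h.run("bulk-import", () -> rnd.nextInt(100) < 95); // ~95% pass
        }
        System.out.printf("send-email: %.3f, bulk-import: %.3f%n",
                h.passRate("send-email"), h.passRate("bulk-import"));
    }
}
```

A real harness would persist these counters per build, so the rates can be charted over time for step 6.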
When I communicate these steps to my peers and colleagues, I hear
a number of concerns. Typically, these are:
- how #2 is connected to reliability
- #4 seems to be an overstretch
- how can you be sure that #5a is the same as running heavyweight
  applications to verify your platform's stability/reliability
Hopefully, I'll be able to answer these, or any other questions you
might send to me as comments.
1) Why design and architectural reviews? Long story short: they
keep bad solutions out of your system. Proven practices
usually guarantee fewer last-minute changes at the
development stage. Thus, the testing burden will be lower, as
will the number of regressions, customer escalations, etc.
What about 1a)? I don't know - it just sounds cool, I guess :-)
2) Everybody seems to be doing this, so why don't we..? Seriously,
this is one of the aspects of reliability you want to count, because
it is backed up by well-developed theory and years of practice, and
it is a meaningful quantitative metric.
3) Not sure why? Just read some of those books, will ya? ;-)
4) Static analysis is capable of finding types of defects that
aren't likely to be discovered at runtime. That happens
because, for complex systems, you can't guarantee coverage of the
Cartesian product of the input and output state sets. However, some
of the nasty bugs tend to hide right in those dusty
corners, which you, or one of your customers, might only hit once
in a while. Thus, if your designated static analyzers run
cleanly on every build, you can at least
demonstrate that your software doesn't leak memory or run out of file
handles. Consider that reliable also means trustworthy.
One might say that you can track memory leaks with runtime
monitoring. True. But how are you going to find and fix them now?
5a) This gives you the determinism of testing repetitiveness, which is
likely to be missed with the BigApps approach discussed later.
5b) is complementary to 5a.
6) You want to know whether your development/quality processes are improving or degrading over time.
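As an illustration of step 6, a crude way to tell whether a pass-rate series is improving or degrading is to fit a least-squares line through it and look at the sign of the slope. The weekly numbers below are invented:

```java
// Sketch of quality trend monitoring: fit a least-squares line through
// a series of pass rates and report the direction of the trend.
public class QualityTrend {
    // Slope of the best-fit line y = a + b*x for points
    // (0, y[0]), (1, y[1]), ..., (n-1, y[n-1]).
    static double slope(double[] y) {
        int n = y.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += i;
            sy += y[i];
            sxx += (double) i * i;
            sxy += i * y[i];
        }
        return (n * sxy - sx * sy) / (n * sxx - sx * sx);
    }

    public static void main(String[] args) {
        double[] weeklyPassRate = {0.91, 0.93, 0.92, 0.95, 0.96, 0.97};
        double b = slope(weeklyPassRate);
        System.out.printf("trend: %+.4f per week -> %s%n",
                b, b >= 0 ? "improving" : "degrading");
    }
}
```

A positive slope suggests the process is improving; in practice you'd also want a significance test on the slope before acting on the trend.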
And finally, I'd like to mention several common reliability
approaches and explain why I see them as insufficient on their own.
1) Stress testing
This one is most often confused with the concept of
reliability. The reason is perhaps clear: one might
expect a reliable system to work in a wide variety of
conditions and perform its functions well.
You can hear word on the street that "...Microsoft Windows is
unreliable." Hell, yes. It sure isn't reliable if you try to debug a huge
C++ project, process some statistical data, receive a bunch of spam
emails, and install 20+ security fixes from the update center all at
the same time. It will likely crash and destroy some of your
files, or it might hang nicely. Or you'll suffer some critical
performance degradation. I can't tell for sure, 'cause I'm not one
of those lucky Windows users. And I'm not trying to make fun of
Windows - people do that with their computers on a daily basis
much better than I can even dream of :-) My point is that
the scenario above is a bit extreme and well beyond an average
Windows user's capabilities or, perhaps, desire.
However, normally your Visual C++ debug session will go
smoothly in probably 95% of cases (though I once worked on a
C# project which crashed my development machine to a BSoD on every
load; the attempt to load it again after the crash was always a
success. Weirdo...). Did you ever count how many times your email
client worked well when you were sending your emails? Perhaps not,
but I'm sure almost everyone has a story to tell about how badly
corrupted their address book was the last time Outlook crashed, right?
Correct data series processing and feature usage information
gathering can relatively easily demonstrate that Outlook is a
reliable application. It has, say, 93.5% failure-free behavior over
every 10 hours of execution. But it is hard to guarantee that this
application will survive under some monstrous load conditions.
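As a back-of-the-envelope check of that Outlook example: under a constant-failure-rate assumption (my simplification, not part of the example), 93.5% failure-free behavior over 10-hour windows implies an MTTF of roughly 149 hours:

```java
// Convert a failure-free percentage over a fixed window into an implied
// MTTF, assuming a constant failure rate: R(t) = exp(-t/MTTF).
public class MttfFromSurvival {
    static double mttf(double survivalProb, double windowHours) {
        // R(t) = exp(-t/MTTF)  =>  MTTF = -t / ln(R(t))
        return -windowHours / Math.log(survivalProb);
    }

    public static void main(String[] args) {
        System.out.printf("implied MTTF: %.0f hours%n", mttf(0.935, 10.0));
    }
}
```

Which also shows the flip side of the argument: a number like 93.5% says something precise about typical operation, but nothing about behavior under monstrous load.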
2) BigApps testing
The concept of BigApps testing consists of running some bulky
commercial applications to derive the MTTF of, usually, a software
platform. Well, I see a threefold problem here (I'm sure there are more,
but I'll let you deduce them on your own:)
1. Any BigApp run is only as good as that application's typical
   utilization (or usage scenarios) of your platform's features.
2. The correctness of the exercised application itself might be
   questionable.
3. Results you see at the completion of a run should be
   considered accountable to that particular application. If you
   were running a PeopleSoft system for a week and
   demonstrated an MTTF of 140 hours, that is great... for the PeopleSoft
   marketing and PR team, but not that useful for your
   development organization. It gives them little handy
   info. Although, if a crash does occur, the engineering team
   can discover some really bad problem in the code and fix
   it. Which is the rare case of a non-zero-sum game!
It might be a cool marketing or sales tool to use on customers,
but it might not be as great for engineers.
I hope this article helped to scratch the surface of this
problem. Please let me know what you think, pinpoint the gaps in my
logic, or just yell at me if you think I'm wrong. Let's communicate
about this. Maybe we'll work something out that we all can use later
for the applications and products we develop.
References:
1. IEEE Std 982.2-1988 (withdrawn in 2002).
2. John D. Musa, A. Frank Ackerman, "Quantifying Software Validation: When to Stop Testing?", IEEE Software, 1989.
This post is also available at my permanent blog spot.