Posted by jonathansimon
on March 14, 2005 at 12:08 PM PST
What happens when software goes wrong in the worst possible way, and what you can learn from it.
I sent the googlezon movie John Mitchell posted on his blog last week to a few friends. My wife thought it was pretty cool/scary/etc. and did some snooping in her down time. She pointed me to an incredibly scary story about what happens when software goes wrong: the story of the Error 54 on the Therac-25.
As scary as the fictional look at our googlezon future is, the real-life current state of software is even scarier -- in this case causing people's deaths. The Therac-25 was a computerized radiation therapy machine in the 1980s. Essentially, it had serious software issues, combined with poor testing and a general lack of understanding, that led to the over-radiation and deaths of several patients. The article gives a bone-chilling chronological history of the system errors that occurred, the debugging by the hospital scientists, the communication between the hospitals and the vendor -- and the eventual over-radiation and deaths of several cancer patients.
From the article...
"It is still a common belief that any good engineer can build software, regardless of whether he or she is trained in state-of-the-art software-engineering procedures. Many companies building safety-critical software are not using proper procedures from a software-engineering and safety-engineering perspective."
I talk every now and again about software quality (or the lack thereof), and I thought this would be a good reminder to us all. Obviously, most of us aren't working on life-critical systems -- but the point is the same.
On a related note, you also know that I am a strong advocate of good interaction design. So, while I'm on this chilling tale of system malfunction, I also want to bring up another story about a system that caused a death because of poor design.
This is the story of John Denver dying in an airplane that was functioning exactly as planned. John was flying a plane designed by Burt Rutan, the designer of the Voyager -- the first aircraft to circumnavigate the globe without refueling. The problem apparently stemmed from design decisions the builder made that caused serious confusion in the plane's user interaction, eventually leading to John's death.
The reason I point this out is that it was not an engineering issue. In fact, as Tog points out, "One of his Long EZ planes, similar to John Denver's, holds the altitude record for conventional aircraft. It is a brilliant design, and is well respected in the aviation community." So everything was working as planned, except there was an interaction design issue so severe it was actually able to kill its user. Even if you have a provably correct, functioning system, you're only halfway there.
I'll wrap up with a few thoughts from the Therac-25 paper.
"If we are to prevent such accidents in the future, we must dig deeper. Most accidents involving complex technology are caused by a combination of organizational, managerial, technical, and, sometimes, sociological or political factors ... Although these accidents occurred in software controlling medical devices, the lessons apply to all types of systems where computers control dangerous devices. In our experience, the same types of mistakes are being made in nonmedical systems."
And why I wrote this blog in the first place...
"We must learn from our mistakes so we do not repeat them."