The Learning Zone, BBC Prime

LBS/19-11-2003

To Engineer Is Human

The Hyatt Hotel collapse

“The hotel lobby was crowded for the Friday evening tea-dance. Some of the crowd were on the walkways above the lobby. The strain was too great for the third-floor walkway, which collapsed onto the walkway below it. That too gave way, and tons of concrete and steel fell on to the walkway below. The victims had no time to get away; according to eyewitnesses, they just stood still, transfixed with terror, as the collapsing structure fell on them.”

Challenger explosion

AT&T failure

In America, AT&T isn’t the only long-distance company, but it’s the largest and claims to be a world leader in communications technology. Much to their embarrassment, when the system ground to a halt, at least 50 million long-distance calls were lost. It was time-wasting and frustrating.

London ambulance service

The government has ordered an enquiry into the London ambulance service after its computerized call-out system collapsed at the beginning of this week. The new system was supposed to allocate emergency calls more efficiently. Instead, it delayed dozens of calls for hours, resulting, the unions say, in between 10 and 20 deaths.

Only a coroner’s court will decide if Christine Daunting’s husband was one of the victims but she believes he was. Just before 10 o’clock yesterday morning, she dialled 999 because her husband Roger was choking. But it took five or six calls before an ambulance finally arrived at 12.20. By then, her husband was dead.

Disasters like these were caused by engineering failures. Why did they happen, and how can engineers prevent them from happening again?

Within months of opening, two crowded walkways suspended across this vast hall collapsed.

The subsequent enquiry identified just two contributory factors.

The architectural drawings reveal a novel design: the cross-beams on which the two walkways rested were to be supported by a single long rod attached to the ceiling, each cross-beam supporting one walkway. However, in implementing the design, the construction company found it hard to assemble the walkways with a single steel rod, so the connection system was fatally modified. The single rod was replaced by two shorter rods, the bottom rod supporting the lower beam, while the top rod supported the upper beam. But this new arrangement meant the upper beam now bore most of the weight of both walkways, a weight that it could barely take unloaded. The additional weight of people on the walkways that night was too much.
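
The effect of the change can be checked with simple statics. The sketch below is illustrative only: the loads are made-up figures in arbitrary units, the function names are hypothetical, and each walkway is treated as a single load hanging at one connection, which is a simplification of the real structure.

```python
# Illustrative statics for the two hanger-rod arrangements.
# All loads are hypothetical values in arbitrary units.

def loads_single_rod(w_upper: float, w_lower: float) -> dict:
    """Original design: one continuous rod runs from the ceiling through
    the upper beam down to the lower beam. Each beam rests on its own
    fixing on the rod, so each connection carries only its own walkway."""
    return {
        "upper_beam_connection": w_upper,
        "lower_beam_connection": w_lower,
        "rod_at_ceiling": w_upper + w_lower,
    }


def loads_two_rods(w_upper: float, w_lower: float) -> dict:
    """Modified design: the bottom rod hangs from the upper beam, so the
    upper beam's connection to the top rod carries both walkways."""
    return {
        "upper_beam_connection": w_upper + w_lower,
        "lower_beam_connection": w_lower,
        "rod_at_ceiling": w_upper + w_lower,
    }


if __name__ == "__main__":
    w = 10.0  # hypothetical load of one walkway
    print(loads_single_rod(w, w))
    print(loads_two_rods(w, w))
```

In this simplified model the load at the ceiling is the same in both arrangements, but the load on the upper beam's connection doubles when the walkways weigh the same.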

The Challenger disaster was caused by faults in the design of the booster rockets. The shuttle was launched by the ignition of the two booster rockets. They’re packed with a rubber-like solid fuel which burns at about 6000 degrees Fahrenheit. The booster rockets were made in Utah, some 2000 miles away from the launch pad. To make them easy to transport, each booster had to be built in 4 separate sections. Each section was connected to the next by joints containing two rubber O-rings, surrounded by asbestos putty. When the motor ignited, the putty should be forced along the joint, pushing the O-ring into place and sealing it. If the first seal failed, the second O-ring should act as a backup. In Challenger, both rings and the putty failed, letting the hot gases escape and causing the explosion.

NASA engineers said afterwards there were two main reasons why the seals failed. Firstly, routine tests had made holes in the putty, which weakened it. And secondly, freezing temperatures on the day of the launch hardened the rubber in the O-rings so that they didn’t move to seal the joints.

It’s easy to see how mechanical faults, hardware faults, can lead to disaster. The failure of the O-rings on the booster rockets and the collapse of the Hyatt Regency Hotel walkways are clearly engineering failures. But engineering isn’t just about hardware, it’s also about software.

AT&T learned the hard way about what can happen when software engineering goes wrong.

Initially the problem was described as an anomaly, but its scale soon became clear. Investigators traced the failure of the telephone network to a fault in the new computer-controlled switching system called SS7. The problem began at just one switching centre in New York, where a minor mechanical malfunction tripped the switch. The controlling computer sent messages to over 100 other centres nationwide, instructing them not to route new calls to New York until the switch there had been reset. Six seconds later the reset was completed, and the New York switching centre recommenced sending out long-distance calls. As each switching centre’s computer received the calls, it updated its software to renew its routing to New York. Unfortunately, the timing between the calls somehow triggered a software fault at the receiving centres. And although each switching system had been designed with a backup computer, the backup computers reacted in an identical way: they too shut down. This sequence of faults snowballed: as each centre came back on line, the long-distance calls it sent triggered faults at receiving centres, causing them to shut down to reset.

Although the network had been designed to prevent catastrophic failures by duplicating nearly every piece of equipment, the computer architecture couldn’t cope when a software error struck both primary and backup computers at once. But why hadn’t this failure been detected before the system went on line?
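
Why duplication did not help can be shown with a toy model. The sketch below is not the real switching software: the function names, the timing threshold and the message format are all invented for illustration. It only shows that a backup running the same code fails on the same input as the primary.

```python
def buggy_handler(gap_ms: int) -> str:
    """Hypothetical switch software with a timing-dependent fault:
    two messages arriving too close together crash it."""
    if gap_ms < 10:  # made-up threshold standing in for the real fault
        raise RuntimeError("software fault")
    return "routed"


def independent_handler(gap_ms: int) -> str:
    """Hypothetical backup written separately, without the fault."""
    return "routed"


def process(gap_ms: int, primary, backup) -> str:
    """Try the primary computer, then fall over to the backup."""
    for name, unit in (("primary", primary), ("backup", backup)):
        try:
            return f"{name}: {unit(gap_ms)}"
        except RuntimeError:
            continue  # this unit has shut down, try the next one
    return "centre down"


if __name__ == "__main__":
    # Identical software on both computers: the same input fells both.
    print(process(5, buggy_handler, buggy_handler))
    # A backup that does not share the fault keeps the centre up.
    print(process(5, buggy_handler, independent_handler))
```

Duplicated hardware protects against faults that strike one unit at random. A software fault is deterministic, so an identical copy given the same input repeats the same failure.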

Martyn Thomas, Chairman, Praxis Touche Ross: “It’s very hard to test a system, and by doing so to show that it really does work. The only way that you could imagine doing it would be either to test it exhaustively, which might, for even a small system, involve running billions of tests, or to test it statistically, in which case firstly you’re assuming that you know the operating conditions you will work under, and secondly, you need to test for a very long time. If you’re looking for a failure rate of once a year, you need to test perhaps for ten years, perhaps for a hundred years, to get a 99% confidence that actually you will achieve that one year without failure, and you need to test for that length of time with no faults being found.”
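
The arithmetic behind statistical testing can be sketched as follows. This assumes a constant failure rate (an exponential model), which is a common simplification and not necessarily the model the speaker had in mind; the function name is hypothetical. Under that assumption, the chance of seeing no failure in a test of length T is exp(-rate × T), so failure-free testing supports the claim "the rate is at most this value" at confidence C only once T reaches -ln(1 - C) / rate.

```python
import math


def failure_free_test_time(target_rate: float, confidence: float) -> float:
    """Length of failure-free testing needed to support the claim that
    the failure rate is at most `target_rate`, at the given confidence.

    Assumes a constant failure rate (exponential model). The result is
    in the reciprocal of the rate's unit: years if the rate is per year.
    """
    if target_rate <= 0 or not 0 < confidence < 1:
        raise ValueError("need target_rate > 0 and 0 < confidence < 1")
    return -math.log(1.0 - confidence) / target_rate


if __name__ == "__main__":
    # At most one failure per year, 99% confidence: roughly 4.6 years.
    print(failure_free_test_time(1.0, 0.99))
    # At most one failure per hundred years: roughly 460 years.
    print(failure_free_test_time(0.01, 0.99))
```

The required time grows in proportion to how rare the failure must be, and any fault found during the test restarts the clock.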

A typical telephone system contains millions of lines of code. In the SS7 system, there are around 10 million lines, far too many to be able to check them all.

The London ambulance service deals with almost 2000 emergency calls every day. In the autumn of 1992, the London ambulance service installed a new computerized system to manage the dispatch of vehicles in response to calls.

Initially all worked well, but within weeks the system failed with tragic consequences. The report on the enquiry identified two major reasons, both involving the computers known as file servers. The first was a minor programming error. The report states that “in carrying out some work on the system, some three weeks previously, the programmer had inadvertently left in the system a piece of program code that caused a small amount of memory within the file server to be used up and not released every time a vehicle mobilisation was generated by the system. Over a three-week period, these activities had gradually used up all available memory, thus causing the system to crash.”
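
The kind of fault the report describes is a memory leak. The sketch below is a toy model, not the ambulance service's code: the class, the memory budget and the amount leaked per mobilisation are all invented. It shows why a small leak passes a short test and still brings the system down after enough use.

```python
class FileServer:
    """Toy file server with a fixed memory budget (hypothetical numbers)."""

    def __init__(self, total_memory: int, leak_per_mobilisation: int):
        self.free_memory = total_memory
        self.leak = leak_per_mobilisation

    def mobilise_vehicle(self) -> None:
        # The fault: a little memory is claimed for every mobilisation
        # and never released.
        if self.free_memory < self.leak:
            raise MemoryError("file server out of memory")
        self.free_memory -= self.leak


def mobilisations_until_crash(server: FileServer) -> int:
    """Count how many mobilisations succeed before the server crashes."""
    count = 0
    while True:
        try:
            server.mobilise_vehicle()
        except MemoryError:
            return count
        count += 1


if __name__ == "__main__":
    # A short test run of ten mobilisations shows nothing wrong.
    short_test = FileServer(total_memory=1000, leak_per_mobilisation=3)
    for _ in range(10):
        short_test.mobilise_vehicle()
    print("free after short test:", short_test.free_memory)

    # Sustained use exhausts the memory.
    print(mobilisations_until_crash(FileServer(1000, 3)))
```

Each call works correctly on its own, so the fault only shows up when the system runs for long enough under realistic load.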

The second reason was that although the original specification included the provision of another computer to act as a backup file server to take over in the event of problems, the fallback to the second server was never implemented.

The enquiry also determined that the programming error would not have been detected through conventional programmer or user testing and that it was caused by carelessness and lack of quality assurance.

“When software fails, the consequences can be minor, or they can be catastrophic. It depends what the computer system was designed to be doing. For example, there have been systems in hospitals that have failed, and the patient who was being monitored by a computer system got into some distress, the alarms were not triggered as they should have been, the patients have suffered, perhaps some patients have died under the circumstances.”

Radiotherapy began a hundred years ago with the advent of the X-ray. Here was a high-energy source of radiation that could penetrate tissue and destroy body cells. What turned this inherently destructive process into a cancer cure was the fact that rapidly multiplying tumor cells were much more vulnerable to radiation than normal ones. Radiotherapists exploit this difference by attacking tumors either with a powerful external beam, or by placing a radioactive source inside the patient. At least 200,000 people will receive some form of radiotherapy in Britain this year.

However there have been failures with the radiation dosages that patients received.

Carol O’Keefe: “It’s just non-existent. I can’t do anything. I can’t go shopping, I can’t do the housework. I find it hard to concentrate to read a book, I can’t do anything.”

What left Carol almost totally incapacitated was not the cancer but the cure. Eight high-dose treatments of internal radiotherapy killed off the cancer cells, but the radiation badly burned her pelvic area.

Nowadays, many X-ray machines are controlled by software. Between June 1985 and January 1987, 6 people in the United States died following treatment by a machine called Therac-25. Two problems were highlighted. The first was that the machine ignored amendments to the original settings if the operator made the corrections before the machine had finished responding to the first set. So operators thought they had set a safe radiation dosage when they hadn’t.
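
The first problem can be pictured with a toy model. The sketch below is not the machine's actual software: the class, the time units and the dose values are invented. It only illustrates how an edit made while the machine is still busy can change what the operator sees without changing what the machine will deliver.

```python
class TreatmentConsole:
    """Toy model of a console whose settings are latched at the start
    of setup. All names and numbers here are hypothetical."""

    SETUP_TICKS = 8  # made-up time the machine needs to respond

    def __init__(self):
        self.displayed_dose = None  # what the operator sees on screen
        self.machine_dose = None    # what the machine will deliver
        self._busy_ticks = 0

    def enter_dose(self, dose: int) -> None:
        self.displayed_dose = dose
        if self._busy_ticks == 0:
            self.machine_dose = dose
            self._busy_ticks = self.SETUP_TICKS
        # The fault: while the machine is busy, the edit reaches the
        # screen only and is never passed on to the machine.

    def tick(self, n: int = 1) -> None:
        """Advance time by n ticks."""
        self._busy_ticks = max(0, self._busy_ticks - n)


if __name__ == "__main__":
    fast = TreatmentConsole()
    fast.enter_dose(200)   # wrong setting entered
    fast.tick(3)           # operator corrects it quickly
    fast.enter_dose(20)
    print(fast.displayed_dose, fast.machine_dose)

    slow = TreatmentConsole()
    slow.enter_dose(200)
    slow.tick(8)           # machine has finished responding
    slow.enter_dose(20)
    print(slow.displayed_dose, slow.machine_dose)
```

In this model a fast correction leaves the screen showing the safe value while the machine keeps the original one, and a slow correction goes through, so the fault appears only with quick operators.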

The second problem was involved with giving the treatment. There were three possible positions for the turntable containing the machinery, controlled by software via three micro-switches. Each position corresponded to one of three modes: light mode, where a simulated beam allowed the operator to position the patient; X-ray mode, where patients received small doses of high-intensity radiation; and electron mode, where patients received larger doses of low-intensity radiation.

Raw X-ray beams are very dangerous, and so in electron therapy the beam is scattered by scan magnets to a safe concentration. However, in X-ray mode the beam must be flattened to alleviate the much higher radiation dosage, but faults in the micro-switch codes meant some patients were given the wrong dosages of radiation. They received electron mode treatment, but the beam was set at a much higher X-ray mode level of intensity.
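
The mismatch described here, a turntable in one mode with the beam set for another, is the kind of condition a consistency check can refuse. The sketch below is purely illustrative: the mode names follow the text, but the beam levels and the check itself are assumptions, not a description of how the machine worked or how it was later fixed.

```python
from enum import Enum


class Mode(Enum):
    LIGHT = "light"
    XRAY = "xray"
    ELECTRON = "electron"


# Hypothetical beam level appropriate to each turntable position
# (arbitrary units, chosen only to make the mismatch visible).
EXPECTED_BEAM_LEVEL = {Mode.LIGHT: 0, Mode.XRAY: 100, Mode.ELECTRON: 1}


def safe_to_fire(turntable: Mode, beam_level: int) -> bool:
    """Allow the beam only if its level matches the turntable position."""
    return beam_level == EXPECTED_BEAM_LEVEL[turntable]


if __name__ == "__main__":
    # The failure in the text: electron position, X-ray beam level.
    print(safe_to_fire(Mode.ELECTRON, 100))
    # Matching position and level.
    print(safe_to_fire(Mode.ELECTRON, 1))
```

A check like this is only as good as its reading of the turntable position, which is why faults in the micro-switch codes were so dangerous.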