The Therac-25 was not a device anyone was happy to see. It was a radiation therapy machine. In layman’s terms it was a “cancer zapper”; a linear accelerator with a human as its target. Using X-rays or a beam of electrons, radiation therapy machines kill cancerous tissue, even deep inside the body. These room-sized medical devices would always cause some collateral damage to healthy tissue around the tumors. As with chemotherapy, the hope is that the net effect heals the patient more than it harms them. For six unfortunate patients in 1986 and 1987, the Therac-25 did the unthinkable: it exposed them to massive overdoses of radiation, killing four and leaving two others with lifelong injuries. During the investigation, it was determined that the root cause of the problem was twofold. Firstly, the software controlling the machine contained bugs which proved to be fatal. Secondly, the design of the machine relied on the controlling computer alone for safety. There were no hardware interlocks or supervisory circuits to ensure that software bugs couldn’t result in catastrophic failures.
The case of the Therac-25 has become one of the most well-known killer software bugs in history. Several universities use the case as a cautionary tale of what can go wrong, and how investigations can be lead astray. Much of this is due to the work of [Nancy Leveson], a software safety expert who exhaustively researched the incidents and resulting lawsuits. Much of the information published about the Therac (including this article) is based upon her research and 1993 paper with [Clark Turner] entitled “An Investigation of the Therac-25 Accidents”. [Nancy] has since published updated information in a second paper which is also included in her book.
History and development
The Therac-25 was manufactured by Atomic Energy of Canada Limited (AECL). It was the third radiation therapy machine by the company, preceded by the Therac-6 and Therac-20. AECL built the Therac-6 and 20 in partnership with CGR, a French company. When the time came to design the Therac-25, the partnership had dissolved. However, both companies maintained access to the designs and source code of the earlier machines. The Therac-20 codebase was developed from the Therac-6. All three machines used a PDP-11 computer.
Therac-6 and 20 didn’t need that computer, though. Both were designed to operate as standalone devices. In manual mode, a radiotherapy technician would physically set up various parts of the machine, including the turntable to place one of three devices in the path of the electron beam. In electron mode, scanning magnets would be used to spread the beam out to cover a larger area. In X-ray mode, a target was placed in the electron beam with electrons striking the target to produce X-ray photons directed at the patient. Finally, a mirror could be placed in the beam. The electron beam would never switch on while the mirror was in place. The mirror would reflect a light which would help the radiotherapy technician to precisely aim the machine.
On the Therac-6 and 20, hardware interlocks prevented the operator from doing something dangerous, say selecting a high power electron beam without the x-ray target in place. Attempting to activate the accelerator in an invalid mode would blow a fuse, bringing everything to a halt. The PDP-11 and associated hardware were added as a convenience. The technician could enter a prescription in on a VT-100 terminal, and the computer would use servos to position the turntable and other devices. Hospitals loved the fact that the computer was faster at setup than a human. Less setup time meant more patients per day on a multi-million dollar machine.
When it came time to design the Therac-25, AECL decided to go with computer control only. Not only did they remove many of the manual controls, they also removed the hardware interlocks. The computer would keep track of the machine setup and shut things down if it detected a dangerous situation.
The accidents
The Therac-25 went into service in 1983. For several years and thousands of patients there were no problems. On June 3, 1985, a woman was being treated for breast cancer. She had been prescribed 200 Radiation Absorbed Dose (rad) in the form of a 10 MeV electron beam. The patient felt a tremendous heat when the machine powered up. It wasn’t known at the time, but she had been burned by somewhere between 10,000 and 20,000 rad. The patient lived, but lost her left breast and the use of her left arm due to the radiation.
On July 26, a second patient was burned at The Ontario Cancer Foundation in Hamilton, Ontario, Canada. This patient died in November of that year. Autopsy ruled that the death was due to a particularly aggressive cervical cancer. Had she lived however, she would have needed a complete hip replacement to correct the damage caused by the Therac-25.
In December of 1985, a third woman was burned by a Therac-25 installed in Yakima, Washington. She developed a striped burn pattern on her hip which closely matched the beam blocking strips on the Therac-25. This patient lived, but eventually needed skin grafts to close the wounds caused by radiation burns.
On March 21, 1986, a patient in Tyler, Texas was scheduled to receive his 9th Therac-25 treatment. He was prescribed 180 rads to a small tumor on his back. When the machine turned on, he felt heat and pain, which was unexpected as radiation therapy is usually a painless process. The Therac-25 itself also started buzzing in an unusual way. The patient began to get up off the treatment table when he was hit by a second pulse of radiation. This time he did get up and began banging on the door for help. He received a massive overdose. He was hospitalized for radiation sickness, and died 5 months later.
On April 11th, 1986, a second accident occurred in Tyler, Texas. This time the patient was being treated for skin cancer on his ear. The same operator was running the machine as in the March 21st accident. When therapy started, the patient saw a bright light, and heard eggs frying. He said it felt like his face was on fire. The patient died three weeks later due to radiation burns on the right temporal lobe of his brain and brain stem.
The final overdose occurred much later, this time at Yakima Valley hospital in January, 1987. This patient later died due to his injuries.
Investigation
After each incident, the local hospital physicist would call AECL and the medical regulation bureau in their respective countries. At first AECL denied that the Therac-25 was capable of delivering an overdose of radiation. The machine had so many safeguards in place that it frequently threw error codes and paused treatment, giving less than the prescribed amount of radiation. After the Ontario incident, it was clear that something was wrong. The only way that kind of overdose could be delivered is if the turntable was in the wrong position. If the scanning magnets or X-ray target were not in position, the patient would be hit with a laser-like beam of radiation.
AECL carefully ran test after test and could not reproduce the error. The only possible cause they could come up with was a temporary failure in the three microswitches which determined the turntable’s position. The microswitch circuit was re-designed such that the failure of any one microswitch could be detected by the computer. This modification was quickly added and was in place for the rest of the accidents.
If this story has a hero, it’s [Fritz Hager], the staff physicist at the East Texas Cancer Center in Tyler, Texas. After the second incident at his facility, he was determined to get to the bottom of the problem. In both cases, the Therac-25 displayed a “Malfunction 54” message. The message was not mentioned in the manuals. AECL explained that Malfunction 54 meant that the Therac-25’s computer could not determine if there a underdose OR overdose of radiation.
The same radiotherapy technician had been involved in both incidents, so [Fritz] brought her back into the control room to attempt to recreate the problem. The two “locked the doors” NASA style, working into the night and through the weekend trying to reproduce the problem. With the technician running the machine, the two were able to pinpoint the issue. The VT-100 console used to enter Therac-25 prescriptions allowed cursor movement via cursor up and down keys. If the user selected X-ray mode, the machine would begin setting up the machine for high-powered X-rays. This process took about 8 seconds. If the user switched to Electron mode within those 8 seconds, the turntable would not switch over to the correct position, leaving the turntable in an unknown state.
It’s important to note that all the testing to this date had been performed slowly and carefully, as one would expect. Due to the nature of this bug, that sort of testing would never have identified the culprit. It took someone who was familiar with the machine – who worked with the data entry system every day, before the error was found. [Fritz] practiced, and was eventually able to produce Malfunction-54 himself at will. Even with this smoking gun, it took several phone calls and faxes of detailed instructions before AECL was able to obtain the same behavior on their lab machine. [Frank Borger], staff physicist for a cancer center in Chicago proved that the bug also existed in the Therac-20’s software. By performing [Fritz’s] procedure on his older machine, he received similar error, and a fuse in the machine would blow. The fuse was part of a hardware interlock which had been removed in the Therac-25.
As the investigations and lawsuits progressed, the software for the Therac-25 was placed under scrutiny. The Therac-25’s PDP-11 was programmed completely in assembly language. Not only the application, but the underlying executive, which took the place of an operating system. The computer was tasked with handling real-time control of the machine, both its normal operation and safety systems. Today this sort of job could be handled by a microcontroller or two, with a PC running a GUI front end.
AECL never publicly released the source code, but several experts including [Nancy Leveson] did obtain access for the investigation. What they found was shocking. The software appeared to have been written by a programmer with little experience coding for real-time systems. There were few comments, and no proof that any timing analysis had been performed. According to AECL, a single programmer had written the software based upon the Therac-6 and 20 code. However, this programmer no longer worked for the company, and could not be found.
The Aftermath
The FDA declared the Therac-25 “defective”. AECL issued software patches and hardware updates which eventually allowed the machine to return to service. The lawsuits were settled out of court. It seemed like the problems were solved until January 17th, 1987, when another patient was overdosed at Yakima, Washington. This problem was a new one: A counter overflow. If the operator sent a command at the exact moment the counter overflowed, the machine would skip setting up some of the beam accessories – including moving the stainless steel aiming mirror. The result was once again an unscanned beam, and an overdose. The patient died 3 months later.
It’s important to note that while the software was the lynch pin in the Therac-25, it wasn’t the root cause. The entire system design was the real problem. Safety-critical loads were placed upon a computer system that was not designed to control them. Timing analysis wasn’t performed. Unit testing never happened. Fault trees for both hardware and software were not created. These tasks are not only the responsibility of the software engineers, but the systems engineers on the project. Therac-25 is long gone, but its legacy will live on. This was the watershed event that showed how badly things can go wrong when software for life-critical systems is not properly designed and adequately tested.
Filed under: classic hacks
No comments:
Post a Comment