The Perils of Race Conditions

CAUTION: May cause loss of power or loss of life

Page 1 of 1

2 Replies - 493 Views - Last Post: 18 September 2008 - 08:51 PM

#1 akozlik


The Perils of Race Conditions

Posted 18 September 2008 - 04:26 PM

I was reading through my Enterprise Computing slides and came across a really interesting tangent about race conditions. If you know about multi-threading, you know about race conditions. Basically, a race condition occurs when two or more threads access shared data without proper synchronization, so the result depends on the order in which the threads happen to run. One thread may update a piece of data while another thread reads the old value or overwrites the update.
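For anyone who wants to see it happen, here is a minimal Java sketch of a lost update. It is my own illustration, not something from the slides, and the class and method names (`CounterRace`, `run`) are made up. Several threads increment a plain `int` and an `AtomicInteger` side by side; the plain field can lose increments because `++` is a separate read, add, and write.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; the names are made up for this sketch.
class CounterRace {
    static int unsafeCount = 0;                                  // plain shared field, no synchronization
    static final AtomicInteger safeCount = new AtomicInteger();  // atomic read-modify-write

    // Starts `threads` workers that each increment both counters `perThread` times.
    static void run(int threads, int perThread) {
        unsafeCount = 0;
        safeCount.set(0);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    unsafeCount++;               // read, add, write: the steps can interleave
                    safeCount.incrementAndGet(); // one indivisible step
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join(); // after join, the worker's writes are visible to this thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
    }

    public static void main(String[] args) {
        run(4, 100_000);
        System.out.println("expected: " + 400_000);
        System.out.println("unsafe:   " + unsafeCount + " (may be lower, varies per run)");
        System.out.println("safe:     " + safeCount.get());
    }
}
```

The unsafe total changes from run to run and may even come out right by luck, which is exactly why this kind of bug is hard to catch in testing.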

Now for the tangent:

Quote

You may remember the large North American power blackout that occurred on August 14, 2003. Roughly 50 million people lost electrical power in a region stretching from Michigan through Canada to New York City. It took three days to restore service to some areas.

There were several factors that contributed to the blackout, but the official report highlights the failure of the alarm monitoring software which was written in C++ by GE Energy. The software failure wrongly led operators to believe that all was well, and precluded them from rebalancing the power load before the blackout cascaded out of control.

Because the consequences of the software failure were so severe, the bug was analyzed exhaustively. The root cause was finally identified by artificially introducing delays in the code (just like we did in the previous example). There were two threads that wrote to a common data structure, and through a coding error, they could both update it simultaneously. It was a classic race condition, and eventually the program “lost the race”, leaving the structure in an inconsistent state. That in turn caused the alarm event handler to spin in an infinite loop, instead of raising the alarm. The largest power failure in the history of the US and Canada was caused by a race condition bug in some threaded C++ code. Java is equally vulnerable to this kind of bug.
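To make the quoted failure mode concrete, here is a rough Java sketch of two threads writing to a common structure, with an artificial delay to widen the race window as the quote describes. This is only my guess at the general shape of such a bug, not the actual alarm software, and every name in it (`AlarmLog`, `addUnsafe`, `addSafe`, `runTwoWriters`) is hypothetical.

```java
// Hypothetical sketch; not the real alarm system code.
class AlarmLog {
    private final String[] events = new String[8];
    private int size = 0;

    // Unsynchronized: the gap between reading `size` and writing it back is the race window.
    void addUnsafe(String event, long delayMillis) {
        int slot = size;
        pause(delayMillis);      // artificial delay that widens the window
        events[slot] = event;
        size = slot + 1;
    }

    // Same steps, but only one thread at a time may run them.
    synchronized void addSafe(String event, long delayMillis) {
        int slot = size;
        pause(delayMillis);
        events[slot] = event;
        size = slot + 1;
    }

    synchronized int size() {
        return size;
    }

    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs two writers against one log and returns how many events it claims to hold.
    static int runTwoWriters(boolean safe, long delayMillis) {
        AlarmLog log = new AlarmLog();
        Runnable first = () -> {
            if (safe) log.addSafe("line trip", delayMillis);
            else log.addUnsafe("line trip", delayMillis);
        };
        Runnable second = () -> {
            if (safe) log.addSafe("overload", delayMillis);
            else log.addUnsafe("overload", delayMillis);
        };
        Thread a = new Thread(first);
        Thread b = new Thread(second);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writers", e);
        }
        return log.size();
    }

    public static void main(String[] args) {
        System.out.println("unsafe size: " + runTwoWriters(false, 50)); // usually 1: one event overwritten
        System.out.println("safe size:   " + runTwoWriters(true, 50));  // 2
    }
}
```

Without the delay the unsafe version almost always appears to work, so the injected pause is what turns a rare failure into a reproducible one.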


Also, race conditions have caused people to die:

Quote

Starting in 1976, the Therac-25 treatment system, built by Atomic Energy of Canada Limited (AECL) and CGR MeV of France, was used to fight cancer by providing radiation to a specific part of the body in the hope of destroying tumors.

Six known Therac-25 accidents have been documented. All involved massive overdoses of radiation, and three resulted in the death of the patient; serious long-term injury and disfigurement occurred in the other cases. Patients received an estimated 17,000 to 25,000 rads to very small body areas. By comparison, doses of 1,000 rads can be fatal if delivered to the whole body.

Analysis determined that the primary cause of the overdoses was faulty software. The software was written in assembly language and was developed and tested by the same person. The software included a scheduler and concurrency in its design. When the system was first built, operators complained that it took too long to enter the treatment plan into the computer. As a result, the software was modified to allow operators to quickly enter treatment data by simply pressing the Enter key when an input value did not require changing.

This modification created a synchronization error (a race condition) between the code that read the data entered by the operator and the code controlling the machine. As a result, the actions of the machine would lag behind the commands the operator entered. The machine appeared to administer the dose entered by the operator, but in fact had an improper setting that focused radiation at full power on a tiny spot on the body.

The race condition was subsequently found to occur only when a certain non-typical keystroke sequence was entered (an “X” to select the 25 MeV photon mode, followed by “cursor-up”, “E” to correctly set the 25 MeV electron mode, then “Enter”). Since this sequence of keystrokes did not occur very often, the error went unnoticed for a long time.

AECL was ultimately cited for improperly testing the software, which was only tested on site in hospitals after a machine was assembled in place.

The designer had reused software from the older Therac-6 and Therac-20 models, which had hardware interlocks that masked the software defects. Some operators noted that certain situations caused the machines to display MALFUNCTION followed by a number between 1 and 64 on the display screen. However, the user manual did not explain or even mention the error codes, so the operators pressed the “P” key (for proceed) to override the warning and proceed with the treatment.
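The "machine acts on settings the operator is still editing" problem can be sketched in Java too. To be clear, this is not the Therac-25 code (the quote says that was assembly); it is a hypothetical illustration of the same class of bug, and the names (`BeamSettings`, `Setup`, `countMixedReads`) and the power values 1 and 100 are invented placeholders. An editor thread changes a mode and its matching power as two separate writes, and also as one atomic swap of an immutable object, while a reader counts how often it sees a mismatched pair.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only; the modes and power values are placeholders.
class BeamSettings {
    // Immutable pair: a mode together with the power level that is valid for it.
    static final class Setup {
        final String mode;
        final int power;

        Setup(String mode, int power) {
            this.mode = mode;
            this.power = power;
        }

        boolean consistent() {
            return (mode.equals("PHOTON") && power == 100)
                || (mode.equals("ELECTRON") && power == 1);
        }
    }

    // Racy layout: two fields that are written one after the other.
    static volatile String mode = "PHOTON";
    static volatile int power = 100;

    // Safer layout: one reference that is replaced in a single atomic step.
    static final AtomicReference<Setup> current =
        new AtomicReference<>(new Setup("PHOTON", 100));

    // Returns {mismatches seen via the two fields, mismatches seen via the reference}.
    static int[] countMixedReads(int edits) {
        Thread operator = new Thread(() -> {
            for (int i = 0; i < edits; i++) {
                boolean electron = (i % 2 == 0);
                String newMode = electron ? "ELECTRON" : "PHOTON";
                int newPower = electron ? 1 : 100;
                mode = newMode;                            // a reader can land between
                power = newPower;                          // these two writes
                current.set(new Setup(newMode, newPower)); // published all at once
            }
        });
        int mixedUnsafe = 0;
        int mixedSafe = 0;
        operator.start();
        while (operator.isAlive()) {
            if (!new Setup(mode, power).consistent()) mixedUnsafe++; // two separate reads
            if (!current.get().consistent()) mixedSafe++;            // one snapshot
        }
        try {
            operator.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for operator", e);
        }
        return new int[] {mixedUnsafe, mixedSafe};
    }

    public static void main(String[] args) {
        int[] mixed = countMixedReads(1_000_000);
        System.out.println("mismatched reads, separate fields: " + mixed[0] + " (varies per run)");
        System.out.println("mismatched reads, atomic snapshot: " + mixed[1]);
    }
}
```

Note that `volatile` alone does not help the two-field version: each field is individually up to date, but nothing ties the pair together. Publishing one immutable object is one common way to keep related values consistent.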


Replies To: The Perils of Race Conditions

#2 Programmist


Re: The Perils of Race Conditions

Posted 18 September 2008 - 08:45 PM

I think everyone who's taken an intro to programming course is aware of what a race condition is and has heard of these famous software blunders, but thanks for the public service announcement. :)

#3 WolfCoder


Re: The Perils of Race Conditions

Posted 18 September 2008 - 08:51 PM

Getting threads to work together hasn't been too hard for me, but it is annoying when memory does that to me.
