Software Safety Basics - Michigan Technological University
Software Safety Basics (Herrmann, Ch. 2)
CS 3090: Safety Critical Programming in C

Patriot missile defense system failure
On February 25, 1991, a Patriot missile defense system operating at Dhahran, Saudi Arabia, during Operation Desert Storm failed to track and intercept an incoming Scud. This Scud subsequently hit an Army barracks, killing 28 Americans. [GAO]
http://news.bbc.co.uk/1/shared/spl/hi/middle

Patriot: A software failure
[A] software problem in the system's weapons control computer led to an inaccurate tracking calculation that became worse the longer the system operated. At the time of the incident, the battery had been operating continuously for over 100 hours. By then, the inaccuracy was serious enough to cause the system to look in the wrong place for the incoming Scud. [GAO]

Tracking a missile: what should happen
Search: Wide range scanned
When missile detected, range gate calculates the next area to scan
Validation, Tracking: Only range-gated area scanned

Software design flaw
Range gate calculates predicted position from:
  Time of last radar detection: integer, measuring tenths of seconds
  Known velocity of missile: floating-point value
Problem:
  Range gate used 24-bit registers, and each 0.1-second time increment added a little error
  Over time, this error became significant enough to cause range gate to miscalculate missile position

What actually happened
Range-gated area shifted, no longer accurate
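
The way a tiny per-increment error grows with uptime can be made concrete with a short calculation. The sketch below rests on assumptions that are not in these slides: that 0.1 was stored as a binary fraction chopped to 23 bits after the binary point (the slides only say "24-bit registers"), and that a Mach 5 target moves at very roughly 1700 m/s. The function names are hypothetical.

```c
#include <assert.h>

/* ASSUMPTION: 23 fractional bits; the slides only state "24-bit registers". */
#define FRACTION_BITS 23
/* The clock counts tenths of a second, so 36000 increments per hour. */
#define TICKS_PER_HOUR (3600L * 10L)

/* The value actually stored for 0.1: chopped (not rounded) to
 * FRACTION_BITS binary places. */
static double stored_tenth(void)
{
    const double scale = (double)(1L << FRACTION_BITS);
    return (double)(long)(0.1 * scale) / scale;
}

/* Seconds by which the computed time lags true time after `hours`
 * of continuous operation. */
double clock_drift_seconds(double hours)
{
    assert(hours >= 0.0);
    double per_tick_error = 0.1 - stored_tenth();
    return per_tick_error * (double)TICKS_PER_HOUR * hours;
}

/* Distance a target moving at `speed_m_per_s` covers during that drift. */
double range_gate_shift_m(double hours, double speed_m_per_s)
{
    return clock_drift_seconds(hours) * speed_m_per_s;
}
```

Under these assumptions, clock_drift_seconds(8.0) comes to roughly 0.03 s, while clock_drift_seconds(100.0) comes to roughly 0.34 s. At about 1700 m/s, that is the difference between a shift of tens of meters and a shift of several hundred meters.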

Sources of the problem
Patriot designed for use against slower (Mach 2) missiles, not Scuds (Mach 5)
Patriot system typically used in short intervals no longer than 8 hours
  Supposed to be mobile, quick on/off, to avoid detection
Proper calibration not performed, largely due to fear that adding an external recorder could crash the system(!)

Ariane 5 failure
On 4 June 1996, the maiden flight of the Ariane 5 launcher ended in a failure. Only about 40 seconds after initiation of the flight sequence, at an altitude of about 3700 m, the launcher veered off its flight path, broke up and exploded.
http://www.vuw.ac.nz/staff/stephen_m

Ariane 5: A software failure
Unexpectedly large values encountered during alignment of inertial platform
-> Attempt to convert overly large 64-bit value into a 16-bit value
-> Software exception
-> Guidance system (hardware) shutdown
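
The narrowing conversion at the heart of this chain is easy to reproduce in C, where converting an out-of-range floating-point value to an integer type is undefined behavior rather than an exception, so the check has to be written explicitly. The following is a minimal sketch of two possible policies. The function names are hypothetical, and this is a C illustration, not the flight code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Policy 1: refuse the conversion and let the caller decide what to do.
 * Returns true and writes *out only when `value` fits in 16 bits. */
bool to_int16_checked(double value, int16_t *out)
{
    if (out == NULL) {
        return false;
    }
    /* Written so that NaN is rejected too: any comparison with NaN is false. */
    if (!(value >= (double)INT16_MIN && value <= (double)INT16_MAX)) {
        return false;
    }
    *out = (int16_t)value;   /* in range, so the cast is well defined */
    return true;
}

/* Policy 2: clamp to the nearest representable value.
 * Mapping NaN to 0 is an arbitrary choice made for this sketch. */
int16_t to_int16_saturating(double value)
{
    if (value != value) {            /* true only for NaN */
        return 0;
    }
    if (value > (double)INT16_MAX) {
        return INT16_MAX;
    }
    if (value < (double)INT16_MIN) {
        return INT16_MIN;
    }
    return (int16_t)value;
}
```

Whether rejecting or clamping is the safer reaction depends on what the value is used for; neither is automatically correct, which is why the decision belongs in the design rather than in a default handler.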

Sources of the problem
Alignment code reused from (smaller, less powerful) Ariane 4
  Velocity values of Ariane 5 were out of range of Ariane 4
Ironically, alignment not even needed after lift-off!
Why was alignment code running?
  Engineers decided to leave it running for 40 seconds after planned lift-off time
  Permitting easy restart if launch was put on hold briefly

Panama Cancer Institute accidents (Gage & McCormick, 2004)
November 2000: 27 cancer patients given massive doses of radiation
  Partly due to flaws in Multidata software
Medical physicists who used the software were found guilty of 2nd degree murder in Panama
Note: In the well-known Therac-25 incidents of the 1980s, software failures led to massive doses of radiation being administered to patients. Do we ever learn?...

Multidata software
Used to plan radiation treatment
  Operator enters patient data
  Operator indicates placement of blocks (metal shields used to protect sensitive areas) through graphical editor
  Software provides 3D prediction of where radiation would be distributed
  From this data, dosage is determined

Block placement editor
(Figure source: NRC Information Notice 2001-08, Supp. 2)
Blocks drawn as separate polygons
(There are 2 blocks in this picture)
Software limitation: At most 4 blocks
What if doctors want to use more blocks?

A solution
(Figure source: NRC Information Notice 2001-08, Supp. 2)
Note: This is a single unbroken line
Software treated it as a single block
Now you can draw more blocks!

Fatal problem
Dosage prediction algorithm expected blocks in the form of polygons, but graphical editor allowed non-polygons
When run on non-polygon blocks, predictions were drastically wrong; overly high dosages prescribed

What is software safety?
Features and procedures which ensure that a product performs predictably under normal and abnormal conditions, and the likelihood of an unplanned event occurring is minimized and its consequences controlled and contained; thereby preventing accidental injury or death, whether intentional or unintentional. (Herrmann)
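
The block editor in the Multidata case shows what "performs predictably under abnormal conditions" can mean in code: an outline that is not a proper polygon should be rejected before it ever reaches the dosage calculation. The sketch below is one conservative way to do that, assuming outlines arrive as a closed list of vertices. The Point type and function names are hypothetical, and nothing here reflects how the actual product worked.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { double x, y; } Point;

/* Sign of the turn a -> b -> c: +1 counter-clockwise, -1 clockwise, 0 collinear. */
static int orientation(Point a, Point b, Point c)
{
    double v = (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
    return (v > 0.0) - (v < 0.0);
}

/* For c collinear with segment ab: does c lie within the segment's extent? */
static bool in_bounding_box(Point a, Point b, Point c)
{
    double min_x = a.x < b.x ? a.x : b.x, max_x = a.x < b.x ? b.x : a.x;
    double min_y = a.y < b.y ? a.y : b.y, max_y = a.y < b.y ? b.y : a.y;
    return c.x >= min_x && c.x <= max_x && c.y >= min_y && c.y <= max_y;
}

/* True if segments pq and rs share any point (crossing, touching, overlapping). */
static bool segments_touch(Point p, Point q, Point r, Point s)
{
    int o1 = orientation(p, q, r), o2 = orientation(p, q, s);
    int o3 = orientation(r, s, p), o4 = orientation(r, s, q);
    if (o1 != o2 && o3 != o4) return true;
    if (o1 == 0 && in_bounding_box(p, q, r)) return true;
    if (o2 == 0 && in_bounding_box(p, q, s)) return true;
    if (o3 == 0 && in_bounding_box(r, s, p)) return true;
    if (o4 == 0 && in_bounding_box(r, s, q)) return true;
    return false;
}

/* Accept an outline only if no two non-adjacent edges share a point.
 * Vertex n-1 is implicitly joined back to vertex 0.
 * Limitation: adjacent edges are not compared, so degenerate outlines
 * (repeated vertices, zero-width spikes) would need further checks. */
bool is_simple_polygon(const Point *pts, size_t n)
{
    if (pts == NULL || n < 3) {
        return false;
    }
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 2; j < n; j++) {
            if (i == 0 && j == n - 1) {
                continue;   /* first and last edge meet at vertex 0 */
            }
            if (segments_touch(pts[i], pts[(i + 1) % n],
                               pts[j], pts[(j + 1) % n])) {
                return false;
            }
        }
    }
    return true;
}
```

A single outline that runs around one block, along a connecting path to a second block, and back along the same path fails this test, because the two passes over the connecting path overlap. Rejecting input is only half of the feature; the editor would also need to tell the operator why the outline was refused.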

Features and procedures
Features: built into the software itself
  Range checks; monitors; warnings/alarms
Procedures: concern the proper environment for the software, and its proper use
  Computer hardware that the software runs on
  Physical, mechanical components of environment
  Human users
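
As an illustration of the "features" side, a monitor can combine a range check with graded warnings and alarms. The sketch below is a minimal example with hypothetical type names, function name, and limits; real limits would come from the hazard analysis for the system in question.

```c
#include <stddef.h>

typedef enum { READING_OK, READING_WARNING, READING_ALARM } ReadingStatus;

/* Hypothetical limit set: the warning band lies inside the alarm band. */
typedef struct {
    double warn_low, warn_high;     /* outside this band: warning */
    double alarm_low, alarm_high;   /* outside this band: alarm   */
} Limits;

/* Classify one reading. Anything that cannot be shown to be inside the
 * alarm band (including NaN or a missing limit set) is treated as an alarm. */
ReadingStatus check_reading(double value, const Limits *lim)
{
    if (lim == NULL ||
        !(value >= lim->alarm_low && value <= lim->alarm_high)) {
        return READING_ALARM;
    }
    if (value < lim->warn_low || value > lim->warn_high) {
        return READING_WARNING;
    }
    return READING_OK;
}
```

The check is phrased as "alarm unless proven in range" so that unexpected inputs fall on the cautious side rather than slipping through as normal readings.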

Normal and abnormal conditions
Abnormal conditions:
  Failure of hardware components
  Power outage
  Extreme environmental conditions (temperature, velocity)
What to do?
  Not necessarily the best reaction, but one that has the best chance of preventing injury or death
  Fail-safe: shut down
  Fail-operational: continue in simpler degraded mode

Avoiding unplanned events
To Herrmann, human users are the primary source of such events
  Can produce unusual inputs or combinations of inputs
User interface design, testing can be crucial to software safety
  Understand user behavior
  Create interfaces that guide users toward good input

Terminology alert #1
There are many definitions of safety
Herrmann thinks of safety as a set of features and procedures
  Something you can actually see in the software
Leveson: freedom from accidents or losses
  This is an idealized property of the software: something to aim for rather than actually achieve
Storey distinguishes safety from adequate safety
  Here, safety is close to Leveson's definition; adequate safety is closer to Herrmann's definition

Fault, error and failure
[Diagram: Fault -> Error -> Failure]

Fault, error and failure: Example
[Diagram: Fault -> Error -> Failure]

Faults: Hardware vs. software
Some hardware faults may be random
  Due to manufacturing defects or simple wear and tear
  Probability can be estimated statistically
  Well-known techniques to minimize random faults: error-correcting codes, redundant systems
Software faults are always systematic, not random
  Generated during design or specification, not execution
  Software is not manufactured and doesn't wear out
  Techniques for minimizing random faults don't work with systematic faults

Fault management options
Avoidance: Prevent faults from entering the system during the design phase
  Good practices in design, e.g. programming standards
Removal: Find faults in the system before release
  Testing: costly and not always very effective

Fault management options
Tolerance: Find faults in operational system after release, allow system to proceed correctly
Recovery blocks: Create duplicate code modules
  Run primary module, then run an acceptance test
  If test fails, roll back changes and run an alternative module
N-version programming: several independent implementations of a program
  Goal: ensure design diversity, avoid common faults
Both approaches are costly, and may not be very effective
  For a study on whether N-version programming really achieves design diversity, read Knight & Leveson

Model of system failure behavior
[State diagram. States: Perfect, OK, Erroneous, Fail Operational, Fail Safe, Known Safe State, Innocuous Failure, Dangerous Failure, Unknown or Dangerous State. Transition labels: fault not introduced, fault introduced, fault removed, error detected, error not detected.]

Terminology alert #2
fault and error have many alternative definitions
Sometimes, error is a synonym for what we're calling fault, and fault means behavior that may trigger a failure
Following these alternative definitions, we have: error -> fault -> failure

References
United States General Accounting Office. Report IMTEC-92-26, February 4, 1992. http://www.fas.org/spp/starwars/gao/im92026.htm
Ariane 5 Flight 501 Failure Report by the Inquiry Board. July 19, 1996. http://sunnyday.mit.edu/accidents/Ariane5accidentreport.html
U.S. Nuclear Regulatory Commission. Update on radiation therapy overexposures in Panama. NRC Information Notice 2001-08, Supp. 2, November 20, 2001. http://www.hsrd.ornl.gov/nrc/special/IN200108s2.pdf
D. Gage and J. McCormick. Why software quality matters. Baseline, March 2004, 33-56. http://www.baselinemag.com/print_article2/0,1217,a=120920,00.asp
Nancy G. Leveson. Safeware: System Safety and Computers. Addison Wesley, 1995.
Neil Storey. Safety-Critical Computer Systems. Prentice Hall, 1996.
J.C. Knight and N.G. Leveson. An experimental evaluation of the assumption of independence in multiversion programming. IEEE Transactions on Software Engineering 12(1), 1986, 96-109.