SOFTWARE FAULT TOLERANCE
9.1 Introduction
Fault-Tolerant Computing Systems:
Systems that can recover from the failure of their hardware or software components and continue to provide uninterrupted real-time service.
Hardware Fault-Tolerance
Characteristics of Hardware
Hardware deteriorates over time: Physical Faults
Low Complexity: No Human Design Errors
Fault-Tolerance
Redundancy: Replication of identical critical components
Conclusion: Very Successful
Examples:
ESS No. 1A Switching System
SIFT & FTMP
Software Fault-Tolerance
Characteristics of Software
Software doesn't deteriorate over time
High Complexity - Human Design Errors
Eliminating all design errors is impossible in practice
Fault-Tolerance
Redundancy: Replication of identical functions
Design Diversity: Independent Designs for the same Functions
Forward Error Recovery:
Identify the error and, based on this knowledge, correct the system state containing the error.
Backward Error Recovery:
Correct the system state by restoring the system to a state which occurred prior to the error.
Recovery Block Schema (RB)
1975 by Brian Randell & Tom Anderson
N-Version Programming (NVP)
1976 by Algirdas Avizienis
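The two recovery styles defined above can be contrasted in a few lines. This is a minimal sketch, not a method taken from the references: the `state` dictionary, the `balance` field, and the rule "a negative balance is clamped to zero" are all assumptions made up for illustration.

```python
import copy

def forward_recover(state):
    """Forward recovery: use knowledge of the error to repair the current state."""
    repaired = dict(state)
    # Assumption: the error is known to be a negative balance, which is clamped to zero.
    if repaired["balance"] < 0:
        repaired["balance"] = 0
    return repaired

def backward_recover(checkpoint):
    """Backward recovery: discard the current state and return to the saved one."""
    return copy.deepcopy(checkpoint)

checkpoint = {"balance": 100, "log": ["open"]}   # state saved before the error
state = copy.deepcopy(checkpoint)
state["balance"] = -40                           # erroneous update
state["log"].append("withdraw")

print(forward_recover(state))        # {'balance': 0, 'log': ['open', 'withdraw']}
print(backward_recover(checkpoint))  # {'balance': 100, 'log': ['open']}
```

Forward recovery keeps the work done since the checkpoint but needs to know what went wrong; backward recovery needs no diagnosis but loses everything after the saved state.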
9.2 Recovery Block Schema (RB)
1. Basic Idea
- N versions of a program for the same function
One primary & N-1 alternates
- Acceptance test to check the results of a version.
if the results are accepted then exit; else
Rollback to restore previous correct state;
Execute next alternate version;
endif
- Primary: Fast, correctness unproved
- Alternates: Slow, correctness proved
- Recovery Point: a saving of the system state
- Try: an execution of a version
- Test point: an execution of the acceptance test
- General Syntax
ensure Acceptance Test
by Primary
else by Alternate 1
else by Alternate 2
...
else by Alternate N-1
else raise error
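The ensure / by / else by structure can be sketched as an ordinary function. This is an illustrative sketch under assumptions, not the original RB notation: `recovery_block`, `RecoveryBlockError`, and the sorting example (a hypothetical primary that wrongly drops duplicates) are invented names.

```python
import copy

class RecoveryBlockError(Exception):
    """Raised when the primary and every alternate fail the acceptance test."""

def recovery_block(state, acceptance_test, versions):
    """Try each version in order, rolling back to the recovery point after a failed try."""
    recovery_point = copy.deepcopy(state)            # establish the recovery point
    for version in versions:                         # primary first, then alternates
        try:
            result = version(state)                  # a "try"
            if acceptance_test(state, result):       # a "test point"
                return result
        except Exception:
            pass                                     # a crash counts as a failed try
        state.clear()
        state.update(copy.deepcopy(recovery_point))  # rollback before the next try
    raise RecoveryBlockError("all versions were rejected")

def primary(state):
    state["calls"] += 1
    return sorted(set(state["data"]))    # hypothetical bug: duplicates are lost

def alternate(state):
    state["calls"] += 1
    return sorted(state["data"])

def accept(state, result):
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and len(result) == len(state["data"])

state = {"data": [3, 1, 3, 2], "calls": 0}
print(recovery_block(state, accept, [primary, alternate]))  # [1, 2, 3, 3]
print(state["calls"])  # 1, because the primary's side effect was rolled back
```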
2. Acceptance Test Design
- Inverse Functions - Numeric Computation
Square-root
- Invariant Relationships
Constant Acceleration in aircraft control
Connectivity in a double-linked list
- Watchdog Timer
Check the execution time of a version to detect an infinite (dead) loop
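The three kinds of acceptance test listed above can each be sketched briefly. All names here (`sqrt_acceptable`, `Node`, `links_consistent`, `run_with_watchdog`) and the tolerance values are assumptions chosen for illustration. The watchdog only abandons a late worker thread; it does not terminate it, which a real system would have to do.

```python
import math
import threading
import time

def sqrt_acceptable(x, y, rel_tol=1e-9):
    """Inverse-function test: squaring the result should give back the input."""
    return y >= 0 and math.isclose(y * y, x, rel_tol=rel_tol, abs_tol=1e-12)

class Node:
    def __init__(self, value):
        self.value, self.prev, self.next = value, None, None

def links_consistent(head):
    """Invariant test: node.next.prev must be node for every node in the list."""
    seen, node = set(), head
    while node is not None and id(node) not in seen:   # stop if the list cycles
        seen.add(id(node))
        if node.next is not None and node.next.prev is not node:
            return False
        node = node.next
    return True

def run_with_watchdog(func, timeout_s):
    """Watchdog test: (True, result) if func finishes in time, else (False, None)."""
    box = {}
    worker = threading.Thread(target=lambda: box.update(result=func()), daemon=True)
    worker.start()
    worker.join(timeout_s)                 # wait at most timeout_s seconds
    if worker.is_alive() or "result" not in box:
        return False, None                 # timed out, or func raised an exception
    return True, box["result"]

print(sqrt_acceptable(2.0, math.sqrt(2.0)))   # True
print(sqrt_acceptable(2.0, 1.5))              # False

a, b = Node("a"), Node("b")
a.next, b.prev = b, a
print(links_consistent(a))                    # True
b.prev = None                                 # corrupt the back pointer
print(links_consistent(a))                    # False

print(run_with_watchdog(lambda: 42, 1.0))                 # (True, 42)
print(run_with_watchdog(lambda: time.sleep(0.5), 0.05))   # (False, None)
```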
3. Implementations
Sequential (Conceptually default)
Concurrent (all versions are executed at the same time and the first acceptable result is used)
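A concurrent recovery block might look like the sketch below, which uses Python's `concurrent.futures`. It assumes the versions are side-effect free, so no rollback is needed; `concurrent_recovery_block` and the integer square-root versions are hypothetical examples. One simplification to note: leaving the `with` block waits for the remaining versions to finish, so the accepted result is returned only after all threads complete.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def concurrent_recovery_block(x, acceptance_test, versions):
    """Start every version at once and return the first result that passes the test."""
    with ThreadPoolExecutor(max_workers=len(versions)) as pool:
        futures = [pool.submit(version, x) for version in versions]
        for future in as_completed(futures):      # results arrive in completion order
            try:
                result = future.result()
            except Exception:
                continue                          # a crashed version is skipped
            if acceptance_test(x, result):
                return result
    raise RuntimeError("no version produced an acceptable result")

def fast_guess(x):
    return x // 2                 # hypothetical primary: quick but usually wrong

def careful_isqrt(x):
    time.sleep(0.05)              # stands in for a slower, more careful alternate
    return math.isqrt(x)

def is_integer_sqrt(x, r):
    return r >= 0 and r * r <= x < (r + 1) * (r + 1)

print(concurrent_recovery_block(49, is_integer_sqrt, [fast_guess, careful_isqrt]))  # 7
```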
9.3 N-Version Programming (NVP)
1. Basic Idea
- N versions of program for the same function.
- The inputs are supplied to all the versions.
- Voter (a decision mechanism) compares all the results to determine a consensus result.
- Architectural View
2. Implementations
Parallel (Conceptually default; all versions are executed in parallel)
Sequential (versions are executed one after another and voting is performed after all versions have completed)
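The voter can be sketched as a strict-majority vote over the version results, run here in the sequential style. This is one possible decision mechanism, not the one prescribed by the NVP literature: it assumes results are hashable and compared for exact equality (floating-point results would need a tolerance), and the three `abs` versions are hypothetical.

```python
from collections import Counter

def majority_vote(results, n_versions):
    """Return the value produced by more than half of the N versions, or raise."""
    if results:
        value, count = Counter(results).most_common(1)[0]
        if 2 * count > n_versions:
            return value
    raise RuntimeError("no majority among version results")

def n_version_execute(x, versions):
    """Feed the same input to every version, then vote on the collected results."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass                      # a crashed version casts no vote
    return majority_vote(results, len(versions))

def abs_v1(x):
    return abs(x)

def abs_v2(x):
    return x if x >= 0 else -x

def abs_v3(x):
    return x                          # hypothetical design fault: wrong for negatives

print(n_version_execute(-5, [abs_v1, abs_v2, abs_v3]))  # 5
```

With three versions, a single faulty version is outvoted; two versions that fail in the same way would defeat the voter, which is why design diversity matters.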
9.4 Fault-Tolerant Concurrent Systems
1. Domino Effect
If communications are not coordinated with recovery points, backward recovery may create an uncontrolled rollback of many processes.
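A small simulation shows how the rollback spreads. The model is an assumption made for illustration: each process has a list of recovery-point times (always including 0), each interaction `(p, q, t)` is a two-way communication at time `t`, and a process that has undone an interaction forces its partner to undo it too. `rollback_lines` is a hypothetical name.

```python
def rollback_lines(recovery_points, interactions, failed, fail_time):
    """Return the time each process is restored to after `failed` detects an error."""
    restored = {p: float("inf") for p in recovery_points}   # inf = not rolled back

    def latest_before(p, t):
        return max(rp for rp in recovery_points[p] if rp < t)

    restored[failed] = latest_before(failed, fail_time)
    changed = True
    while changed:                        # repeat until no rollback forces another
        changed = False
        for p, q, t in interactions:
            for a, b in ((p, q), (q, p)):
                # a has undone the interaction at time t, but b still remembers it
                if restored[a] < t <= restored[b]:
                    restored[b] = latest_before(b, t)
                    changed = True
    return restored

interactions = [("P1", "P2", 1), ("P1", "P2", 3), ("P1", "P2", 5)]

# Uncoordinated recovery points: the rollback cascades to the start.
print(rollback_lines({"P1": [0, 4], "P2": [0, 2]}, interactions, "P1", 7))
# {'P1': 0, 'P2': 0}

# Recovery points taken together at time 4: the rollback stops there.
print(rollback_lines({"P1": [0, 4], "P2": [0, 4]}, interactions, "P1", 7))
# {'P1': 4, 'P2': 4}
```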
2. Conversation Schema
Processes that are members of a conversation may communicate with each other but not with processes outside of the conversation.
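The membership rule can be expressed as a simple check on each message. This sketch covers only the communication restriction stated above; the `Conversation` class and its method are hypothetical names, and the coordinated recovery points and acceptance tests of a full conversation are not modeled.

```python
class Conversation:
    """Tracks which processes belong to a conversation."""

    def __init__(self, members):
        self.members = frozenset(members)

    def may_communicate(self, sender, receiver):
        """A message is allowed only if it does not cross the conversation boundary."""
        return (sender in self.members) == (receiver in self.members)

conv = Conversation({"P1", "P2"})
print(conv.may_communicate("P1", "P2"))  # True: both are members
print(conv.may_communicate("P1", "P3"))  # False: P3 is outside the conversation
print(conv.may_communicate("P3", "P4"))  # True: neither process is a member
```

Because no information leaks across the boundary, a rollback inside the conversation cannot force processes outside it to roll back.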
9.5 Summary
1. Diversity
Random-Diversity:
Different designers and programmers will independently choose distinct approaches when solving a problem.
Enforced Diversity:
Systematically specify diverse data structures and algorithms to be used in the various program versions
Different Programming Teams
Different Design Methodologies
Different Algorithms
Different Programming Languages
References
1. K.H. Kim, "Software Fault Tolerance," Handbook of Software Engineering, Ed. C.R. Vick and C.V. Ramamoorthy, Van Nostrand Reinhold Company, New York, 1984, pp. 437-455.
2. J.P.J. Kelly, T.I. McVittie, and W.I. Yamamoto, "Implementing Design Diversity to Achieve Fault Tolerance," IEEE Software, July 1991, pp. 61-71.
3. R.J. Abbott, "Resourceful Systems for Fault Tolerance, Reliability, and Safety," ACM Computing Surveys, Vol. 22, No. 1, March 1990, pp. 35-68.
4. J.M. Purtilo and P. Jalote, "An Environment for Developing Fault-Tolerant Software," IEEE Trans. on SE, Vol. SE-17, No. 2, February 1991, pp. 153-159.
5. A. Avizienis, "The N-Version Approach to Fault-Tolerant Software," IEEE Trans. on SE, Vol. SE-11, No. 12, December 1985, pp. 1491-1501.
6. P.A. Lee and T. Anderson, Fault Tolerance: Principles and Practice, 2nd Edition, Springer-Verlag Wien, New York, 1990.