SOFTWARE FAULT TOLERANCE

9.1 Introduction

Fault-Tolerant Computing Systems:

Systems capable of recovering from failures of their hardware or software components in order to provide uninterrupted real-time service.

Hardware Fault-Tolerance

Characteristics of Hardware

Hardware deteriorates over time: Physical Faults

Low Complexity: No Human Design Errors

Fault-Tolerance

Redundancy: Replication of identical critical components

Conclusion: Very Successful

Examples:

ESS No. 1A Switching System

SIFT & FTMP

Software Fault-Tolerance

Characteristics of Software

Software doesn't deteriorate over time

High Complexity - Human Design Errors

Elimination of all the design errors - impossible

Fault-Tolerance

Redundancy: Replication of identical functions

Design Diversity: Independent Designs for the same Functions

Forward Error Recovery:

Identify the error and, based on this knowledge, correct the system state containing the error.

Backward Error Recovery:

Correct the system state by restoring the system to a state which occurred prior to the error.
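
The two recovery strategies can be contrasted in a minimal sketch. The state layout, the valid range, and all function names below are assumptions made for illustration only; they do not come from any particular system.

```python
import copy

VALID_RANGE = (0.0, 100.0)  # assumed valid range for the reading


def is_erroneous(state):
    low, high = VALID_RANGE
    return not (low <= state["reading"] <= high)


def forward_recover(state):
    """Forward recovery: use knowledge of the error to repair the state in place."""
    low, high = VALID_RANGE
    # The error is known to be an out-of-range reading, so clamp it.
    state["reading"] = min(max(state["reading"], low), high)
    return state


def backward_recover(checkpoint):
    """Backward recovery: discard the current state and restore a saved one."""
    return copy.deepcopy(checkpoint)


state = {"reading": 42.0, "step": 1}
checkpoint = copy.deepcopy(state)       # state saved before the update

state.update(reading=250.0, step=2)     # faulty update corrupts the state
print(is_erroneous(state))              # True

repaired = forward_recover(copy.deepcopy(state))
restored = backward_recover(checkpoint)
print(repaired)   # {'reading': 100.0, 'step': 2}
print(restored)   # {'reading': 42.0, 'step': 1}
```

Forward recovery keeps the progress made (step 2) but needs to know what went wrong; backward recovery needs no diagnosis but loses the work done since the saved state.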

Recovery Block Schema (RB)

1975 by Brian Randell & Tom Anderson

N-Version Programming (NVP)

1976 by Algirdas Avizienis

9.2 Recovery Block Schema (RB)

1. Basic Idea

·N versions of a program for the same function

One primary & N-1 alternates

·Acceptance test to check the results of a version.

    if the results are accepted then
        exit;
    else
        rollback to restore the previous correct state;
        execute the next alternate version;
    endif

·Primary: Fast, correctness unproved

·Alternates: slow, correctness proved

·Recovery Point: a saving of the system state

·Try: an execution of a version

·Test point: an execution of the acceptance test

  • General Syntax

    ensure   Acceptance Test
    by       Primary
    else by  Alternate 1
    else by  Alternate 2
    ...
    else by  Alternate N-1
    else     raise error
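
The ensure/by/else-by structure can be sketched as a small Python function. This is a minimal illustration, not a standard API: `recovery_block`, `RecoveryBlockError`, and the sorting example (a deliberately faulty primary and a simple alternate) are all hypothetical names chosen here. The state is assumed to be a dictionary that each version modifies in place.

```python
import copy


class RecoveryBlockError(Exception):
    """Raised when the primary and every alternate fail the acceptance test."""


def recovery_block(state, acceptance_test, versions):
    recovery_point = copy.deepcopy(state)     # establish the recovery point
    for version in versions:                  # primary first, then alternates
        try:
            version(state)                    # a "try": execute one version
            if acceptance_test(state):        # test point
                return state
        except Exception:
            pass                              # a crash counts as a failed try
        # Rollback: restore the state saved at the recovery point.
        state.clear()
        state.update(copy.deepcopy(recovery_point))
    raise RecoveryBlockError("all versions failed the acceptance test")


def buggy_primary(state):
    state["data"] = sorted(set(state["data"]))   # fast, but drops duplicates


def simple_alternate(state):
    data = state["data"]                         # plain insertion sort
    for i in range(1, len(data)):
        j = i
        while j > 0 and data[j - 1] > data[j]:
            data[j - 1], data[j] = data[j], data[j - 1]
            j -= 1


state = {"data": [3, 1, 2, 3]}
expected_len = len(state["data"])


def acceptance_test(s):
    d = s["data"]
    return len(d) == expected_len and all(a <= b for a, b in zip(d, d[1:]))


result = recovery_block(state, acceptance_test, [buggy_primary, simple_alternate])
print(result["data"])   # [1, 2, 3, 3]
```

The primary's output is rejected because one element is missing, the state is rolled back, and the alternate then produces the accepted result.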

2. Acceptance Test Design

  • Inverse Functions - Numeric Computation

Square-root

  • Invariant Relationships

Constant Acceleration in aircraft control

Connectivity in a double-linked list

  • Watchdog Timer

Check the execution time of a version to detect an infinite loop
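
The three kinds of acceptance test above can be sketched as follows. All names (`sqrt_is_acceptable`, `Node`, `list_is_connected`, `finishes_in_time`) and the tolerances are assumptions for illustration. The list check assumes an acyclic list, and the watchdog only detects an overrun: Python offers no safe way to kill the stuck thread.

```python
import math
import threading


def sqrt_is_acceptable(x, y, rel_tol=1e-9):
    """Inverse-function test: squaring the result must give back the input."""
    return y >= 0 and math.isclose(y * y, x, rel_tol=rel_tol, abs_tol=1e-12)


class Node:
    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None


def list_is_connected(head):
    """Invariant test: every forward link is matched by a backward link."""
    if head is None:
        return True
    if head.prev is not None:
        return False
    node = head
    while node.next is not None:          # assumes the list has no cycle
        if node.next.prev is not node:
            return False
        node = node.next
    return True


def finishes_in_time(func, timeout_s):
    """Watchdog test: report whether func returns within timeout_s seconds."""
    worker = threading.Thread(target=func, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return not worker.is_alive()


print(sqrt_is_acceptable(2.0, math.sqrt(2.0)))   # True
print(sqrt_is_acceptable(2.0, 1.5))              # False

a, b, c = Node(1), Node(2), Node(3)
a.next, b.prev, b.next, c.prev = b, a, c, b
print(list_is_connected(a))                      # True
c.prev = a                                       # corrupt one backward link
print(list_is_connected(a))                      # False

stuck = threading.Event()
print(finishes_in_time(lambda: None, 0.5))       # True
print(finishes_in_time(stuck.wait, 0.1))         # False: still blocked at the deadline
stuck.set()                                      # release the blocked worker
```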

3. Implementations

Sequential (the conceptual default)

Concurrent (all versions are executed at the same time and first acceptable result is used.)
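
A concurrent implementation might look like the sketch below, which uses Python's `concurrent.futures`. The function `concurrent_recovery_block` and the integer square-root versions are hypothetical. Each version is assumed to work on its own copy of the input, so no rollback is needed. Note that leaving the `with` block waits for the remaining versions to finish, and that Python threads stand in here for truly parallel processors.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def concurrent_recovery_block(x, acceptance_test, versions):
    """Run all versions at once and return the first acceptable result."""
    with ThreadPoolExecutor(max_workers=len(versions)) as pool:
        futures = [pool.submit(version, x) for version in versions]
        for future in as_completed(futures):      # results in completion order
            try:
                result = future.result()
            except Exception:
                continue                          # a crashed version is ignored
            if acceptance_test(x, result):
                return result
    raise RuntimeError("no version produced an acceptable result")


def fast_primary(x):
    return int(x ** 0.5) + 1          # deliberately wrong: off by one


def slow_alternate(x):
    r = 0
    while (r + 1) ** 2 <= x:          # linear search for floor(sqrt(x))
        r += 1
    return r


def acceptable(x, r):
    return r * r <= x < (r + 1) ** 2


result = concurrent_recovery_block(10, acceptable, [fast_primary, slow_alternate])
print(result)   # 3
```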

9.3 N-Version Programming (NVP)

1. Basic Idea

  • N versions of program for the same function.
  • The inputs are supplied to all the versions.
  • Voter (a decision mechanism) compares all the results to determine a consensus result.
  • Architectural View
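
The basic idea can be sketched with a simple majority voter. This is one possible decision mechanism, assumed here for illustration; `majority_vote` and `n_version_execute` are hypothetical names. The voter compares results for exact equality, so outputs must be hashable, and floating-point outputs would need a tolerance-based comparison instead.

```python
from collections import Counter


def majority_vote(results, n):
    """Return the value produced by a strict majority of the n versions."""
    if results:
        value, count = Counter(results).most_common(1)[0]
        if 2 * count > n:
            return value
    raise RuntimeError("no majority among version outputs")


def n_version_execute(versions, *inputs):
    results = []
    for version in versions:              # the same inputs go to every version
        try:
            results.append(version(*inputs))
        except Exception:
            pass                          # a crashed version casts no vote
    return majority_vote(results, n=len(versions))


# Three independently written versions of "largest element"; the third is faulty.
versions = [
    lambda xs: max(xs),
    lambda xs: sorted(xs)[-1],
    lambda xs: xs[0],
]
print(n_version_execute(versions, [3, 9, 4]))   # 9
```

The faulty version is outvoted two to one, so its error is masked without any rollback.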

2. Implementations

Parallel (the conceptual default: all versions are executed in parallel.)

Sequential (versions are executed one after another and voting is performed after all versions are completed.)
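
The two implementations differ only in how the versions are scheduled; the voting step is the same. A minimal sketch, with hypothetical names (`vote`, `run_sequential`, `run_parallel`) and a thread pool standing in for separate processors:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def vote(results):
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(results):
        raise RuntimeError("no majority")
    return value


def run_sequential(versions, x):
    # Versions run one after another; voting starts after the last one ends.
    return vote([version(x) for version in versions])


def run_parallel(versions, x):
    # Each version runs in its own worker; voting starts once all have returned.
    with ThreadPoolExecutor(max_workers=len(versions)) as pool:
        futures = [pool.submit(version, x) for version in versions]
        return vote([future.result() for future in futures])


# Three versions of absolute value; the third is faulty for negative inputs.
versions = [
    abs,
    lambda x: x if x >= 0 else -x,
    lambda x: x,
]
print(run_sequential(versions, -5))   # 5
print(run_parallel(versions, -5))     # 5
```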

9.4 Fault-Tolerant Concurrent Systems

1. Domino Effect

If communications are not coordinated with recovery points, backward recovery may create an uncontrolled rollback of many processes.
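
The domino effect can be reproduced with a small simulation. The model below is a simplifying assumption: each communication is a symmetric interaction at a given time, and if one party rolls back past an interaction, the other party must roll back past it too. The function `rollback_positions` and the timelines are hypothetical, and every process is assumed to have a recovery point at time 0.

```python
def rollback_positions(recovery_points, interactions, failed, now):
    """Return the time each process is rolled back to after `failed` fails at `now`.

    recovery_points: {process: list of times at which its state was saved}
    interactions:    list of (time, process_a, process_b) communications
    """
    def latest_point_before(proc, t):
        return max(p for p in recovery_points[proc] if p < t)

    position = {proc: now for proc in recovery_points}
    position[failed] = latest_point_before(failed, now)

    changed = True
    while changed:                        # repeat until no process must move
        changed = False
        for t, a, b in interactions:
            for undone, other in ((a, b), (b, a)):
                # `undone` has rolled back past the interaction at time t,
                # but `other` still depends on it, so `other` rolls back too.
                if position[undone] < t < position[other]:
                    position[other] = latest_point_before(other, t)
                    changed = True
    return position


interactions = [(1, "P1", "P2"), (3, "P1", "P2"), (5, "P1", "P2"), (7, "P1", "P2")]

uncoordinated = rollback_positions(
    {"P1": [0, 2, 6], "P2": [0, 4]}, interactions, failed="P1", now=8)
print(uncoordinated)   # {'P1': 0, 'P2': 0}

coordinated = rollback_positions(
    {"P1": [0, 4], "P2": [0, 4]}, interactions, failed="P1", now=8)
print(coordinated)     # {'P1': 4, 'P2': 4}
```

With staggered recovery points, a single failure drives both processes back to their initial states; with recovery points taken at the same time, the rollback stops at time 4.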

2. Conversation Schema

Processes that are members of a conversation may communicate with each other but not with processes outside of the conversation.
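
The membership rule can be sketched as a small class. The `Conversation` class and its methods are hypothetical. The `close` step (all members pass their acceptance test and leave together, or all roll back to the recovery points saved on entry) follows the usual description of the conversation scheme and is an assumption beyond what is stated above.

```python
import copy


class Conversation:
    """A communication boundary shared by a fixed set of processes."""

    def __init__(self, states):
        # states: {process name: state dict}. Every member saves a recovery
        # point on entry, so the recovery points are coordinated.
        self.states = states
        self.recovery_points = copy.deepcopy(states)

    def send(self, sender, receiver, key, value):
        if sender not in self.states or receiver not in self.states:
            raise PermissionError("communication across the conversation boundary")
        self.states[receiver][key] = value

    def close(self, acceptance_test):
        """All members pass the test and leave together, or all roll back."""
        if all(acceptance_test(s) for s in self.states.values()):
            return True
        for name, saved in self.recovery_points.items():
            self.states[name].clear()
            self.states[name].update(copy.deepcopy(saved))
        return False


states = {"P1": {"x": 1}, "P2": {"x": 2}}
conversation = Conversation(states)
conversation.send("P1", "P2", "x", -5)           # allowed: both are members

try:
    conversation.send("P1", "P3", "x", 0)        # P3 is outside the conversation
except PermissionError as error:
    print(error)

accepted = conversation.close(lambda s: s["x"] >= 0)
print(accepted)        # False: P2 failed its test, so every member rolled back
print(states["P2"])    # {'x': 2}
```

Because no information leaves the conversation, the rollback is confined to its members and cannot trigger a domino effect in outside processes.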

9.5 Summary

1. Diversity

Random-Diversity:

Different designers and programmers will independently choose distinct approaches when solving a problem.

Enforced Diversity:

Systematically specify diverse data structures and algorithms to be used in the various program versions

Different Programming Teams

Different Design Methodologies

Different Algorithms

Different Programming Languages [4]

References

1. K.H. Kim, "Software Fault Tolerance," Handbook of Software Engineering, Ed. C.R. Vick and C.V. Ramamoorthy, Van Nostrand Reinhold Company, New York, 1984, pp. 437-455.

2. J.P.J. Kelly, T.I. McVittie, and W.I. Yamamoto, "Implementing Design Diversity to Achieve Fault Tolerance," IEEE Software, July 1991, pp. 61-71.

3. R.J. Abbott, "Resourceful Systems for Fault Tolerance, Reliability, and Safety," ACM Computing Surveys, Vol. 22, No. 1, March 1990, pp. 35-68.

4. J.M. Purtilo and P. Jalote, "An Environment for Developing Fault-Tolerant Software," IEEE Trans. on SE, Vol. SE-17, No. 2, February, 1991, pp. 153-159.

5. A. Avizienis, "The N-Version Approach to Fault-Tolerant Software," IEEE Trans. on SE, Vol. SE-11, No. 12, December 1985, pp. 1491-1501.

6. P.A. Lee and T. Anderson, Fault Tolerance: Principles and Practice, 2nd Edition, Springer-Verlag Wien, New York, 1990.