
Software Faults and Fault Injection Models

Abstract:

Software faults can be introduced at any time, in any phase of software development. This paper explores software faults from the perspective of software reliability. Complex software faults occurring in various systems are studied and classified based on their behavior. Various software fault injection and detection models are also studied, and their behavior is summarized.

Some methods for avoiding and detecting software faults are summarized, and various methods of mitigating software faults that cannot be avoided are discussed.

Introduction:

While rapid advances in computing hardware have led to powerful, multi-gigahertz processors, advances in software reliability have not kept pace.[1] Software bugs remain frequent in spite of increasing requirements that software be reliable. Nonstop systems have stringent uptime requirements: they must be kept running even in the face of hardware or software errors, and may need to be monitored, debugged and patched on the fly.[2] While program crashes are problematic enough, undetected errors that silently compromise the results of a computation are perhaps more dangerous.

A failure in a computer-based system that controls critical applications may lead to significant economic losses or even the loss of human lives. The causes of failures in computer-based systems are manifold: physical faults, maintenance errors, design and implementation mistakes resulting in hardware or software defects, and user or operator mistakes.[3] All of these faults are undesired circumstances that hinder the system from delivering the expected service. There are two complementary ways to ensure that a system delivers the expected service: fault prevention, i.e., avoiding the introduction of faults; and fault tolerance, i.e., ensuring that the system delivers its service despite the presence of faults.[4] A fault-tolerant system should tolerate both hardware and software faults, as both categories can have a great impact on it. Furthermore, confidence in a fault-tolerant system’s ability must be established before it is deployed for critical applications.

One attractive approach to establishing confidence in a fault-tolerant system’s capability is fault injection.[5] Fault injection can be used to study the effects of both hardware and software faults. However, in both the academic community and industry, most fault injection studies have focused on the effects of physical hardware faults. Only a few studies have been concerned with software faults, because knowledge of the software faults experienced by systems in the field is limited. As a result, it is difficult to define realistic fault sets to inject, yet a realistic fault set is crucial if a fault injection experiment is intended to quantify a system’s fault tolerance. Consequently, more research is needed in the fault injection area, especially studies targeting software faults and the errors they induce.

This term paper contributes towards fulfilling this need by investigating models of software faults and models of the errors they induce. Models and techniques for emulating representative software faults are also studied and analyzed.

Taxonomy of Software Faults:

Faults that affect software execution include hardware faults that lead to software errors (hardware-induced software errors) and software faults (software design/implementation faults).[1] Faults can be classified into physical faults, design faults and interaction faults.

Physical faults are hardware faults, which may occur in any part of a computer system. Some hardware faults affect program execution directly and thereby affect the software.[6] The resulting errors are called hardware-induced software errors. Hardware faults are classified into memory, CPU, bus and I/O faults.

Memory faults are those that corrupt the contents of a particular memory location. They can occur in text segments and data segments.

CPU faults include computation, control flow, and register faults. From the software viewpoint, all of these faults result in the corruption of registers. The corrupted registers can be general registers or special registers, such as the program counter (PC), the next program counter (nPC), the processor state register (PSR), or the stack pointer (SP).

Bus faults can occur on address lines or data lines. They may affect bits in the instructions or data transmitted through the bus.

I/O faults originate in peripheral devices. Device drivers are designed to handle these exceptional situations.
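A memory or bus fault of the kind described above is commonly emulated in software by inverting a single bit of stored data. The following is a minimal sketch of that idea, not the mechanism of any particular tool; the helper name flip_bit and the sample value are assumptions made for illustration.

```python
def flip_bit(memory: bytearray, byte_offset: int, bit: int) -> None:
    """Emulate a single-bit memory fault by inverting one bit in place."""
    if not 0 <= bit < 8:
        raise ValueError("bit must be in the range 0..7")
    memory[byte_offset] ^= 1 << bit


# A 4-byte "data segment" holding the integer 1000 (little-endian).
data_segment = bytearray((1000).to_bytes(4, "little"))

# Inject the fault: invert the top bit of the second byte.
flip_bit(data_segment, byte_offset=1, bit=7)

corrupted = int.from_bytes(data_segment, "little")
print(corrupted)  # 33768: the program now silently computes with a wrong value
```

Nothing crashes here; the corrupted value is simply used, which is why such faults can stay latent for a long time.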

Design faults are software faults, which can be classified according to their causes or their symptoms.[7] Automatic error logs available in several operating systems usually give information about the error symptoms. Analyses of human-collected error reports, especially from manufacturers, can usually provide insight into the causes. Software faults can be classified into initialization, assignment, condition check, function and documentation faults.

Initialization faults include uninitialized variables and wrongly initialized variables or parameters.[8] The value of an uninitialized variable depends on the language and compiler: in C, for example, a global variable is set to zero, whereas the value of a local variable is unknown. Most uninitialized variables can be detected by a smart compiler. Wrongly initialized parameters are those initialized to incorrect values, for example giving too small a value to a parameter MAXAREASIZE; in this respect they resemble misassigned variables. Incorrect arguments in function calls are also initialization faults, because the arguments are wrongly initialized.
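The effect of a wrongly initialized parameter can be sketched as follows. Only the name MAXAREASIZE comes from the text; the intended value, the function store_readings and the data are hypothetical.

```python
# Fault: the parameter is initialized too small (assume 16 was intended).
MAXAREASIZE = 4


def store_readings(readings, capacity):
    """Copy readings into a fixed-capacity area; anything beyond it is dropped."""
    return list(readings[:capacity])


readings = list(range(10))
stored = store_readings(readings, MAXAREASIZE)
lost = len(readings) - len(stored)
print(lost)  # 6 readings are silently discarded
```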

Assignment faults can be missing assignments or incorrect assignments. The fault in an incorrect assignment may be on the right-hand side, causing one incorrect data value (for example, using x=y+z for x=y+w corrupts x), or on the left-hand side, causing two incorrect data values (for example, using a=b+c for d=b+c corrupts both a and d).
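The difference between the two kinds of incorrect assignment can be checked by comparing a faulty run against a fault-free ("golden") run. This is a small sketch using the variable names from the examples above; the initial values and the helper functions are assumptions.

```python
def run(rhs_fault=False, lhs_fault=False):
    """Execute two assignments, optionally with an injected assignment fault."""
    v = {"a": 1, "b": 2, "c": 3, "d": 0, "w": 5, "x": 0, "y": 10, "z": 20}
    # Intended statement: x = y + w (the right-hand-side fault uses z instead of w)
    v["x"] = v["y"] + (v["z"] if rhs_fault else v["w"])
    # Intended statement: d = b + c (the left-hand-side fault assigns to a instead)
    target = "a" if lhs_fault else "d"
    v[target] = v["b"] + v["c"]
    return v


def corrupted_variables(faulty, golden):
    """Names of variables whose final value differs from the golden run."""
    return sorted(name for name in golden if faulty[name] != golden[name])


golden = run()
rhs_result = corrupted_variables(run(rhs_fault=True), golden)
lhs_result = corrupted_variables(run(lhs_fault=True), golden)
print(rhs_result)  # ['x']
print(lhs_result)  # ['a', 'd']
```

The left-hand-side fault corrupts two values because the wrong variable is overwritten and the intended one is never updated.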

Condition check faults include missing condition checks (for example, failing to check return values) and incorrect condition checks.

Function faults are faults that are not confined to a single statement: they are more complicated, and correcting them involves modifying multiple statements or rewriting a function.
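A missing check of a return value can be illustrated with a lookup routine that reports "not found" through a status code. The functions and data below are hypothetical; the point is only that the unchecked version returns plausible but wrong data instead of failing.

```python
def find_index(names, wanted):
    """Return the position of `wanted`, or -1 when it is absent."""
    for i, name in enumerate(names):
        if name == wanted:
            return i
    return -1


def balance_unchecked(names, balances, wanted):
    # Fault: the -1 "not found" result is used as an index without a check.
    return balances[find_index(names, wanted)]


def balance_checked(names, balances, wanted):
    i = find_index(names, wanted)
    if i < 0:  # the condition check that the faulty version omits
        raise KeyError(wanted)
    return balances[i]


names = ["alice", "bob", "carol"]
balances = [10, 20, 30]

wrong = balance_unchecked(names, balances, "dave")
print(wrong)  # 30: carol's balance is returned for a customer who does not exist
```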

Documentation faults mean that the system messages or documents are incorrect. These faults do not affect program execution.

Interaction faults are faults induced by users or operators. For example, in database systems, a mistake made by the database administrator can cause severe damage to the system or even the loss of vital data.

Software Fault Propagation Models:

Fault propagation models are built for both hardware and software faults. Fault injection has been used to evaluate the dependability of computer systems, but most fault-injection studies concentrate on the final impacts of faults on the system with an emphasis on fault latency and coverage issues.[9] There has not been much research on what happens after a fault is injected and how a fault propagates in a software system.

DIDUCE (Dynamic Invariant Detection ∪ Checking Engine)

DIDUCE (Dynamic Invariant Detection ∪ Checking Engine) is an automatic bug detection tool that dynamically checks invariants in Java applications. DIDUCE instruments Java bytecode to perform dynamic and automatic invariant detection and checking.

A program invariant is a property that is true at a particular program point. Invariants explicate data structures and algorithms and are helpful for programming tasks from design to maintenance. Invariants can be dynamically detected from program traces that capture variable values at program points of interest.

DIDUCE helps in debugging programs that fail on some inputs. It is common for a program that works correctly on many inputs to fail on others. DIDUCE can be used to quickly pinpoint differences in behavior between the successful and the failing runs.

DIDUCE helps in debugging failures in long-running programs by flagging anomalies prior to the failure. Some of the hardest bugs to track down are those that occur only after a program has executed for a long time. Because DIDUCE continually monitors all the variables in the program, it is well suited to locating such errors.

DIDUCE helps in debugging component-based software where a component works in some systems but not in others. DIDUCE can first be trained on other code that uses the same component correctly, and then applied to check the behavior of the component in the context of the new software.

DIDUCE helps in testing programs where the correct output for some inputs is unknown, by training on known input/output pairs and then checking the runs on the unknown inputs. It also aids program evolution by testing whether program modifications affect other portions of the code.

DIDUCE associates invariants with static program points (specific locations in the program’s code). These points are i) program points that read from or write to objects, ii) program points that read from or write to static variables, and iii) procedure call sites. Stack accesses are ignored because of the overhead and because all Java objects are on the heap.

Automatically tracked expressions/invariants include i) the value being read or written, ii) the difference between the old and new values after a write, and iii) the parent object. Users can extend the basic DIDUCE classes to customize their invariant tracking.

Invariants are assigned a confidence level that is a function of the number of successful evaluations. Invariants that have held true for a long time are assigned a high confidence, so the failure of a high-confidence invariant often indicates a bug. DIDUCE was implemented for Java programs using the Byte Code Engineering Library. It was used to test four applications: i) MAJC, a CPU architecture developed at Sun with support for on-chip multiprocessing; ii) Mail Manage, an open-source email management utility; iii) the Java Secure Socket Extension library; and iv) JOEQ, a Java virtual machine system with a just-in-time compiler.[12]
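The idea of learning an invariant at a program point and attaching a confidence to it can be sketched in a few lines. This is not DIDUCE's implementation (which instruments Java bytecode); it is an illustrative model written in Python. The invariant form used here (the bits of a value that have never changed), the class InvariantTracker and the program point name are all assumptions.

```python
class InvariantTracker:
    """Per program point, tracks which bits of the observed value never changed."""

    def __init__(self):
        # program point -> [first value seen, mask of stable bits, successful checks]
        self.points = {}

    def observe(self, point, value):
        """Record a value. Returns None if the invariant held; otherwise returns
        the confidence (number of successful checks) of the invariant that broke."""
        if point not in self.points:
            # First sample: every bit is hypothesized to be invariant.
            self.points[point] = [value, ~0, 0]
            return None
        first, mask, successes = self.points[point]
        changed = (value ^ first) & mask
        if changed == 0:
            self.points[point][2] = successes + 1
            return None
        # Violation: relax the invariant and restart the confidence count.
        self.points[point][1] = mask & ~changed
        self.points[point][2] = 0
        return successes


tracker = InvariantTracker()
samples = [8, 8, 8, 8, 9, 8]
reports = [tracker.observe("Account.balance:write", v) for v in samples]
print(reports)  # [None, None, None, None, 3, None]
```

The fifth sample violates an invariant that had held three times, so it is reported with confidence 3. Afterwards the relaxed invariant accepts both 8 and 9 without further reports.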

FINE (A Fault Injection and Monitoring Environment for Tracing the UNIX System Behavior Under Faults)

The fault injection and monitoring environment (FINE) is a tool for studying fault propagation in the UNIX kernel. FINE injects hardware-induced software errors and software faults into the UNIX kernel and traces the execution flow and key variables of the kernel. FINE consists of a fault injector, a software monitor, a workload generator, a controller, and several analysis utilities. Experiments on SunOS 4.1.2 were conducted by applying FINE to investigate fault propagation and to evaluate the impact of various types of faults. From these experiments, fault propagation models are built for both hardware and software faults, and transient Markov reward analysis is performed to evaluate the loss of performance due to an injected fault.[6] The experimental results show that memory and software faults usually have a very long latency, while bus and CPU faults tend to crash the system immediately. About half of the detected errors are data faults, which are detected when the system tries to access an unauthorized memory location. Only about 8% of faults propagate to other UNIX subsystems. Markov reward analysis shows that the performance loss incurred by bus faults and CPU faults is much higher than that incurred by software and memory faults. Among software faults, the impact of pointer faults is higher than that of non-pointer faults.
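The notion of performance loss in a Markov reward analysis can be illustrated with a toy example. The sketch below is a discrete-time simplification, not the model used in the FINE study; the three states, the transition probabilities and the reward rates are invented for illustration only.

```python
def expected_reward(P, rewards, start, steps):
    """Expected reward accumulated over `steps` time steps of a
    discrete-time Markov chain with transition matrix P."""
    n = len(P)
    dist = [0.0] * n
    dist[start] = 1.0
    total = 0.0
    for _ in range(steps):
        # Reward earned in this step, weighted by the state distribution.
        total += sum(p * r for p, r in zip(dist, rewards))
        # Advance the distribution by one transition.
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return total


# Hypothetical states: 0 = normal, 1 = error present (degraded), 2 = crashed.
P = [
    [1.0, 0.0, 0.0],    # normal stays normal
    [0.5, 0.25, 0.25],  # error: recover, stay degraded, or crash
    [0.5, 0.0, 0.5],    # crashed: reboot succeeds or not
]
rewards = [1.0, 0.5, 0.0]  # useful work delivered per step in each state

steps = 3
with_fault = expected_reward(P, rewards, start=1, steps=steps)
fault_free = expected_reward(P, rewards, start=0, steps=steps)
loss = fault_free - with_fault
print(loss)  # 1.09375
```

Under these made-up numbers, injecting the fault costs about 1.09 units of work over three steps. Comparing such losses across fault types is the kind of conclusion the analysis supports.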

Conclusion:

Software fault propagation is an immature area of research. As more and more complex systems are designed and built, especially safety critical systems, software fault tolerance and the next generation of hardware fault tolerance will need to evolve to solve the design fault problem.[10] Unlike fault tolerance as practiced in other fields, software fault tolerance must cope with design faults and unexpected circumstances, and the need to design it into systems has never been greater. The current generation of software fault tolerance methods cannot adequately compensate for these faults. The next generation of methods will have to include an in-depth look at how to combat the increasing cost of building correct software.[11] These methods will also need to be cost effective enough to be applied to the safety critical systems in which they are required.

The view that software has to have bugs will have to be overcome. If software cannot be made (at least relatively) bug free, then the next generation of safety critical systems will be very flawed. Reliable computing systems, often used as transaction servers and made by companies like Tandem, Stratus, and IBM, have shown that reliable computers can currently be built; however, they have also demonstrated that the cost is significant.

In this term paper, I have introduced basic fault propagation concepts, techniques and tools, and described the types of faults, their manifestation and their behavior. Fault tolerance is, in general, a study of faults and failures: understanding how they behave is the reasonable starting point for stopping their effects, and the techniques and tools surveyed here were developed to probe that behavior and to stop fault propagation. Because most of these techniques and tools were originally created to cope with hardware defects, or are more effective when applied to hardware, software fault tolerance is still less mature than hardware fault tolerance. It is nevertheless drawing increasing attention, as the majority of system defects are shown to be software defects.

References:

[1] J. Durães and H. Madeira, “Characterization of Operating Systems Behavior in the Presence of Faulty Drivers Through Software Fault Emulation,” PRDC 2002 Pacific Rim International Symposium on Dependable Computing, pp. 16–18, December 2002.

[2] H. Madeira, M. Vieira, and D. Costa, “On the Emulation of Software Faults by Software Fault Injection,” IEEE International Conference on Dependable Systems and Networks, pp. 25–28, June 2000.

[3] J. Arlat, Y. Crouzet, and J. Karlsson, “Comparison of Physical and Software-Implemented Fault Injection Techniques,” IEEE Transactions on Computers, pp. 1115–1133, September 2003.

[4] M. Hsueh, T. Tsai, and R. K. Iyer, “Fault Injection Techniques and Tools,” IEEE Computer, pp. 75–82, April 1997.

[5] R. Chillarege, I. S. Bhandari, J. K. Chaar, M. J. Halliday, D. Moebus, B. Ray, and M. Wong, “Orthogonal Defect Classification - A Concept for In-Process Measurement,” IEEE Transactions on Software Engineering, pp. 943–956, November 1992.

[6] W. Kao, R. K. Iyer, and D. Tang, “FINE: A Fault Injection and Monitoring Environment for Tracing the UNIX System Behavior Under Faults,” IEEE Transactions on Software Engineering, pp. 1105–1118, November 1993.