Sample Précis – Chapter 1

Steven Zeil

This chapter opens with a discussion of why software reliability is important. Computer systems have an increasing presence in our lives, and their complexity has risen continually even as their applications have become more crucial, raising the impact of system failures.

System failures have been responsible for a number of expensive delays and have even cost lives.

The upcoming chapters of the book are summarized. These are grouped as technical foundations, practices and experiences, and emerging techniques.

The remainder of the chapter deals with the introduction of important concepts and terminology for software reliability engineering.

A failure occurs when the program cannot deliver a desired result. Failures may be classified by severity. A special case of a failure is an outage, in which the service is unavailable.

A fault is the cause of a failure. Faults may be identified (and then removed) or hypothesized.

An error can mean either a discrepancy between an output and the desired output or a conceptual mistake by a human that results in the introduction of a fault.

To summarize, a human commits an error, causing a fault to be embedded in the design and/or code. Because of this fault, some executions of the program are failures.

Reliability measurement requires some notion of time. Time can be measured as execution time, calendar time, or clock time.

A number of failure functions describe reliability over time.

The cumulative failure function (also called the mean value function) denotes the “average cumulative failures associated with each point of time.” The failure intensity function represents “the rate of change of the cumulative failure function.” The failure rate function is defined as “the probability that a failure per unit time occurs in the interval [t, t + Δt], given that a failure has not occurred before t.”[1]
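
As a rough numerical illustration (not from the chapter): given hypothetical cumulative failure counts at successive observation times, the failure intensity over each interval is simply the change in cumulative failures divided by the elapsed time. A minimal Python sketch, with invented data:

    # Sketch: approximating the failure intensity as the rate of change of the
    # cumulative failure function, using hypothetical observation data.
    hours = [0, 10, 20, 30, 40]               # observation times (hours)
    cumulative_failures = [0, 6, 10, 12, 13]  # average cumulative failures seen by each time

    # Intensity over each interval = change in cumulative failures / change in time
    for t0, t1, f0, f1 in zip(hours, hours[1:],
                              cumulative_failures, cumulative_failures[1:]):
        intensity = (f1 - f0) / (t1 - t0)
        print(f"interval [{t0}, {t1}]: intensity = {intensity:.2f} failures/hour")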

The mean time to failure (MTTF) is the “expected time that the next failure will be observed” (a.k.a. MTBF, mean time between failures).

The mean time to repair (MTTR) is the “expected time until a system will be repaired after a failure is observed”. Availability is the “probability that a system is available when needed” and is computed as

Availability = MTTF / (MTTF + MTTR)
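
For example (with hypothetical numbers, not values from the chapter), a system with an MTTF of 500 hours and an MTTR of 2 hours would have an availability of 500 / 502, about 0.996. A quick Python check:

    # Sketch: availability from assumed MTTF and MTTR values.
    mttf = 500.0   # mean time to failure, hours (assumed)
    mttr = 2.0     # mean time to repair, hours (assumed)

    availability = mttf / (mttf + mttr)
    print(f"Availability = {availability:.4f}")   # about 0.996, i.e. 99.6% uptime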

The operational profile of a system is defined as “the set of operations that the software can execute along with the probability with which they will occur”.
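
A minimal sketch of how an operational profile might be represented and used, say to pick the next operation when generating test cases. The operation names and probabilities are invented; the chapter does not prescribe this representation:

    # Sketch: an operational profile as a map from operations to the
    # probabilities with which they occur, used to draw a weighted random
    # choice of the next operation to exercise.
    import random

    operational_profile = {
        "process_payment": 0.50,
        "query_balance":   0.35,
        "generate_report": 0.15,
    }

    operations = list(operational_profile)
    weights = list(operational_profile.values())
    next_operation = random.choices(operations, weights=weights)[0]
    print(next_operation)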

Failure data can be collected as failure counts (failures per time period) or as time between failures (interfailure time). Conversion between the two is possible. Using this data, we can try to estimate or predict the software’s reliability through the use of a reliability model.
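
A small Python sketch of that conversion, turning interfailure times into failure counts per period; the data values and the 10-hour period length are assumptions for illustration:

    # Sketch: converting interfailure times (hours between successive failures)
    # into failure counts per fixed-length observation period.
    interfailure_times = [5.0, 3.0, 9.0, 2.0, 11.0, 4.0]   # hypothetical data
    period_length = 10.0                                    # hours per period

    # Absolute failure times are the running sums of the interfailure times.
    failure_times = []
    elapsed = 0.0
    for gap in interfailure_times:
        elapsed += gap
        failure_times.append(elapsed)

    # Count the failures falling in each period [0,10), [10,20), ...
    num_periods = int(failure_times[-1] // period_length) + 1
    counts = [0] * num_periods
    for t in failure_times:
        counts[int(t // period_length)] += 1
    print(counts)   # failures observed per period, e.g. [2, 2, 0, 2]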

Software failures are modeled as stochastic processes, as are hardware failures. Unlike hardware, software does not wear out. Its reliability usually increases over time rather than decreasing.
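
To illustrate what “stochastic” means here, one could simulate a failure process by drawing interfailure times at random. The exponential distribution and the rate used below are illustrative assumptions only, not a model given by the chapter:

    # Sketch: a simple stochastic failure process, with interfailure times
    # drawn at random from an assumed exponential distribution.
    import random

    failure_rate = 0.01   # assumed failures per hour
    random.seed(1)        # fixed seed so the run is repeatable

    interfailure_times = [random.expovariate(failure_rate) for _ in range(5)]
    print([round(t, 1) for t in interfailure_times])   # hours between failures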

Questions

There seems to be a certain assumption that hardware people get it right more often. Is that true? Is hardware simpler to design and implement?

What’s a “stochastic process”? Are failures really stochastic? Or are they deterministic?

The chapter opened with a diagram showing subsystems passing results from one to another and talked about how the subsystems could be individually reliable, yet the entire system could be unreliable. Does this depend on how finely you decompose the system when you look at it?

[1] Note that, even in a précis, normal conventions of quotation apply. If you use words directly from the source (which should be done only sparingly, though it makes sense for definitions), you must indicate this by appropriate quoting. Failure to do so is plagiarism!