
REPRINTED from Reliability Review, The R & M Engineering Journal,

Volume 23, Number 2, June 2003

SOFTWARE RELIABILITY MEASUREMENT,

or

Life Cycle Core Knowledge Requirements for Software Reliability Measurement

By Norman F. Schneidewind, Ph.D.

The purpose of this article is to make the linkage between the need for the measurement of quality and reliability in the software development life cycle and the body of knowledge that is required to satisfy this need.

Identifying the applicable body of reliability knowledge is the first step in equipping software engineers with the essential skill set.

Because measurement is the key to achieving high reliability software, it is important for software engineers to be knowledgeable in this area. The software engineer applies the body of knowledge to improve the reliability of software throughout the life cycle. In addition, the body of knowledge can serve as a guideline for practitioners, for the licensing of software professionals, and for training in software reliability measurement. Whenever we refer to “measurement”, we mean the measurement of the reliability of the software. Our rationale is that without measurement, software engineers would not be able to achieve high reliability software; thus, a programmed approach to measurement is key to developing reliable software.

The scope of this article is the measurement knowledge and tools that are necessary to ensure the reliability of the software. We focus on reliability because the lack of it could result in significant costs to the supplier in terms of dissatisfied customers, loss of market share, rework caused by rejected and returned systems, and the costs to customers of faulty systems that fail to meet their mission goals.

The benefit for software engineers of this article is to identify the knowledge and skills that are required to advance the measurement component of software engineering from a craft to a profession. Rather than focus on the coding phase of the development process, as has been the case historically, it is important to identify how measurement can be applied throughout the process and to key the requisite knowledge to the process phases. This approach is important for three reasons. First, early detection and resolution of reliability problems can save considerable time and money in software development. Second, product and process measurements must be integrated so that the interaction between the two can be assessed throughout the life cycle. Third, software engineers must have comprehensive knowledge of the role of measurement in contributing to the development of high reliability products and the processes that produce them.

Some of the required knowledge can be obtained through formal education while other knowledge can only be obtained by on-the-job experience. Preferably, this experience would be obtained in a mentoring program, under the guidance of a certified reliability engineer. We use applicable standards as another resource for deriving knowledge requirements. We focus on knowledge areas where there is the largest gap between existing knowledge and the use of the knowledge by practicing software engineers. These areas are mostly quantitative in nature. The evidence of this gap is the need for various standardization efforts that quantify software reliability and metrics [AIA93, IEE98].

Two Approaches to Identifying Knowledge Requirements

There are two approaches to identifying the knowledge that is required to plan and implement a software reliability program. One approach is issue oriented, as shown in Table 1. The other is life cycle phase oriented, as shown in Figure 1. The two approaches are compatible but different views of achieving the same objective and have been provided to show the software engineer why (issue oriented) and when (phase oriented) the need for measurement arises. A case study that addresses many of the issues and life cycle factors that we describe here can be found in a report on the NASA Space Shuttle software development and maintenance process [BIL94].

Issue Oriented

Issues arise because there are important considerations in achieving software reliability goals at acceptable cost. Using this approach, the relationships among issues, functions, and knowledge requirements are shown in Table 1. This table shows some of the important functions that would be performed by the software engineer in executing a life cycle reliability management plan, oriented to the issues in the first column; this table is not exhaustive. Using a consensus approach, the issues were identified by approximately two hundred contributors and balloters who developed the ANSI/IEEE 1061 Standard for a Software Quality Metrics Methodology [IEE98] and the ANSI/AIAA Recommended Practice for Software Reliability [AIA93]. Standards working groups and balloting groups are drawn from a wide variety of industry, government, and academic participants. These contributors and balloters concluded that it is essential to address these issues if a software developer is to be successful in producing high reliability software. Based on several case studies and the opinions of experts in software reliability measurement [BIL94, LYU96, SCH99, ZUS98], we identified the interdisciplinary skills that are required to address the issues and to perform the functions listed in Table 1; these are shown in the “knowledge” column.

Table 1: Knowledge Requirements in Software Reliability Measurement
Issue / Function / Knowledge
1. Goals: What reliability goals are specified for the system? / Analyze reliability goals and specify reliability requirements. / Reliability Engineering; Requirements Engineering
2. Cost and risk: What is the cost of achieving reliability goals and the risk of not doing so? / Evaluate economics and risk of reliability goals. / Economic Analysis; Risk Analysis
3. Context: What application and organizational structure are the system and software to support? / Analyze the application environment. / Systems Analysis; Software Design
4. Operational profile: What are the criticality and frequency of use of the software components? / Analyze the software environment. / Probability and Statistical Analysis
5. Models: What is the feasibility of creating or using an existing reliability model for assessment and prediction, and how can the model be validated? / Model reliability and validate the model. / Probability and Statistical Models
6. Data requirements: What data are needed to support reliability goals? / Define data type, phase, time, and frequency of collection. / Data Analysis
7. Types and granularity of measurements: What measurement scales should be used, what level of detail is appropriate to meet a given goal, and what can be measured quantitatively, qualitatively, or judgmentally? / Define the statistical properties of the data. / Measurement Theory
8. Product and process test and evaluation: How can product reliability measurements be fed back to improve process quality? / Analyze the relationship between product reliability and process stability. / Inspection and Test Methods
9. Product reliability and process quality prediction: What types of predictions should be made? / Assess and predict product reliability and process quality. / Measurement Tools

Descriptions of Measurement Functions and Knowledge Requirements

The following are brief examples of the functions and knowledge requirements listed in Table 1. Our purpose is to explain to the software engineer the functions and the knowledge requirements that address the issues in Table 1. In addition, we describe techniques that can be employed to implement each of the functions. For example, with respect to Issue 1, we could interview key personnel and examine documentation in carrying out the function of “analyzing reliability goals and specifying reliability requirements”.


Issue 1: Goals

What reliability goals are specified for the system?

A quality requirement is defined as "A requirement that a software attribute (i.e., reliability) be present in software to satisfy a contract, standard, specification, or other formally imposed document" [IEE98]. We analyze the software reliability goals of the organization in order to understand how to specify the software reliability requirements. This analysis could consist of interviewing key personnel in the organization and examining documentation that addresses reliability goals. For example, a goal in a safety critical system could be that no fault or failure would jeopardize the mission or cause loss of life.

Reliability Engineering

Metrics are identified and data collection plans are developed for satisfying reliability goals. Criteria are identified for measuring and interpreting conformance with reliability requirements during inspection and testing. A complete methodology for implementing a software quality (e.g., reliability) metrics plan in an organization can be found in IEEE Standard 1061 [IEE98].

Requirements Engineering

One of the factors in specifying requirements is assessing the risk of introducing a new requirement into the system or changing an existing requirement. Risk is present because the introduction of a change may decrease the reliability and maintainability of the software. For example, the NASA Space Shuttle Flight Software organization performs a risk analysis on all requirements changes. To assess the risk of a change, the software development contractor uses a number of risk factors, such as the size and location of the change, the criticality of the change, the number of change modifications, and the resources (personnel and tools) required to make the change. No requirements change is approved by the change control board without an accompanying risk assessment. During risk assessment, the development contractor attempts to answer such questions as: “Is this change highly complex relative to other software changes that have been made on the Shuttle?” If so, a high-risk value is assigned for the complexity criterion. A useful book on software requirements analysis and specification is by Davis [DAV93].
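As a rough illustration of this style of assessment, the individual risk factors can be combined into a single weighted score. This is a minimal sketch only: the factor names, weights, ratings, and review threshold below are hypothetical assumptions, not the Shuttle contractor's actual method.

```python
# Hypothetical weighted risk score for one requirements change.
# Weights and the 1 (low) .. 5 (high) ratings are illustrative assumptions.
weights = {"size": 0.3, "criticality": 0.4, "complexity": 0.2, "resources": 0.1}
ratings = {"size": 2, "criticality": 4, "complexity": 5, "resources": 3}

# Weighted sum yields an overall score on the same 1..5 scale.
risk_score = sum(weights[f] * ratings[f] for f in weights)

# A change scoring at or above a (hypothetical) threshold gets flagged
# for extra scrutiny by the change control board.
high_risk = risk_score >= 3.5
```

A real assessment would also record the rationale for each rating, since the board reviews the justification along with the score.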

Another approach to getting requirements right is to use rapid prototyping. The rationale is that a quick development and implementation of a sample of the software system will uncover flaws in the requirements prior to investing large amounts of time and money in the development of the system [DAV93].

Issue 2: Cost and risk

What is the cost of achieving reliability goals and the risk of not doing so?

For example, we would evaluate the tradeoff between higher levels of reliability and the additional testing required to achieve them, which increases the cost of testing.

Economic Analysis

A common error in economic analysis of software systems is to limit the evaluation to total cost. An important criterion is marginal cost in relation to the marginal benefit that accrues by adding an increment of reliability to the software. For example, continuing with the reliability example, this involves comparing the marginal increase in reliability that is obtained through increased testing with the marginal increase in the cost of testing. Theoretically, testing can stop when the two quantities are equal. However, because it is difficult to convert increases in reliability into dollars, we use the concept of the maximum Reliability/Cost Ratio (RCR), which is the ratio of relative increase in reliability to relative increase in cost, as given by the following equation:

RCR = max [(ΔR/Rt)/(ΔC/Ct)], (1)

where ΔR is the change in reliability from test time t to t+Δt, Rt is the reliability at time t, ΔC is the change in cost of testing from t to t+Δt, and Ct is the cost of testing at t, for increasing values of t [SCH971]. Testing continues until the criterion of equation (1) is achieved. The concept of marginal cost is covered in courses on microeconomics.
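Equation (1) can be applied directly to observed test data. In the sketch below, the reliability and cumulative cost series are invented for illustration; in practice they would come from a fitted reliability model and project cost records.

```python
# Reliability/Cost Ratio per test increment: (ΔR/Rt) / (ΔC/Ct), equation (1).
# The data points are hypothetical illustration values.
reliability = [0.90, 0.95, 0.975, 0.980]   # R at t, t+Δt, t+2Δt, ...
cost = [100.0, 150.0, 225.0, 340.0]        # cumulative cost of testing

def rcr(r_t, r_next, c_t, c_next):
    """Relative reliability gain divided by relative cost increase."""
    return ((r_next - r_t) / r_t) / ((c_next - c_t) / c_t)

ratios = [rcr(reliability[i], reliability[i + 1], cost[i], cost[i + 1])
          for i in range(len(reliability) - 1)]

# Per equation (1), the increment with the maximum ratio marks the point
# beyond which marginal reliability gains no longer justify marginal cost.
best_increment = ratios.index(max(ratios))
```

With these illustrative numbers the ratio declines with each increment, which is the typical pattern: early testing buys reliability cheaply, later testing does not.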

Risk Analysis

Continuing the reliability example, criterion (1) may be insufficient for some systems. In a safety critical system, for example, it may be necessary to achieve a higher reliability level. In this case, additional risk criteria may be invoked. If we define our reliability goal as the reduction of failures that would cause loss of life, loss of mission, or abort of mission to an acceptable level of risk, then for software to be ready to deploy, after having been tested for total time t, we must satisfy the following criteria [SCH971]:

1) predicted remaining failures r(t)<rc, (2)

where rc is a specified critical value, and

2) predicted time to next failure TF(t)>tm, (3)

where tm is mission duration.

For systems that are tested and operated continuously, t, TF(t), and tm are measured in execution time. If the criteria (2) and (3) are not met after testing for time t, we continue to test until the criteria are met or until it is no longer economically feasible to continue to test.
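The two criteria can be expressed as a simple readiness check. The predicted values fed into such a check would come from a validated reliability model; the numbers used in the example below are hypothetical.

```python
def ready_to_deploy(r_t, r_c, tf_t, t_m):
    """Criteria (2) and (3): predicted remaining failures r(t) < rc
    and predicted time to next failure TF(t) > mission duration tm."""
    return r_t < r_c and tf_t > t_m

# Hypothetical example: rc = 1 remaining failure, mission duration tm = 8
# (execution-time units, for a continuously tested and operated system).
print(ready_to_deploy(0.6, 1.0, 12.0, 8.0))   # both criteria met
print(ready_to_deploy(1.4, 1.0, 12.0, 8.0))   # too many predicted failures
```

If the check fails after total test time t, testing continues, exactly as the text describes, until both criteria hold or further testing is no longer economically feasible.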

Issue 3: Context

What application and organizational structure is the system and software to support?

This function involves analyzing, identifying, and classifying the application domain (e.g., safety critical system, COTS product). The application domain has implications for reliability requirements: reliability and safety would be major considerations for a safety critical system, whereas time to market could be the major consideration for a COTS product, with reliability playing a lesser role.

Systems Analysis

A system is defined as "An interdependent group of people, objects, and procedures constituted to achieve defined objectives or some operational role or performing specified functions. A complete system includes all of the associated equipment, facilities, material, computer programs, firmware, technical documentation, services, and personnel required for operations and support to the degree necessary for self-sufficient use in its intended environment" [IEE96]. Thus, a system comprises more than software. In order to specify reliability requirements intelligently, we must understand the context in which the software will function. Although it is not necessary to know how to perform systems analysis, it is important to appreciate the influence of different hardware configurations (standalone, network, distributed, fault-tolerant) on software reliability requirements. For example, a network would have higher reliability and availability requirements than a standalone system, because unreliable operation in a network could affect thousands of users simultaneously, whereas it would affect only a single user at a time in standalone operation. In addition, software engineers must be sensitive to the quality-of-service needs of the user organization.

Software Design

Increasingly, software designs (and analyses) are being expressed in object-oriented paradigms. In these situations, it is important for software engineers to learn the Unified Modeling Language [MUL97] and its various design diagrams (e.g., use case, scenario, sequence, state-chart), not only as a design tool but also as a means of identifying where critical reliability requirements exist in the system. For example, describing user interactions with the objects of a system in use case scenarios and identifying the states and state transitions with state-charts can assist in identifying the operational profile (see below).
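As a concrete illustration, execution counts gathered from use case scenarios can be turned into an operational profile, i.e., the relative frequency of use of each function. The use case names and counts below are hypothetical, loosely evoking flight software functions.

```python
from collections import Counter

# Hypothetical execution counts per use case, as might be tallied from
# scenario walkthroughs or operational logs.
executions = Counter({"ascent": 120, "orbit_ops": 600, "entry": 120, "abort": 4})
total = sum(executions.values())

# Operational profile: the probability that a randomly selected execution
# exercises each function. Test effort is typically allocated in proportion,
# with extra weight for rarely used but safety-critical functions (abort).
profile = {use_case: count / total for use_case, count in executions.items()}
```

Note the tension the profile exposes: "abort" is almost never executed, yet criticality demands it be tested far beyond its raw frequency, which is why Table 1 pairs the operational profile with criticality analysis.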