An empirical study of introducing the Failure Mode and Effect Analysis technique to Norwegian business critical software developers
Department of Computer and Information Science,
Norwegian University of Science and Technology
Torgrim Lauritsen, Tor Stålhane
Telephone: 73594427, Fax: 73594466,
{torgrim.lauritsen, tor.stalhane}@idi.ntnu.no
Abstract
This article describes an experiment with three Norwegian IT companies that develop business critical software. The goal of the experiment was to evaluate whether it is beneficial to use safety analysis techniques when developing business critical software. The participants in the experiment tried to identify possible failure modes from a class diagram. Half of the participants used the Failure Mode and Effect Analysis (FMEA) method, which is widely used in the development of safety critical systems, while the other participants used ad hoc brainstorming. The number of failure modes identified is used as an indicator of the effectiveness of each technique. Our experiment showed that the participants who used ad hoc brainstorming wanted a method that could help them reveal more problems. The participants who used the FMEA method found it useful because it was easy to understand and helped them identify failure modes in a structured way.
1. Introduction
In the current business climate, companies in every industry, large or small, must have some kind of data protection as part of their business continuity plan [1]. It is therefore important that software developers consider how they can reduce product risk in the software, so that their customers can avoid loss of assets such as vital information, reputation and money.
The extensive use of computers and software has drastically improved the functionality and efficiency of many companies, but has also made software systems a significant risk factor for those companies [2]. Risk is defined as the product of an event’s consequence and its probability of occurrence or as its hazard level (severity and likelihood of an occurrence) combined with 1) the likelihood of the hazard leading to an accident and 2) hazard exposure or duration [3].
Our starting point is to look at safety analysis techniques that are used to assess the risk associated with using the system, and to prevent accidents from happening in the system. The techniques analyse why accidents occur; that is, the mechanisms that drive the processes leading to unacceptable losses, and they determine the approaches we can take to prevent such accidents [4].
Just as for general safety, business-safety is not a characteristic of the system alone; it is a characteristic of the system's interactions with its environment. Safety is freedom from unacceptable risk of physical injury or damage to the health of people, damage to property or to the environment [5]. Business critical software is safe when it does not fail in such a way that it causes a mishap [6] that results in financial loss, for instance through damaged reputation or business interruption. The biggest threat to business is business interruption, followed by reputational risk [7].
We designed an experiment to compare the Failure Mode and Effect Analysis (FMEA) technique with ad hoc brainstorming. Our goal was to study what effect FMEA has on the process of developing business critical software. In the experiment we asked the participants to identify possible failure modes in a system based on a class diagram. The identified failure modes can be used in later development phases as additional safety requirements and as a basis for testing. They can also be used to mitigate or eliminate failures by building in compensating measures such as redundancy, alarms or barriers that help prevent the failures from arising.
We have to be aware of the fact that software may be highly reliable and correct [4] but still be unsafe if the software:
· correctly implements the requirements but the specified behavior is unsafe from a system perspective;
· has requirements that do not specify some particular behavior required for system safety (that is, they are incomplete);
· has unintended and thus unsafe behavior beyond what is specified in the requirements.
Unfortunately, meeting safety requirements is not as simple as meeting a set of written specifications [8]. The design effort needed to make a system safe is one of a series of coordinated activities needed to assure that the final product will be safe. We believe that developers of business critical software must, in addition to satisfying the functional requirements, also add safety requirements to their solution [9, 10]; otherwise, the software will undermine the prospects for creating value and delivering profits to businesses [7].
The rest of this paper is organized as follows: First we give a short description of the FMEA technique. Thereafter we describe the experiment and the results from the experiment. Finally we conclude the paper and discuss some further work.
2. What is the Failure Mode and Effect Analysis (FMEA)?
The Failure Mode and Effect Analysis (FMEA) is a method that is widely used for reliability analysis of systems, subsystems, and individual system components [11]. FMEA was introduced in 1954 and formalized in 1968. It has been used with success for many years in safety-critical systems such as avionics, trains and nuclear plants, and in the process industry. FMEA allows a systematic analysis of possible hazards and failures, and also allows us to assess the effects of these hazards and failures on the components of a system.
In object oriented software development, the components can, for instance, be classes and their methods [12]. A method is formally a part of the object structure, and as long as all methods of an object execute in accordance with their specification, the object has not failed. Conversely, when a method does not execute in accordance with its specification, the object has failed. The failure effect will depend on the conditions under which the method failed. For example, consider the class diagram shown in figure 1, where objects are uniquely characterized by their methods. Analysing and searching for failure modes in a class diagram using FMEA is done by filling out the FMEA table shown in table 1.
Class / Method | Failure mode | Effects of failure | Action or barriers | Severity
Customer.creditRating() | creditRating is too high | Customer places orders for more than he can pay for | 1) Manual check when setting or changing credit rating; 2) Implement function to obtain credit rating from external sources | High
Customer.creditRating() | creditRating is too low | Customer is not allowed to buy as much as he wants and can pay for | | Medium
Customer.creditRating() | No creditRating | The company can lose a lot of money selling goods to customers who will not be able to pay for them | | High
Table 1. An FMEA table for creditRating()
In the FMEA table we start by identifying which class and which method we are going to analyse. Thereafter we try to identify the possible failure modes. In this example, for the creditRating() method, we found three failure modes: the credit rating is too high, the credit rating is too low and no credit rating is performed. In the next column we describe what effects these failure modes can have. In the column after that we try to identify possible actions and barriers (countermeasures) that prevent these failure modes from arising. Last, but not least, we need to prioritize the identified failure modes so that we know which one is the most critical and thus where to start.
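The table-filling and prioritization steps described above can be sketched in code. The following Python fragment is only an illustration under stated assumptions and was not part of the experiment: the names Severity, FmeaRow and prioritize are hypothetical, and the three-level severity scale is an assumption, since table 1 itself only shows the values Medium and High.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    # Assumed three-level scale; table 1 itself only shows Medium and High.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class FmeaRow:
    # One row of an FMEA table (hypothetical structure mirroring the columns of table 1).
    cls: str
    method: str
    failure_mode: str
    effect: str
    severity: Severity
    actions: List[str] = field(default_factory=list)


def prioritize(rows: List[FmeaRow]) -> List[FmeaRow]:
    # Most severe first; rows of equal severity keep their original order.
    return sorted(rows, key=lambda row: row.severity, reverse=True)


# The three failure modes of creditRating() taken from table 1.
table = [
    FmeaRow("Customer", "creditRating()", "creditRating is too high",
            "Customer places orders for more than he can pay for",
            Severity.HIGH,
            ["Manual check when setting or changing credit rating",
             "Implement function to obtain credit rating from external sources"]),
    FmeaRow("Customer", "creditRating()", "creditRating is too low",
            "Customer is not allowed to buy as much as he wants and can pay for",
            Severity.MEDIUM),
    FmeaRow("Customer", "creditRating()", "No creditRating",
            "The company can lose a lot of money selling goods to customers "
            "who will not be able to pay for them",
            Severity.HIGH),
]

for row in prioritize(table):
    print(f"{row.severity.name:<6} {row.cls}.{row.method}: {row.failure_mode}")
```

Sorting on severity alone is the simplest possible prioritization; a team could equally well combine severity with an estimate of likelihood, which this sketch does not attempt.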
The FMEA method is easy to understand and easy to use. The developers will be able to identify and document possible failure modes, and will be able to implement failure mitigation solutions based on the actions and barriers in the FMEA table, which will help to avoid asset losses and thus lead to more business safe software. In design, FMEA serves two roles. Firstly, it helps us to identify possible hazards and failure modes associated with the system. Secondly, it helps to verify that all failure modes leading to hazardous events or mishaps are mitigated by the design modifications made to the system [7].
The most important part of the FMEA process is a systematic walk-through of components to identify possible failure modes such as "fails to operate on demand", "calculates a wrong result", etc. Since each failure can produce a different effect, depending on the level at which it is detected, it is important to analyse each method in a class. Using FMEA will not make it cheaper to develop software, at least not in the short term. Applying FMEA to increase the products' business-safety must be viewed as an investment. The return on investment will be software products with higher quality, which in turn will lead to more business from existing customers and new business from new customers. In addition, we will have less need for fire-fighting. The workload will be larger at the beginning of the project, but this extra effort will reduce the rework needed later in the project, since latent hazards are identified early and the developers can use their new knowledge to limit, reduce or eliminate them.
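The systematic walk-through mentioned at the start of this paragraph can be illustrated with a small Python sketch that generates one blank FMEA row for every combination of method and generic failure mode. This is an assumption-laden illustration, not the procedure used in the experiment: the function name worksheet and the dictionary layout are hypothetical, and since figure 1 is not reproduced here, the Order.totalPrice() entry is invented purely for illustration. Only Customer.creditRating() and the two generic failure modes come from the text.

```python
# Generic failure modes quoted in the text; a real analysis would extend this list.
GENERIC_FAILURE_MODES = [
    "fails to operate on demand",
    "calculates a wrong result",
]

# Hypothetical excerpt of a class model: class name -> method names.
# Customer.creditRating() is the example from the text;
# Order.totalPrice() is invented for illustration only.
CLASS_MODEL = {
    "Customer": ["creditRating()"],
    "Order": ["totalPrice()"],
}


def worksheet(class_model, failure_modes):
    # Yield one blank FMEA row per (class, method, failure mode) combination,
    # so that no method is skipped during the walk-through.
    for cls, methods in class_model.items():
        for method in methods:
            for mode in failure_modes:
                yield {
                    "class": cls,
                    "method": method,
                    "failure_mode": mode,
                    "effect": "",    # to be filled in by the analysts
                    "actions": "",   # to be filled in by the analysts
                    "severity": "",  # to be filled in by the analysts
                }


rows = list(worksheet(CLASS_MODEL, GENERIC_FAILURE_MODES))
for row in rows:
    print(f'{row["class"]}.{row["method"]}: {row["failure_mode"]}')
```

The generated rows are only a starting point. The analysts still have to judge which generic failure modes are meaningful for each method, add method-specific ones such as "creditRating is too high", and fill in effects, actions and severity.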
3. The experiment
3.1 Research approach
We wanted to evaluate the effect FMEA could have in a business critical software development environment. Our experiment was designed as an exploratory and qualitative study. The goal of the experiment was to see if the participants would
· be willing to use the FMEA technique instead of their current ad hoc brainstorming to help them develop more business safe software.
· profit from using the FMEA technique.
· involve the customers in the FMEA.
· be convinced that using the FMEA technique leads to more business-safe software.
The experiment was carried out in June 2005 in three Norwegian IT companies. Two of the companies are IT consultancies, and the third is a privately held company that has its own software development department. In each company four software developers participated; they have worked in the IT industry for two to thirty years.
All of the participants are familiar with the Rational Unified Process (RUP), and most of them work in accordance with that methodology in their daily work. Only one of them is familiar with agile methods and uses test driven development in his daily work.
We have the following research questions:
RQ1: Did the FMEA help the developers to find more failure modes?
RQ2: Did the developers find FMEA useful?
RQ3: Did the developers believe that they would profit from using FMEA?
RQ4: Did the developers want to involve the customers in the FMEA work?
RQ1 can give us an indication of how useful FMEA is for identifying possible hazards and problems compared to the techniques the developers use today. RQ2 gives us a subjective assessment of how effective the participants felt FMEA was, and we will compare these answers to possible issues the ad hoc brainstorming group missed in the experiment. We know that introducing a new technique like FMEA in software development will lead to an extra learning effort and more work. In RQ3 we want to see if the FMEA participants would use FMEA despite the increased work effort. Motivated by the success of XP, which advocates having the customer on-site the whole time, we ask in RQ4 whether the customers could help by participating in the failure mode analysis, since they have the domain knowledge.
We offered a short introduction to safety analysis and a copy of this article as compensation to the companies involved. We emphasized that all answers and other information would be treated as strictly confidential.
3.2 Experimental methods and procedures
In each company we started the experiment by dividing the four software developers into two groups, later called A and B, with two persons in each group. We gave group A an introduction to safety analysis of design diagrams, while group B (the FMEA group) filled in a background questionnaire. When group A had finished the introduction and group B had filled in the questionnaire, the groups switched tasks. Group B got an introduction to the FMEA technique and to the importance of considering safety issues during software development, while group A filled in the background questionnaire.
In both cases we showed the participants the class diagram in figure 1 and guided them through an example of the analysis based on the creditRating() method in the Customer class.
Group A received a list of possible failure modes and consequences for the creditRating() method:
· The credit rating is too high, i.e. customers can order more goods than they can afford to pay for.
· The credit rating is too low and customers might feel rejected and go to another store.
· No credit rating is performed, which can lead to huge economic losses for the company, since they could sell products to customers that are unable to pay for the products.
In addition, we mentioned possible countermeasures such as manual checks, obtaining credit information from external sources, etc. Group B received the FMEA table shown in table 1, together with a detailed walkthrough of the table.
After the introduction and the completion of the background questionnaire, we asked both groups to identify possible failure modes and consequences when customers purchase goods from a company based on the class diagram in figure 1.