Software Safety

Chloe Sanderson (cns07u, 4073145)

Introduction

Electronic systems have become part of everyday life and have made possible goals that could not otherwise have been achieved. However, if it is not controlled properly, electronic equipment can easily become dangerous. For example, a warning light that fails to come on in an aeroplane cockpit may contribute to a crash, causing large loss of life as well as huge financial loss. A number of techniques can be employed to reduce or remove safety hazards and increase overall safety.

Defining Software Safety

Software safety is concerned with avoiding hazardous situations and alerting the correct systems if a situation becomes unsafe. A hazard can occur for thousands of reasons, so creating a system that avoids them all is a big task.

A hazard is not an accident in itself; it is a factor that could lead to an accident if it is not dealt with. For example, a car with a faulty brake may not crash, because it could rely on its ‘backup brakes’ (were they to exist), but the faulty brake is still a hazard that could cause a crash later if the backup brake were to fail as well. One combination of individual hazards may cause an accident whereas a different combination may not.

Once we have identified and defined a hazard, there are several degrees to which we can deal with it. We could remove the hazard completely, which would stop it from causing an accident either on its own or in combination with other hazards that occur at the same time. We could also just reduce the probability that the hazard will occur. If we reduce that probability to a very low level, we can treat the hazard as though it had been removed completely.

We can also prepare contingency plans in case a hazard does turn into an accident, however small the probability. Reducing the hazard’s negative impact buys what could be precious time to carry out a contingency plan that may not take effect instantaneously. We could also significantly reduce the damage caused by the accident, potentially reducing the loss of life or of other resources such as time or money.

There are two types of safety-critical software:

  • Primary safety-critical software – Malfunctioning of this kind of software could directly cause human or environmental damage.
  • Secondary safety-critical software – Malfunctioning of this software could indirectly cause human or environmental damage. For example, if a drug dispensing machine gave someone the wrong drugs, the system itself would not cause the damage, but it would be the reason that damage was caused.

Where do the problems come from?

The causes of these hazards can be many, and it is often easy to overlook a potential source. As with general software faults, it is easier and cheaper to identify and allow for these hazards early on rather than halfway through the software development cycle.

When writing a software definition it is very easy to miss, or fail to recognise, a potential hazard. Even this simple mistake could cause huge problems later on. The software could also be defined incorrectly or ambiguously, which may lead people reading the specification later to make assumptions or decisions that could easily be wrong.

A lot of software today is written using modules reused from previous projects. While this saves time and money, an accident could occur if the reused software is not properly checked or documented. A famous example is the Mars Climate Orbiter (MCO), which reused a code module from the Mars Global Surveyor (MGS). Unfortunately, the code had not been thoroughly documented, and the team reusing it did not check it well enough to spot a conversion from imperial to metric units within an equation, causing the MCO to be lost. (1)
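One way to make this kind of unit conversion harder to overlook is to carry the unit in the type rather than in an undocumented equation. The following is only a minimal sketch, not the code used on either mission: the `Impulse` class and its method names are hypothetical, and impulse (pound-force seconds versus newton-seconds) is used as an illustrative quantity.

```python
from dataclasses import dataclass
from typing import Iterable

# Newton-seconds per pound-force second (standard conversion factor).
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """Hypothetical value type: an impulse always stored in newton-seconds."""

    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # The conversion is explicit and named, not hidden inside an equation.
        return cls(value * LBF_S_TO_N_S)


def total_impulse(readings: Iterable[Impulse]) -> Impulse:
    """Sum impulses; every reading is already in SI units by construction."""
    return Impulse(sum(r.newton_seconds for r in readings))


readings = [
    Impulse.from_newton_seconds(2.0),
    Impulse.from_pound_force_seconds(1.0),  # converted on entry
]
total = total_impulse(readings)
```

Because a bare number cannot be passed where an `Impulse` is expected without choosing a constructor, whoever reuses the module has to state which unit the data is in.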

As well as each module’s code needing to be correct, the interaction between that module, other modules, and the rest of the system needs to be known to be correct. Because the different modules of a program are likely to run in parallel on the same processor, there may be undesired effects that are hard to replicate. This needs to be considered early in the design stage in order to rule out the problem.

The team writing the software definition could be unsure of the specification of the equipment. For example, if two teams were working at the same time, one developing the code and the other developing the hardware, a lack of communication between them could unintentionally create hazards. If the hardware team were building two mechanical arms and the software team were writing the code to move them, the software team could accidentally make the arms collide through not being aware of the arms’ technical specification.

There could be a simple fault in the code, or a piece of hardware could unexpectedly break. The code can be tested and checked, but to cope with hardware breaking we would need a backup system.

A system could rely on human reaction or monitoring, which may lead to an incorrect judgement. Humans cannot be tested to be ‘correct’ in the same way a system can, as they have free will and are affected by many other factors.

As you can see, hazards can come from many places, and this list is by no means exhaustive. Many of the general testing techniques we have learnt so far can be useful for reducing hazards; however, there are many situations to which they cannot be applied, so other hazard analysis techniques have been created to try to address these.

One suggested technique has been to class the hazards into four specific areas:

  • Inherent Hazards - A software controlled function which is inherently hazardous due to the hazardous nature of the equipment or process being controlled, such as hazardous materials or energy sources.
  • Timing Hazards - Software controlled functions where the timing sequences are safety critical. Time sequences are often wrongly taken for granted to be safe.
  • Induced Hazards - A software hazard caused by some sort of failure, e.g. a hardware or software failure.
  • Latent Hazards - A hidden condition in the software design which is not hazardous until an unplanned or untested set of circumstances occurs.

(2)

Software safety techniques

Many techniques have been developed to reduce the number of hazards and to counteract the problems when they do occur.

Formal definition can reduce the number of hazards significantly. An informal definition may identify hazards but not address them in enough detail. It may also create hazards in areas where there were previously none, because details are unclear. A formal definition is made up of mathematical functions which are completely unambiguous and leave no decision making to the individual. This removes the problems that could have occurred due to unclear definitions. (3)

The chance of hazards caused by component failures can be significantly reduced by having backup components in place. From the diagram below we can see that if a component has a 1% chance of failure, we can reduce that chance to 0.01% by having another component back the first one up, assuming the two fail independently (0.01 × 0.01 = 0.0001). We only need one component to work, and the chance of failure can be further reduced by adding more backup components until the risk of all the components failing is an acceptable one.
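The arithmetic behind this can be sketched in a few lines. This is an illustrative calculation only: it assumes that every component has the same failure probability and that components fail independently of one another, which real hardware may not satisfy (for example when they share a power supply). The function names are hypothetical.

```python
def combined_failure_probability(p_fail: float, n_components: int) -> float:
    """Probability that all n redundant components fail together.

    Assumes identical components that fail independently.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if n_components < 1:
        raise ValueError("need at least one component")
    return p_fail ** n_components


def components_needed(p_fail: float, acceptable_risk: float) -> int:
    """Smallest number of components whose combined failure risk is acceptable."""
    if not 0.0 <= p_fail < 1.0:
        raise ValueError("p_fail must be at least 0 and below 1")
    if acceptable_risk <= 0.0:
        raise ValueError("acceptable_risk must be positive")
    n = 1
    while combined_failure_probability(p_fail, n) > acceptable_risk:
        n += 1
    return n


# One backup for a component with a 1% failure chance.
two_component_risk = combined_failure_probability(0.01, 2)
```

With a 1% failure chance per component, `two_component_risk` comes out at roughly 0.0001, i.e. the 0.01% quoted above, and each further backup multiplies the risk by another factor of 0.01.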

We can also use the derating technique, in which a device is operated at less than its rated maximum power. Operating a system or component below its design limit makes it more reliable. (4)
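In software, derating can show up as a check that the commanded load stays under a reduced limit rather than the rated maximum. The sketch below is a hypothetical illustration: the function names and the 50% derating factor used in the example are assumptions, not values taken from any standard.

```python
def derated_limit(rated_max: float, derating_factor: float) -> float:
    """Return the reduced operating limit for a component.

    derating_factor is the fraction of the rated maximum we allow, e.g. 0.5.
    """
    if rated_max <= 0:
        raise ValueError("rated_max must be positive")
    if not 0 < derating_factor <= 1:
        raise ValueError("derating_factor must be in (0, 1]")
    return rated_max * derating_factor


def within_derated_limit(load: float, rated_max: float, derating_factor: float) -> bool:
    """True if the requested load respects the derated limit."""
    return load <= derated_limit(rated_max, derating_factor)


# Hypothetical component rated at 100 W and operated at no more than 50% of that.
limit_watts = derated_limit(100.0, 0.5)
```

A controller would refuse, or clamp, any request for which `within_derated_limit` returns `False`.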

Fault tree analysis can also be used to visualise the potential combinations of hazards that cause a particular event. For example, in the diagram below, we would put an accident at the top of the tree – where subsystem A is – and add the scenarios that cause the accident through a series of logical expressions or logic gate symbols. (5)
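A fault tree of this kind can also be written down and evaluated in code. The sketch below is a minimal, hypothetical representation using only AND and OR gates; the event names reuse the brake example from earlier and are not taken from the diagram.

```python
from dataclasses import dataclass
from typing import Set, Tuple, Union


@dataclass(frozen=True)
class Event:
    """A basic hazard at the bottom of the tree."""

    name: str


@dataclass(frozen=True)
class Gate:
    """A logic gate combining lower-level nodes; kind is "AND" or "OR"."""

    kind: str
    inputs: Tuple["Node", ...]


Node = Union[Event, Gate]


def occurs(node: Node, active_hazards: Set[str]) -> bool:
    """Return True if the node's event occurs given the hazards that are active."""
    if isinstance(node, Event):
        return node.name in active_hazards
    results = [occurs(child, active_hazards) for child in node.inputs]
    if node.kind == "AND":
        return all(results)
    if node.kind == "OR":
        return any(results)
    raise ValueError(f"unknown gate kind: {node.kind}")


# Top event: loss of braking. It needs the primary brake to fail AND the
# backup to be unavailable for either of two (hypothetical) reasons.
LOSS_OF_BRAKING = Gate("AND", (
    Event("primary_brake_fault"),
    Gate("OR", (Event("backup_brake_fault"), Event("backup_line_leak"))),
))
```

Evaluating `occurs(LOSS_OF_BRAKING, {"primary_brake_fault"})` gives `False`, because the backup still works, which matches the earlier point that a single hazard need not cause an accident on its own.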

The best way to reduce the number of hazards is simply to identify them. Once a hazard has been identified, appropriate action can be taken to deal with it, but an unidentified hazard could cause an accident at any time.

Industry Analysis Techniques

A number of hazard analysis techniques have been developed in order to fully understand and resolve these hazards. One example is the STAMP technique, developed at MIT, which is not only for hazard analysis but also considers organisational factors and the dynamics of complex systems. STAMP has five steps:

  1. Identify the system hazards – identify all of the potential hazards in a system and expand on them to find rough solutions
  2. Identify safety related requirements and constraints – Determine what constraints must hold in order to remove the hazard
  3. Define the basic system control structure – Define who is in control at the time of the potential hazard
  4. Identify inadequate control actions that could lead to a hazard – Find out how the system reaches the hazardous state using the control structure defined in step 3
  5. Determine what constraints could be violated and eliminate, prevent or control them through the system design

(6)

There are other methods that have been developed to reduce hazards, such as SpecTRM and STPA. Different methods are used for different types of systems. Unfortunately, however, hazard analysis is an area that needs significantly more research.

Software Safety Standards

A number of standards have been developed as guidelines for development and to certify that a system is safe. Some of the standards available from ISO are:

  • Health informatics – Classification of safety risks from health software
  • Safety of machinery – Safety-related parts of control systems
  • Space systems – Ground support equipment for use at launch, landing or retrieval sites – General requirements

These standards are designed to make systems and equipment meet a minimum standard, so that the people who use them can be assured of quality and know what to expect from what they are buying. Different safety standards have been created for different industries’ requirements. (7)

Bibliography

1. Leveson, Nancy G and Weiss, Kathryn Anne. Making Embedded Software Reuse Practical and Safe. [Online]

2. Ericson, Clifton A. Software Safety in a Nutshell. [Online]

3. Leveson, Nancy. Completeness in Formal Specification Language Design for Process-Control Systems. [Online]

4. Derating for Electronic Components. [Online]

5. Fault Tree Analysis. [Online]

6. Leveson, Nancy and Dulac, Nicolas. An Approach to Design for Safety in Complex Systems. [Online]

7. ISO. [Online]

8. Sommerville, Ian. Software Engineering. s.l.: Pearson Education.