SAND: Safety Assessment For New Air Traffic Concept Exploration

Barry Kirwan, Eurocontrol Experimental Centre

Copyright © #### by ASME

SUMMARY

There are frequent calls for safety to start early in the life cycle of a system, and in particular at the design stage. But how early is early? And is there a point at which, if too early, there is little real added value, or even a negative effect due to safety unnecessarily ‘constraining’ the creative design process? This paper concerns an attempt to carry out safety assessment during the concept exploration phase of potential new systems for the Air Traffic Management (ATM) industry, based at an ATM research centre. The safety assessment process or framework is called SAND – Safety Assessment for New Designs, and comprises non-quantitative approaches to safety investigation. The main emphasis of SAND is to explore the concepts and determine how to make them safer. At a deeper level, the approach is there to help avoid the occurrence of latent failures and conditions in system design, and to engender a better safety understanding by the designers that will pervade the whole system development life cycle, so that designers and developers themselves are thinking about safety during their work.

The SAND approach or process has three main stages: Scoping; Analysis; and Feed-Forward. This means effectively deciding how much safety the concept project ‘needs’, carrying out the analysis, and then documenting the findings and feeding them forward to later stages and more formal safety assessment processes. The analysis component relies on a small ‘toolkit’ comprising the following techniques: task analysis; human error identification (TRACER); hazard identification (HAZOP); learning from incident experience (SAFLEARN); learning from real-time simulations (SAFSIM); Human Factors Case; hazard logging (HARTS); and safety requirements documentation (SIDES). The ‘infrastructure’ or safety framework surrounding these safety activities is based on a safety policy and a developing Safety Management System (SMS). However, there is neither regulatory oversight nor any requirement for safety work at this early stage in the system life cycle. The safety activities themselves are carried out by a small team of safety people who work with the individual project teams.

This paper first outlines the nature of ATM system concept development and research, to set the context within which safety assessment must work. It then briefly describes the framework, the techniques used and the types of results and insights that can be gained at such early stages in system design and development. It then considers the deeper issues of the direct and indirect value of such attention to safety at an early stage, and discusses the relative pros and cons of such an approach.

Disclaimer: The opinions expressed in this paper are those of the author and do not necessarily represent or reflect those of parent or affiliated organizations.

INTRODUCTION

The Context – Concept Research in Air Traffic Management

Air Traffic Management (ATM) is going through major changes in response to increasing demands for air transportation. This capacity increase is being enabled via a range of new systems to allow air traffic controllers to handle more traffic in already-busy airspace. Part of this change management process involves continual development and refinement of new concepts to improve capacity and efficiency, reduce delays, and maintain safety. Examples of such new concepts in Europe include new ways to send messages between aircraft and air traffic control (e.g. datalink), new airspace concepts (e.g. functional airspace blocks as opposed to those dictated by national boundaries), new tools to help the controller prevent aircraft from losing their required minimum separation, and means to enhance traffic throughput at major airports [1]. At the EUROCONTROL Experimental Centre (EEC), south of Paris, France, research on such concepts is carried out. This research leads to the definition of an operational concept. In ATM, due to its real-time nature and relatively fast dynamics, the main safety ‘mechanism’ or principal line of defense is in fact the human controller (rather than alarm systems or protective hardware systems as found in many other industries). This means that the controllers’ judgment about the adequacy of new operational concepts carries significant weight. For this reason there is a correspondingly strong emphasis on controller involvement in concept development and exploration, the latter typically involving real-time simulations (see Figure 1) in high-fidelity simulators with demanding exercises lasting several weeks.

Figure 1 – EEC (EUROCONTROL Experimental Centre) Main Simulator Room

The Case for Safety Assessment at the Concept Research Stage

The examination of new concepts for ATM is therefore the context of this paper. The next consideration is the need for safety assessment at this stage in the system development life cycle. Once concepts are mature and consolidated, they will leave the research stage and be developed more formally to result in real operational improvements, and will undergo a full safety assessment and safety case preparation process [2], including quantitative demonstration that any new addition to the overall ATM system and infrastructure does not compromise the target level of safety set for Europe as a whole (the current target is 1.55 accidents per 10^8 aircraft flight hours). But before this consolidation stage is reached, there is effectively no safety regulator and no regulations applicable. Additionally, the concepts at the research stage are not as concrete as they will later be, and so it can be difficult to carry out extensive fault studies and consider sub-system interactions and common mode failures etc., since there may be insufficient detail available. Therefore two fundamental questions arise:

  1. What are the principal aims of carrying out safety assessment at such an early stage? In other words what exactly is it hoped to achieve?
  2. Can safety assessment at such a stage really add safety value? In other words, can useful insights be gained that persist, such that they remain valid through the later and less formative stages of design, and are seen as useful to the stakeholders (including the designers themselves and controllers involved in system development and acceptance, as well as industrial stakeholders)?

The first question is therefore one of what is intended to be achieved. In these days of tight resources, it is not sufficient to expend resources based purely on a platitude that safety cannot start early enough. Therefore, the EEC commissioned an independent study into the roots of accidents with respect to design [3]. This study investigated a number of industries including nuclear power, aviation and rail, and found evidence of a general effect that approximately 50% of accidents have their roots in the design process. The design process here includes specification as well as design (this is mentioned because in the study ‘inadequate specification’ was a major source of failure). The study revealed many sources of failure, but some key examples of interest are the following:

  • Use outside of the design envelope
  • Changes of operational context
  • Failure of defense in depth
  • Misconceptions between designers and operators
  • Unexpected failure mechanisms
  • Incorrect functioning leading to mistrust of safety system by the operator

These are all highly relevant to the ATM system design and development context. For example, in Europe a general concept may be developed, but then may be used in many different European states, which may need to ‘tailor’ it to their system. This could lead to the first and second types of ‘error’ noted above. The third ‘error’ type suggests an inadequate safety philosophy at the heart of the system being developed. The fourth and the sixth are perhaps less likely in ATM since controllers are highly integrated into the ATM system design process, and system developers are acutely aware of the need to manage trust in the controller population (any system that malfunctions systematically is quickly rejected and discarded by controllers). The fifth also suggests inadequate safety assessment, particularly a lack of creative safety investigation or ‘thinking outside the box’ (also known as ‘safety imagination’).

These design-safety ‘failure modes’ therefore suggest the need to focus on safety in the design stage, in particular in the following areas:

  • Exploring and explicitly defining the system’s safe acceptable operational boundaries (its safe ‘envelope’)
  • Avoiding ‘under-specification’
  • Detailed consideration of the barriers in the system and common mode threats to safety
  • Ensuring designers know how controllers think and how they would manage safety in relevant scenarios
  • Harnessing the creative powers in design to consider new failure mechanisms

Additionally, in Reliability Engineering there is concern expressed about ‘latent errors’, essentially failures in the design that can lie dormant for many years until the right operational conditions arise. Such errors or failures could also arguably be identified by a focus on safety at a more formative early design stage.

A further consideration or rationale for early safety assessment concerns the intrinsic difficulties of demonstrably comprehensive safety assessment when addressing highly complex systems. It is usually not possible to state categorically that all failure modes have been found: safety assessment of complex systems is not deterministic, because the number of potential events and interactions (and the number of permutations of those interactions) is simply too large.

There are three approaches to dealing with this state of affairs. The first is a quality safety assessment process, as exists in many industries, that should identify most hazards, and certainly the critical ones. However, particularly with novel system developments, comprehensiveness cannot be guaranteed. The second approach is a quality-driven design process, one that also gives safety its proper place, and ensures that safety principles are embodied rationally in the system specification. This is the approach advocated for example by Leveson [4], and it is a sound approach, though it is more applicable to mature concepts than to concept exploration or concept research.

A third approach is to develop a culture of safety within the system design community itself. After all, safety assessors are often seen as the ‘police’ of such systems, and cannot be ever-present. In contrast, designers are the progenitors of the system, and are working with it every day. If designers themselves are also concerned with safety, then they are well-placed to detect and raise problems at their source, during the design process. This can only happen if they see it as part of their responsibility, and as part of their competence. This means they must to an extent ‘own’ safety, or at least co-own it with safety advisors, and they must participate in safety assessments themselves, directly. It also means that safety must not be seen as only the job of the safety assessor or as something remote from design, or as something beyond the competence of the designer to follow.

The next section of the paper considers how to actually try to achieve the first and third aims, based on an approach (called SAND) adopted at the EEC over the past two years, moving more towards design-centered safety. The final section then tackles the second main question raised above, by discussing the interim results of the SAND ‘experiment’, and addressing the degree to which such a process adds safety value in a meaningful and sustainable way.

The SAND (Safety Assessment for New Designs) Approach

SAND is a set of inter-related techniques aimed at early identification of hazards and human error problems. The SAND techniques are drawn from existing tried and tested approaches in air traffic and other industries (e.g. nuclear power, offshore petrochemical, etc.). The techniques have the following attributes:

  • Robustness
  • Qualitative, not quantitative
  • Flexibility in terms of depth of system description required
  • Low degree of abstraction (high in ‘concreteness’)
  • Requiring designer involvement
  • Low training needs in terms of safety assessment
  • Low/moderate resources needs in terms of assessor and domain expert (designer/controller) time

The techniques themselves are the following [5]:

Hierarchical Task Analysis (HTA) - a Human Factors technique for describing and documenting what the humans do in a system, in terms of their goals, tasks and operations. HTAs are the Human Factors equivalent of Piping and Instrument Diagrams (P&IDs) or Functional Block Diagrams (FBDs) as used in engineering design and reliability engineering respectively. If you cannot develop a HTA, it means you do not know how the human will operate the system. The degree of detail of a HTA is however flexible.
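
As an illustration of the goal/task/operation hierarchy described above, a minimal Python sketch of an HTA as a tree follows. The class name `HtaNode`, its methods, and the example controller task are all hypothetical and heavily simplified; they are not taken from an actual SAND analysis. The point is only that the level of detail is set by how far each branch is decomposed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HtaNode:
    """One goal, task, or operation in a hierarchical task analysis."""
    label: str
    children: List["HtaNode"] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Return the lowest-level operations under this node, in order."""
        if not self.children:
            return [self.label]
        result: List[str] = []
        for child in self.children:
            result.extend(child.leaves())
        return result

    def depth(self) -> int:
        """Number of levels from this node down to its deepest operation."""
        if not self.children:
            return 1
        return 1 + max(child.depth() for child in self.children)


# Hypothetical, simplified controller task (assumption, for illustration only).
hta = HtaNode("Resolve potential conflict", [
    HtaNode("Detect conflict", [
        HtaNode("Scan radar display"),
        HtaNode("Compare flight levels"),
    ]),
    HtaNode("Issue instruction", [
        HtaNode("Select resolution"),
        HtaNode("Transmit clearance"),
        HtaNode("Check readback"),
    ]),
])

if __name__ == "__main__":
    print(hta.depth())
    for operation in hta.leaves():
        print("-", operation)
```

The list returned by `leaves()` is the kind of operation-level description that an error identification technique could then step through, one operation at a time.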

TRACER (Technique for the Retrospective Analysis of Cognitive Error in ATM) – a human error identification technique (also an incident human action classification technique, hence its name). TRACER uses a set of ATM-specific questions and guideword sets to identify errors and error recoveries that could occur in ATM scenarios. It needs a HTA as its starting point. Typically TRACER is carried out by one or two assessors, working in conjunction with one or two system operation/design experts.

HAZOP (Hazard and Operability Study) – HAZOP is a hazard identification and classification approach that originated in the chemical industries in the mid-seventies. It has a set of guidewords and is basically a ‘what-if?’ technique, using a group of experts run by a chair-person. These guidewords are applied to a representation of the system which can be block diagrams, task analysis, or some other representation meaningful to the group. Each identified hazard is broadly classified in terms of its severity and likely frequency, to give a basic assessment of risk importance.
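
The two mechanical parts of this description, applying guidewords to a system representation and giving each hazard a broad severity/frequency classification, can be sketched as below. This is a minimal sketch under stated assumptions: the guidewords are a subset of the classic process-industry set (the ATM-specific set is not given here), and the severity and frequency scales, the scoring rule and the thresholds are invented for illustration, not the classification scheme used in SAND.

```python
from itertools import product
from typing import List

# Illustrative subset of classic HAZOP guidewords (assumption).
GUIDEWORDS = ["no", "more", "less", "other than"]

# Hypothetical ordinal scales, lowest to highest (assumption).
SEVERITY = ["minor", "significant", "major", "serious"]
FREQUENCY = ["rare", "occasional", "frequent"]


def what_if_prompts(elements: List[str]) -> List[str]:
    """Pair every system element with every guideword as a 'what-if?' prompt."""
    return [
        f"What if '{guideword}' applies to '{element}'?"
        for element, guideword in product(elements, GUIDEWORDS)
    ]


def risk_importance(severity: str, frequency: str) -> str:
    """Combine the two ordinal ranks into a coarse risk importance band."""
    score = (SEVERITY.index(severity) + 1) * (FREQUENCY.index(frequency) + 1)
    if score >= 8:      # hypothetical threshold
        return "high"
    if score >= 4:      # hypothetical threshold
        return "medium"
    return "low"


if __name__ == "__main__":
    for prompt in what_if_prompts(["datalink message", "conflict alert"]):
        print(prompt)
    print(risk_importance("major", "occasional"))
```

In a real session the prompts only seed the discussion; identifying and classifying each hazard remains a judgment of the expert group.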

Human Factors Case – this is a rapid Human Factors auditing approach that starts with a Human Factors issue analysis which, like HAZOP, is run by an expert group (e.g. comprising Human Factors people, designers and controllers). Identified Human Factors issues are then addressed with reference to a body of Human Factors knowledge, or via simulation investigation, to resolve the issues raised.

SAFSIM (Safety Insights in Simulations) – SAFSIM is a pool of techniques and tools that can be used to measure safety impacts during real-time simulations. The measurement techniques may be subjective in nature (e.g. measuring impacts of a new display on situation awareness, trust, or workload), or more objective (e.g. measuring minimum distance between aircraft to see if separation minima were infringed). Sometimes it is possible to go further and actually ‘inject’ safety-related events (called ‘non-nominal events’) into simulations to see how quickly and effectively controllers can detect and neutralize such threat events.
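
The objective measure mentioned above, minimum distance between aircraft checked against separation minima, could be computed from recorded simulation tracks roughly as follows. This is a hedged sketch: the track format, the function names, and the default minima of 5 NM horizontally and 1000 ft vertically are assumptions chosen for illustration; the values that actually apply depend on the airspace being simulated.

```python
import math
from typing import List, Tuple

# One sample per time step: (x in NM, y in NM, altitude in ft).
# Both tracks are assumed to be sampled at the same instants.
Track = List[Tuple[float, float, float]]


def min_horizontal_distance(a: Track, b: Track) -> float:
    """Smallest horizontal distance (NM) between two aircraft over the run."""
    return min(
        math.hypot(xa - xb, ya - yb)
        for (xa, ya, _), (xb, yb, _) in zip(a, b)
    )


def separation_infringed(
    a: Track,
    b: Track,
    h_min_nm: float = 5.0,      # assumed horizontal minimum
    v_min_ft: float = 1000.0,   # assumed vertical minimum
) -> bool:
    """True if, at any sample, both horizontal and vertical minima are lost."""
    for (xa, ya, za), (xb, yb, zb) in zip(a, b):
        horizontal = math.hypot(xa - xb, ya - yb)
        vertical = abs(za - zb)
        if horizontal < h_min_nm and vertical < v_min_ft:
            return True
    return False


if __name__ == "__main__":
    track_a = [(0.0, 0.0, 35000.0), (5.0, 0.0, 35000.0), (10.0, 0.0, 35000.0)]
    track_b = [(20.0, 0.0, 35000.0), (13.0, 0.0, 35000.0), (12.0, 0.0, 35000.0)]
    print(min_horizontal_distance(track_a, track_b))
    print(separation_infringed(track_a, track_b))
```

Note that an infringement requires both minima to be lost at the same sample; two aircraft passing overhead one another with adequate vertical separation are not counted.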

SAFLEARN (‘learning from the past to protect the future’) – SAFLEARN is an approach whereby future design concepts try to see how they would overcome current incident typologies. For example, a new tool for helping controllers detect and avoid ‘conflicts’ (e.g. where aircraft are on an intercept course) before they lead to actual reportable incidents (losses of separation) can be contrasted against current loss of separation incidents, to see what proportion it would actually avoid in principle. SAFLEARN uses a group-based approach (safety experts, incident investigation expertise, designers and controllers) and needs a database of current incidents, and significant knowledge of the etiology (causes; contributory factors, etc.) of those incidents.
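
The 'what proportion would it avoid' question reduces to simple arithmetic once the group has judged each incident. A minimal sketch, assuming each incident record carries a causal category and the group's yes/no judgment (both the record format and the example categories are hypothetical):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (incident category, group judgment that the concept would have avoided it)
Judgment = Tuple[str, bool]


def avoidable_share(judgments: List[Judgment]) -> Dict[str, float]:
    """Proportion of incidents judged avoidable, per category and overall."""
    if not judgments:
        return {}
    totals: Dict[str, int] = defaultdict(int)
    avoided: Dict[str, int] = defaultdict(int)
    for category, would_avoid in judgments:
        totals[category] += 1
        if would_avoid:
            avoided[category] += 1
    shares = {category: avoided[category] / totals[category] for category in totals}
    shares["overall"] = sum(avoided.values()) / len(judgments)
    return shares


if __name__ == "__main__":
    # Invented example data, not real incident statistics.
    sample = [
        ("late detection", True),
        ("late detection", True),
        ("late detection", False),
        ("readback error", False),
    ]
    for category, share in avoidable_share(sample).items():
        print(f"{category}: {share:.0%}")
```

Breaking the result down by category shows which incident types the concept leaves untouched, which is often more informative than the overall figure.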

HARTS & SIDES (Hazard and Requirements Tracking System; Safety Information Data Exchange System) – HARTS and SIDES are concerned with the initial documentation and the feed-forward of safety information, respectively. HARTS is detailed and is used by the ‘safety officer’ for a project (operational concept). It is basically a hazard log, defining the hazard, how and when it was identified, its severity and frequency classifications, other related information, what action is being taken, if any, to resolve it, and any related or derived safety requirements that would need to be considered by the Concept project team. SIDES aims to take all the information from HARTS and synthesize it into a transmissible and memorable form, for later use in the system design life cycle. SIDES is necessary because after exploration of the concept, the concept may be further developed by a different team or organization. The concept may also be changed, and the developers will need to know the safety impact of any such changes they might make.
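
To make the hazard log contents concrete, the sketch below models one log entry with the fields listed above, plus a simple condensation step in the spirit of the feed-forward summary. The class `HazardEntry`, its field names, and the function `feed_forward_summary` are hypothetical; the actual HARTS and SIDES formats are not specified here.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class HazardEntry:
    """One hazard log record (field names are assumptions)."""
    hazard: str
    identified_by: str                 # how it was found, e.g. "HAZOP"
    identified_on: date                # when it was found
    severity: str
    frequency: str
    action: Optional[str] = None       # None if no action is being taken
    safety_requirements: List[str] = field(default_factory=list)
    notes: str = ""                    # other related information


def feed_forward_summary(log: List[HazardEntry]) -> List[str]:
    """Condense each entry to one line for hand-over to a later team."""
    lines = []
    for entry in log:
        action = entry.action or "no action recorded"
        requirements = "; ".join(entry.safety_requirements) or "none"
        lines.append(
            f"{entry.hazard} [{entry.severity}/{entry.frequency}] "
            f"action: {action}; requirements: {requirements}"
        )
    return lines


if __name__ == "__main__":
    # Invented example entry, for illustration only.
    log = [
        HazardEntry(
            hazard="Datalink message sent to wrong aircraft",
            identified_by="HAZOP",
            identified_on=date(2004, 1, 15),
            severity="major",
            frequency="occasional",
            action="Add confirmation step",
            safety_requirements=["Recipient shall be highlighted before send"],
        )
    ]
    for line in feed_forward_summary(log):
        print(line)
```

Keeping the derived safety requirements attached to the hazard that generated them is what lets a later team judge the safety impact of changing the concept.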

Figure 2 shows how SAND may be applied [5]. The figure firstly shows two necessary mechanisms that enable the organization of the safety assessment work. The first mechanism is a safety policy, signed by the top management of the research centre, which explains the commitment to safety in its work and research. The second is a safety plan, which can be at an area level (e.g. for research into Airports, or for En Route air traffic, or for airspace design) which ultimately is used to decide which projects receive a safety focus, and what techniques will be applied.