A Simplified Approach to Fault Tolerant State Machine Design for Single Event Upsets

Melanie Berg

Principal Design Engineer, Ball Aerospace & Technologies Corp.


1 Abstract

As Integrated Circuit (IC) geometries become smaller and core voltages scale down, the probability of incurring system faults increases significantly. Errors occur when charged particles penetrate a memory cell and cross a junction, creating an aberrant charge that changes the state of the bit. This is not a new problem for ICs targeted for space flight; however, because of these IC advancements, ground-based designs will increasingly have to consider it as well. Depending on the speed of the specified FPGA or ASIC circuit and the geometry of the employed technology, transistor-level “Hardened by Design” techniques may not sufficiently meet design requirements. To keep pace with IC advancements, FPGA and ASIC designers will have to carefully consider architectures that contain some degree of gate-level fault tolerance (mitigation). Fault tolerance is defined as masking or recovering from erroneous conditions in a system once they have been detected.

This paper will address Single Event Upsets (SEUs) within edge-triggered D-Flip-Flops (DFFs) and assumes that the upsets are soft (correctable by the following clock edge). This is a fair assumption because the probability of multiple errors occurring within one clock cycle is low.

Due to the radiation effects in space, the aerospace industry has always had to design with SEU mitigation in mind. As far as gate-level DFF protection is concerned, Triple Modular Redundancy (TMR, voting) logic is the most commonly used scheme to combat SEUs. However, TMR can be very area intensive and, in a turbulent environment, may not fully eliminate the probability of upsets. As a solution, many error-coding techniques have been proposed as a complement (or replacement) to TMR; however, due to their complexity, they are rarely implemented.

The theory of fault tolerance (or mitigation) is very extensive. However, it is very seldom emphasized that errors are not only random but also asynchronous to the circuitry. The theory does not cover how to actually implement FPGA or ASIC designs that can correctly detect asynchronous errors without worsening the fault (turning an SEU into a Single Event Functional Interrupt, or SEFI). The problem arises when errors occur near clock edges: because of differences in routing delays (and perhaps glitchy mitigation circuitry), the output of the detection logic may be seen by some DFFs but not by others. In the worst case, the asynchronous error event can set off a chain of metastability. Such a scenario has a very low probability of occurring in a slow and/or simple architecture such as a shift register. However, as the clock frequency increases and the DFF fan-out grows significantly, as in counters or complicated Finite State Machines (FSMs), the probability of faulty transitions increases drastically.

It will also be shown that if the designer does not take into account the asynchronous nature of the SEU, the probability of incurring a SEFI increases. Additionally, a simplified approach to fault tolerant state machine design, from architectural development through synthesis, will be presented. Examples of coding schemes that include additional logic for error detection and (in some cases) correction, such as One-Hot, Sequential, and Hamming, will be examined. Because users have run into roadblocks with synthesis tools “optimizing” away logic needed for error handling, special attention will be given to the Precision and Leonardo Spectrum synthesis tools and the techniques needed to produce a correct realization of the intended functionality.

2 Motivation for State Machine Fault Tolerance

DFFs play an important role within synchronous designs because, when contained within one clock domain, they act as deterministic timing boundaries. Such a design strategy increases the verification coverage of a circuit and enhances production. Synchronous state machines use DFFs to hold their current state value.

State Machines are generally used as controllers and are at the heart of most designs. If a synchronous state machine is not designed to accommodate Single Event Upsets, the circuit can become locked or produce unpredictable behavior until a reset is generated. Unfortunately, waiting for a system reset may not be a suitable solution. Thus, designers should be aware of techniques for error detection and perhaps correction specifically for erroneous state conditions. It is important to note that current synthesis tools are specifically geared towards area and timing optimization. Their algorithms will want to “erase” redundant schemes for fault tolerance that the designer places within the HDL code. Therefore, the designer must also be familiar with the synthesis package of choice and apply the necessary directives for preserving redundant logic.

3 Synchronous State Machine Implementation

Generally, state machines are utilized as controlling mechanisms within a design. They determine when signals should turn on or off, when to implement (or stop implementing) a function, how long to wait for events, etc. However, state machines are not necessary to implement such functionality. As a matter of fact, using state machines tends to increase the total number of gates within a design. In addition, if a state machine operates in a disruptive environment (such as a space mission), the entire system can lock up in an unreachable condition. So why use state machines?

Utilization of state machines affords the designer an easy-to-follow design methodology, manageable design reviews, and a hook into a systematic verification process. It also reduces the tendency to create “spaghetti” (out-of-control) designs.

3.1 Structure

A synchronous state machine is designed to deterministically transition through a pattern of defined states. A state is represented by a register (set of DFFs) and is referred to as the “current state.”

Figure 1: Traditional State Machine Logic Flow

The structure consists of three parts:

1. Current state register: Register of n-bit DFFs used to hold the current state of the machine. It changes state only on a clock edge.

2. Next state logic: Combinatorial logic used to generate the next state for the machine. The next state value is a function of the state machine’s inputs and its current state.

3. Output logic: The output can be purely combinatorial or registered (generally it is preferable to register the logic). The outputs can be a function of the next state and/or current state (and perhaps the direct inputs) of the machine.

3.2 Encoding Schemes

Each state of a state machine must be mapped into some type of encoding (pattern of bits). Once the state has been mapped, it is then considered a defined (legal) state. Depending on the encoding scheme and the number of states within the design, there can be “unused” encoded patterns – i.e. there is no defined mapping from the encoded pattern to a state. In such a case, if there is a fault within the circuitry and the state machine jumps into one of these unmapped (illegal) states, then the circuit has the potential to become locked (or stuck) in some unreachable condition (undefined state).

The encoding schemes currently most popular among designers are One-Hot, Binary, and Gray. The reader should be aware that there are many other schemes; however, this paper will only address these three, plus a special encoding technique specifically for correction.

Within each encoding scheme, there is the possibility of containing “illegal” (unmapped) states. It is very important to note that most designers who use some sort of HDL (Verilog or VHDL) will use a “default” or “others” clause respectively within their case statement and assume that if an illegal state is reached, the state machine will have synthesized gate logic that will return the machine to a deterministic “good” state. This is a false assumption. Synthesis tools will ignore the default (or others) clause due to the amount of additional logic required for implementation. The designer must manually take care of recovery from illegal state transitions either by using a special synthesis directive (safe) or by manually implementing additional error logic.

3.2.1 Binary

The binary encoding technique maps states into a base 2 counting scheme. Example: for a three-state state machine the encoding can be as follows: STATEA: 00; STATEB: 01; STATEC: 10. The unused encoding pattern is 11 and is therefore an illegal state.

3.2.2 Gray

Gray is similar to binary. However, transitions from state to state will only differ by one encoding bit. Such a state machine can be very tedious to design when there are multiple branches per state. Example: for a three-state state machine the encoding will be as follows:

STATEA: 00; STATEB: 01; STATEC: 11. The unused encoding pattern for this scheme is “10”. However, if it is necessary to cycle through the states (i.e. return from STATEC to STATEA), the designer will have to implement the extra state mapped to “10” in order to keep the transitions truly Gray – remember a transition from an encoded state of “11” to “00” is illegal (2 bits have changed).

3.2.3 One-Hot

One-hot encoding requires only one bit to be turned on at a time, i.e. each state is mapped to one bit. Example: for a three-state state machine the encoding will be as follows:

STATEA: 001; STATEB: 010; STATEC: 100. The unused encoding patterns (or illegal states) are: 000, 011, 101, 110, and 111.

One-hot state machines require more registers than Binary or Gray and thus ultimately have more “illegal states”. However, they are usually synthesized into shift registers, and because no extra levels of decoding are necessary, one-hot is the fastest encoding scheme. If the design requires fast paths, one-hot is generally the method of choice.
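The three encodings of the three-state example can be written out explicitly, as in the following sketch. The package name fsm_encodings, the constant names, and the two helper functions are illustrative assumptions rather than part of the original design; only the bit patterns are taken from sections 3.2.1 through 3.2.3.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Illustrative package (hypothetical names): explicit encodings for the
-- three-state example of sections 3.2.1 to 3.2.3.
package fsm_encodings is
  -- Binary: "11" is the only unused (illegal) pattern.
  constant BIN_STATEA : std_logic_vector(1 downto 0) := "00";
  constant BIN_STATEB : std_logic_vector(1 downto 0) := "01";
  constant BIN_STATEC : std_logic_vector(1 downto 0) := "10";

  -- Gray: STATEA -> STATEB -> STATEC changes one bit per step; "10" is unused.
  constant GRAY_STATEA : std_logic_vector(1 downto 0) := "00";
  constant GRAY_STATEB : std_logic_vector(1 downto 0) := "01";
  constant GRAY_STATEC : std_logic_vector(1 downto 0) := "11";

  -- One-hot: exactly one bit set; 000, 011, 101, 110 and 111 are unused.
  constant OH_STATEA : std_logic_vector(2 downto 0) := "001";
  constant OH_STATEB : std_logic_vector(2 downto 0) := "010";
  constant OH_STATEC : std_logic_vector(2 downto 0) := "100";

  -- True when the pattern is mapped to a state in the given encoding.
  function is_legal_binary(s : std_logic_vector(1 downto 0)) return boolean;
  function is_legal_onehot(s : std_logic_vector(2 downto 0)) return boolean;
end package fsm_encodings;

package body fsm_encodings is
  function is_legal_binary(s : std_logic_vector(1 downto 0)) return boolean is
  begin
    return s = BIN_STATEA or s = BIN_STATEB or s = BIN_STATEC;
  end function is_legal_binary;

  function is_legal_onehot(s : std_logic_vector(2 downto 0)) return boolean is
  begin
    return s = OH_STATEA or s = OH_STATEB or s = OH_STATEC;
  end function is_legal_onehot;
end package body fsm_encodings;
```

Writing the encodings as constants makes the unused patterns visible at a glance: one of four patterns for binary and Gray, five of eight for one-hot.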

4 Fault Tolerant State Machine Design

The main objective of a fault tolerant state machine is to detect an error (a flipped bit in the current state registers) and produce a deterministic response within a deterministic time frame. The definition of the response and its response time depends on the design requirements. For example, the response may be as simple as indicating that an error occurred and waiting for a signal to bring the state machine to a known “working” state, or as complex as automatic correction of the error within one clock cycle.

Recently, there has been a surge of designers using the “safe” options offered by several synthesis tools. A safe state machine has been defined as one that will always transition to a known state, i.e., if an SEU occurs and an illegal or unmapped state is reached, the machine will recover to a known state. Although this scheme may appear to be “fool-proof”, it is not. The compilers are geared towards implementing the “safe” option with binary encoded state machines, which creates a false sense of safety. For example, if an SEU flips one of the state registers and moves the machine into a mapped (legal) state through an illegal transition, an extreme fault can occur: the outputs can misbehave severely with no error detection.

An approach beyond the synthesis “safe” directive must be taken to ensure trustworthy fault tolerance. This paper will first address single-bit error detection techniques. Afterwards, a robust single-bit error correction scheme along with a new encoding technique will be presented.
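The false sense of safety with binary encoding can be checked by enumeration. The simulation-only sketch below assumes the three-state binary encoding of section 3.2.1 (00, 01, 10); the entity name binary_safe_gap and the helper is_legal are hypothetical. It flips each bit of each legal state in turn and counts how many upsets land on another legal state, where illegal-state recovery alone would not notice them.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Simulation-only sketch (hypothetical names): counts the single-bit upsets
-- of a three-state binary encoding that land on another LEGAL state.
entity binary_safe_gap is
end entity binary_safe_gap;

architecture sim of binary_safe_gap is
  type code_array is array (natural range <>) of std_logic_vector(1 downto 0);
  -- STATEA, STATEB, STATEC from section 3.2.1; "11" is the illegal pattern.
  constant LEGAL : code_array(0 to 2) := ("00", "01", "10");

  function is_legal(s : std_logic_vector(1 downto 0)) return boolean is
  begin
    for k in LEGAL'range loop
      if s = LEGAL(k) then
        return true;
      end if;
    end loop;
    return false;
  end function is_legal;
begin
  process
    variable upset      : std_logic_vector(1 downto 0);
    variable undetected : natural := 0;
  begin
    for k in LEGAL'range loop
      for b in upset'range loop
        upset    := LEGAL(k);
        upset(b) := not upset(b);      -- model an SEU on bit b
        if is_legal(upset) then
          undetected := undetected + 1;
        end if;
      end loop;
    end loop;
    -- 00->01, 00->10, 01->00 and 10->00 all stay within the legal set.
    assert undetected = 4
      report "unexpected number of undetected upsets" severity failure;
    report integer'image(undetected) &
           " of 6 single-bit upsets reach a legal state";
    wait;
  end process;
end architecture sim;
```

For this encoding, four of the six possible single-bit upsets move the machine to a mapped state, so only the two that reach “11” can be caught by recovery from illegal states.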

4.1 Single-Bit Error Detection and Recovery within One Clock Cycle

This section will focus on detecting that an error has occurred within one of the state machine DFFs and then recovering from the fault. Remember that it is necessary for the designer to manually provide recovery logic – synthesis tools do not provide an automatic “bucket” of illegal states without the designer using specific directives. This statement directly pertains to the use of HDL CASE statements and their default (or others) clause. All synthesis tools ignore this clause during state-machine gate-level production if the “safe” directive is not used.

4.1.1 One-Hot

The beauty of the one-hot encoding style stems from the fact that every pair of states has a Hamming distance of 2 (two bits must flip to get from one state to another), so it inherently provides SEU error detection. During normal operation of a one-hot state machine, only one bit is turned on, indicating which state is current. Thus the current state register should always have odd parity. Each legal transition requires two bits to flip (one bit turns off while another turns on). If there is an SEU within the state machine, only one bit will flip, and the parity will switch to even.

Example: Using the three-state example given in section 3.2.3, assume the circuit is in STATEA (“001”, odd parity). An SEU can cause the state machine to have the encoded pattern “000”. However, “000” is not mapped to any state and is easily detected because it has even parity. It is thus considered an undefined or illegal state for the state machine.
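The parity argument can be exercised in simulation. The sketch below is self-checking; the entity name onehot_parity_check and the function parity_error are hypothetical names introduced for illustration. It confirms that the legal one-hot pattern “001” has odd parity and that flipping any one of its bits yields even parity.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Simulation-only sketch (hypothetical names): a single bit flip in a
-- one-hot state register always produces even parity.
entity onehot_parity_check is
end entity onehot_parity_check;

architecture sim of onehot_parity_check is
  -- Returns '1' when the vector holds an even number of ones, i.e. when the
  -- register has lost the odd parity expected of a one-hot state.
  function parity_error(state : std_logic_vector) return std_logic is
    variable odd : std_logic := '0';
  begin
    for i in state'range loop
      odd := odd xor state(i);
    end loop;
    return not odd;
  end function parity_error;
begin
  process
    constant STATEA : std_logic_vector(2 downto 0) := "001";
    variable upset  : std_logic_vector(2 downto 0);
  begin
    -- A legal one-hot state must not raise the error flag.
    assert parity_error(STATEA) = '0'
      report "legal one-hot state flagged as error" severity failure;

    -- Flip each bit in turn to model an SEU ("000", "011", "101").
    for i in STATEA'range loop
      upset    := STATEA;
      upset(i) := not upset(i);
      assert parity_error(upset) = '1'
        report "single-bit upset not detected" severity failure;
    end loop;
    wait;
  end process;
end architecture sim;
```

The same check misses a double-bit upset, which restores odd parity; that is consistent with the paper's assumption of at most one upset per clock cycle.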

4.1.2 Using the “Bucket” Approach

Most designers believe that they are implementing a bucket approach to fault tolerance when using the default clause (in VHDL, the “when others” clause) within a CASE statement. However, the synthesis tools ignore these statements when synthesizing state machines. Most designers are not aware that these statements (which suggest “bucketing” unused states and supplying a transition out of a fault condition) have no bearing on the actual gates being created. The reasoning behind this is that if the synthesis tool were to implement all unused states, the required area would be excessive.

Synthesis tools have a directive that the designer can use called “safe”. When applying this directive to a state machine, the tool can produce a bucket of illegal states. However, the designer must be forewarned that using the “safe” directive has many flaws:

1. It generally creates more gates than desired, i.e. buckets of illegal states can become very large.

2. Synplify will implement a “safe” one-hot state machine, but Leonardo and Precision will only implement a sequential state machine (binary or Gray encoding) when the “safe” directive is used. Unfortunately, sequential state machines can be faulty because an SEU can cause an undetected transition to a mapped (legal) state.

The designer must consider the list of potential problems before using the safe directive as the method for fault detection.

4.1.3 Using Multiplexer Control for Next State Transitions

The most efficient means of SEU error detection and recovery in one-hot state machines is a combinatorial parity checker over the one-hot current state register set. The designer uses the output of the error detection logic either to place the state machine into its normal operational next state at the following clock edge or, upon detecting an SEU, to set the machine into a designated error state (which could be as simple as going back to a reset state). The fact that the error detection is purely combinatorial logic is a plus: it contains no DFFs, so the detection logic has no storage element of its own that an SEU can upset.

Figure 2: One-Hot State Machine with Error Detection

The following is an example in VHDL of how to implement parity-checking error detection across a four-state one-hot state machine:

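library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_misc.all;  -- provides xnor_reduce

-- The entity and architecture names are placeholders; the ports are the
-- signals referenced by the state machine below.
entity onehot_fsm is
  port (
    sysclk  : in  std_logic;
    reset   : in  std_logic;  -- asynchronous, active low
    intrans : in  std_logic;
    outsig  : out std_logic
  );
end entity onehot_fsm;

architecture rtl of onehot_fsm is
  type FSM_states is (IDLE, FSM1, FSM2, FSM3);
  signal FSM      : FSM_states;
  signal next_FSM : FSM_states;
  -- Tool-specific encoding attribute; its declaration is expected to come
  -- from the synthesis vendor's package.
  attribute TYPE_ENCODING_STYLE of FSM_states : type is ONEHOT;
  signal e     : std_logic_vector(3 downto 0);
  signal error : std_logic;
begin
  -- Map the enumerated type into a std_logic_vector for the
  -- XNOR error check.
  e(0) <= '1' when FSM = IDLE else '0';
  e(1) <= '1' when FSM = FSM1 else '0';
  e(2) <= '1' when FSM = FSM2 else '0';
  e(3) <= '1' when FSM = FSM3 else '0';

  -- Next state combinatorial process
  process (FSM, intrans)
  begin
    case FSM is
      when IDLE =>
        if (intrans = '1') then
          next_FSM <= FSM1;
        else
          next_FSM <= IDLE;
        end if;
      when FSM1 =>
        next_FSM <= FSM2;
      when FSM2 =>
        next_FSM <= FSM3;
      when FSM3 =>
        next_FSM <= IDLE;
      when others =>
        next_FSM <= IDLE;
    end case;
  end process;

  -- Even parity across the one-hot register indicates an SEU.
  error <= xnor_reduce(e);

  -- Sequential process producing the current state DFFs
  -- and the output DFF
  process (sysclk, reset)
  begin
    if (reset = '0') then
      FSM    <= IDLE;
      outsig <= '0';
    elsif rising_edge(sysclk) then
      -- Mux in front of the state machine: upon error
      -- go to IDLE, otherwise take the next state
      if (error = '1') then
        FSM <= IDLE;
      else
        FSM <= next_FSM;
      end if;
      if FSM = IDLE then
        outsig <= '1';
      else
        outsig <= '0';
      end if;
    end if;
  end process;
end architecture rtl;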

Note: The user must declare the state machine as one-hot to ensure that the encoding is as expected (see the attribute statement in the VHDL example). If coding in VHDL, extra code must be included (see the signal named “e”) in order to map the enumerated type into a std_logic_vector and perform the XNOR error check (error). The “e” signal is understood as a type conversion and no extra logic is synthesized. Because the state machine is one-hot, there is a one-to-one correspondence between the e vector and the FSM register bits, i.e. e(0) <= FSM(0), e(1) <= FSM(1), etc.
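An alternative, which is not the listing above but a sketch under stated assumptions, is to hold the state in an explicit one-hot std_logic_vector so that the parity check reads the register bits directly and no enumerated-type mapping is needed. The entity name onehot_fsm_slv, the seu_err port, and the signal name err are hypothetical; the states, ports, and recovery behavior otherwise follow the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Alternative sketch (hypothetical entity name and seu_err port): explicit
-- one-hot state vector with combinatorial parity detection.
entity onehot_fsm_slv is
  port (
    sysclk  : in  std_logic;
    reset   : in  std_logic;  -- asynchronous, active low
    intrans : in  std_logic;
    outsig  : out std_logic;
    seu_err : out std_logic   -- '1' while even parity is detected
  );
end entity onehot_fsm_slv;

architecture rtl of onehot_fsm_slv is
  subtype state_t is std_logic_vector(3 downto 0);
  constant IDLE : state_t := "0001";
  constant FSM1 : state_t := "0010";
  constant FSM2 : state_t := "0100";
  constant FSM3 : state_t := "1000";

  signal fsm      : state_t;
  signal next_fsm : state_t;
  signal err      : std_logic;
begin
  -- One-hot means odd parity, so XNOR of all bits = '1' flags an upset.
  err     <= not (fsm(3) xor fsm(2) xor fsm(1) xor fsm(0));
  seu_err <= err;

  -- Next-state combinatorial logic.
  process (fsm, intrans)
  begin
    case fsm is
      when IDLE =>
        if intrans = '1' then
          next_fsm <= FSM1;
        else
          next_fsm <= IDLE;
        end if;
      when FSM1   => next_fsm <= FSM2;
      when FSM2   => next_fsm <= FSM3;
      when FSM3   => next_fsm <= IDLE;
      when others => next_fsm <= IDLE;
    end case;
  end process;

  -- State register with the recovery multiplexer in front of it.
  process (sysclk, reset)
  begin
    if reset = '0' then
      fsm    <= IDLE;
      outsig <= '0';
    elsif rising_edge(sysclk) then
      if err = '1' then
        fsm <= IDLE;       -- recover to a known state on the next edge
      else
        fsm <= next_fsm;
      end if;
      if fsm = IDLE then
        outsig <= '1';
      else
        outsig <= '0';
      end if;
    end if;
  end process;
end architecture rtl;
```

A synthesis tool may still recognize this structure as a state machine and re-encode it, so the designer would need to confirm in the synthesized netlist that the one-hot register and the parity logic were preserved.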