Fault-Tolerance in VHDL Description:
Transient-Fault Injection & Early Reliability Estimation
Fabian Vargas, Alexandre Amory
Catholic University – PUCRS
Electrical Engineering Dept.
Av. Ipiranga, 6681. 90619-900 Porto Alegre
Brazil

Raoul Velazco
TIMA-INPG Laboratory
46, Av. Félix Viallet
38031 – Grenoble
France
Abstract
We present a new approach to estimate the reliability of complex circuits used in harsh environments, such as those exposed to radiation. Usually, this estimation is performed in the laboratory, by means of radiation facilities (particle accelerators), once the circuit has been fabricated. In our case, the estimation is made at an early stage of the design process: we estimate the expected tolerance of the complex circuit with respect to SEU during the VHDL specification step. The early-estimated reliability level is then used to balance the design process, trading off the maximum area overhead due to the insertion of redundancy against the minimum reliability required for a given application. This approach is being automated through the development of a CAD tool.
Keywords: Fault-tolerant Circuits; Reliability Estimation; Radiation-Exposed Environments; Single-Event Upset (SEU); High-Level Description (VHDL); Error Detection and Correction (EDAC) Codes.
1. Introduction
It is now generally accepted that the occurrence of transient faults in memory elements, commonly known as single-event upsets (SEUs), is a potential threat to the reliability of integrated circuits operating in radiation environments [1,2,3]. This subject is of considerable importance because SEUs occur in space/avionics applications due to the presence of high-energy particles. SEUs can also occur at ground level (due to atmospheric neutrons), which may affect the operation of digital systems based on future sub-micron technologies. Heavy particles incident on storage elements such as flip-flops, latches, and RAM memory cells produce a dense track of electron-hole pairs, and this ionization can modify their contents; such events are also referred to as soft errors [4,5,6]. It is interesting to note that the Hubble Space Telescope has had trouble not only with its mirrors. NASA's biggest space-based astronomical observatory also struggles daily with radiation-induced electronic failures. SEUs in a 1-Kbit low-power TTL bipolar random access memory cause one of Hubble's crucial focusing elements to lose data on a regular basis. Ironically, the chip has a well-documented history of SEU failures, and NASA officials knew about them before the craft was launched. Nothing was done, however, because Hubble's orbit takes it through relatively benign radiation territory. It is only when the telescope passes through the heavily proton-charged South Atlantic Anomaly over Brazil that problems occur, and NASA engineers have developed a software solution to compensate for the errors [7]. The same effects can be observed in SRAM-based FPGAs and commercial microprocessors when exposed to this type of radiation [8,9].
One possible technique to cope with SEU effects is the one proposed in [10,11], where the authors combine IDDQ current monitoring with coding techniques to cope with SEU-induced errors in SRAM memories. In this approach, current checking is performed on the SRAM columns and is combined with a single parity bit per SRAM word to perform error correction. This approach has proved to be very effective in detecting and correcting SEUs in SRAMs.
Another solution widely employed by designers to cope with the harmful consequences of SEUs is the use of Error Detection and Correction (EDAC) approaches. In this case, a Hamming code plus one parity bit is appended to each memory element (special/general-purpose registers or memory words). For memory elements 64 bits wide, this approach results in an area overhead of 10-15% per memory element. Despite this area overhead, it is the most commonly used approach to “harden” complex circuits against SEUs [13].
Another technique extensively used by designers to implement SEU-tolerant integrated circuits is the use of process-related enhancements, such as CMOS-SOI/SOS technologies [4,5,6]. In addition, some designers use hardware redundancy, in the form of extra transistors, to implement registers that are more stable against SEU disruptions [4,5]. Even though these approaches are expensive, they are very effective in coping with SEUs.
Although these approaches are successful in coping with transient faults in memory elements of complex circuits, one of their most important drawbacks is that the designer must reach the end of the design process before verifying how effectively the approach handles transient faults. In other words, the designer needs to fabricate the circuit and test it in a laboratory radiation environment in order to validate the design. As a consequence, both the time-to-market and the cost of such circuits increase dramatically if the validation fails, because the whole design process must restart.
To cope with this problem, this paper proposes an approach that estimates the reliability level of the circuit under development at an early step of the design process. This estimation is performed during circuit specification, at the level of the VHDL high-level description language. If the reliability level obtained with respect to transient faults in memory elements meets the level expected for a given application, then the designer can implement the circuit in an FPGA or an ASIC. Otherwise, the designer remains in the initial step of the design process in order to modify or improve the embedded fault-tolerant functions specified for the circuit under development.
At present, a tool that automates the insertion of the coding techniques into the circuit storage elements and estimates the resulting reliability (both steps performed at the VHDL description level) is under development.
2. Single-Event Upset (SEU)
In a CMOS static memory cell, the nodes sensitive to high-energy particles are the drains of off-transistors. Thus, two sensitive nodes are present in such a structure: the drains of the p-type and the n-type off-transistors.
When a single high-energy particle (typically a heavy ion) strikes a sensitive node of a memory cell, it loses energy via the production of electron-hole pairs, resulting in a densely ionized track in the local region of that element [12]. The charge collection process following a single particle strike is now described. Fig. 1 shows the simple example of a particle incident on a reverse-biased n+p junction, that is, the drain of an n-type off-transistor.
Fig. 1. Illustration of the charge collection mechanism that causes single-event upset: (a) particle strike and charge generation; (b) current pulse shape generated in the n+p junction during the collection of the charge.
Charge collection occurs by three processes that begin immediately after creation of the ionized track: drift in the equilibrium depletion region, diffusion, and funneling. A high electric field is present in the equilibrium depletion region, so carriers generated in that region are swept out rapidly; this process is called drift. Carriers generated beyond the equilibrium depletion region width, more specifically, the charge generated beyond the influence of the excess-carrier concentration gradients, can be collected by diffusion. The third process, charge funneling, also plays an important role in the collection process. Charge funneling involves a spreading of the field lines into the device substrate beyond the equilibrium depletion width, so that the charge generated by the incident particle over the funnel region is collected rapidly. If the charge Qd collected during these three processes is greater than the critical charge Qc of the memory cell, then the memory cell flips, inverting its logic state (the critical charge Qc of a memory cell is the greatest charge that can be deposited in the cell before it is corrupted, that is, before its logic state is inverted). Fig. 1b shows the current pulse shape expected to result from the three charge collection processes described above. This current pulse is generated between the reverse-biased n+ (resp. p+) drain depletion region and the p-substrate (resp. n-well), for the case of an n-well technology, for instance.
3. The Proposed Approach
The proposed approach is divided into two steps: (I) using a built-in reliability functions library, the designer specifies the coding techniques to be incorporated into the circuit; (II) then, by using a specific fault injection technique, the designer estimates the circuit reliability. Both steps are performed in the VHDL high-level description language.
3.1 Built-In Reliability Functions Library: achieving the desired circuit fault-tolerance
The first step of the approach is based on the incorporation of coding techniques into the original VHDL circuit description. Two coding approaches are considered: (a) a Hamming code plus one parity bit per storage element (single registers), to correct single errors and detect double errors (SEC/DED); and (b) a two-dimensional parity code applied to the columns and lines of embedded memory arrays. The first approach is more suitable for storage elements placed individually in the circuit, for instance in a microprocessor (e.g., the Program Counter (PC), Stack Pointer (SP), Exception Program Register (EPR), and Page Table Register (PTR)), while the second is more suitable for groups of storage elements, such as the Branch Prediction Table and the Address Translation Cache.
Note that for a register placed individually in the processor, the simplest (and more reliable) way to make it SEU-tolerant is to append a Hamming code plus one parity bit to the original bits of the register. For example, a 32-bit register can be protected by appending 7 check bits (6 bits for Hamming + 1 parity bit). The drawback of this approach is the large area overhead required to implement it: around 22% in this case.
Now, if we consider a large area containing several tens or hundreds of registers (such as an embedded memory), we can minimize the area overhead by appending only one parity bit per line and one parity bit per column of the memory array (i.e., the two-dimensional parity approach). For example, a memory of 64 32-bit words can be protected with 96 parity bits (64 lines + 32 columns). The drawback of this approach is that it may degrade speed for long sequences of write operations into the memory (because the integrity of the word parities must be maintained).
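The check-bit count quoted above follows from the Hamming bound for single-error correction. As a minimal sketch (the function names are illustrative and are not part of the FT-PRO tool), the following Python computes the number of SEC/DED check bits for a given register width and the resulting storage overhead. It counts only the extra storage bits and ignores the area of the encoder and checker logic.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a Hamming code plus one overall parity bit (SEC/DED)."""
    r = 0
    # Smallest r such that r check bits can point at any single-bit error
    # among data_bits + r positions, or signal "no error".
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # +1 for the overall parity bit


def storage_overhead(data_bits: int) -> float:
    """Extra storage bits relative to the original register width."""
    return secded_check_bits(data_bits) / data_bits


for width in (32, 64):
    print(width, secded_check_bits(width), f"{100 * storage_overhead(width):.1f}%")
```

For a 32-bit register this gives 7 check bits (about 22%), and for a 64-bit element it gives 8 check bits (12.5%).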
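The two-dimensional parity scheme can be illustrated with a small software model. This is only a sketch under stated assumptions: even parity is assumed, the parity bits are held in plain Python lists rather than in hardware, the memory contents are arbitrary, and all names are hypothetical. A single flipped bit is located at the intersection of the failing line parity and the failing column parity.

```python
from functools import reduce
from operator import xor

WIDTH = 32   # bits per word (columns)
DEPTH = 64   # words in the memory (lines)


def parity(word: int) -> int:
    """Return 1 if the word contains an odd number of 1 bits, else 0."""
    return bin(word).count("1") & 1


def line_parities(mem):
    return [parity(w) for w in mem]


def column_parities(mem, width=WIDTH):
    folded = reduce(xor, mem, 0)  # XOR of all words yields every column parity
    return [(folded >> c) & 1 for c in range(width)]


def locate_single_error(mem, ref_lines, ref_cols, width=WIDTH):
    """Return (line, column) of a single flipped bit, or None if parities match."""
    bad_lines = [i for i, p in enumerate(line_parities(mem)) if p != ref_lines[i]]
    bad_cols = [c for c, p in enumerate(column_parities(mem, width)) if p != ref_cols[c]]
    if not bad_lines and not bad_cols:
        return None
    if len(bad_lines) == 1 and len(bad_cols) == 1:
        return bad_lines[0], bad_cols[0]
    raise ValueError("error detected but cannot be localized")


# Arbitrary memory contents, chosen only for illustration.
memory = [(i * 2654435761) & 0xFFFFFFFF for i in range(DEPTH)]
ref_lines, ref_cols = line_parities(memory), column_parities(memory)
parity_bits = len(ref_lines) + len(ref_cols)   # 64 line bits + 32 column bits

memory[17] ^= 1 << 5                           # emulate an SEU in word 17, bit 5
line, col = locate_single_error(memory, ref_lines, ref_cols)
memory[line] ^= 1 << col                       # correct by flipping the bit back
```

The model also shows why writes are costly: every write to the memory changes a line parity and potentially several column parities, all of which must be updated.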
Compared to the modified Hamming approach, for a given period of time t, note that the two-dimensional parity technique results in lower reliability. This is true because the probability of occurrence of one error in 64 bits of a single line is greater than the probability of occurrence of one error in a single word of 32 bits. Note that the occurrence of the second error in the same memory line corrupts the whole information stored into the memory (in this case, the error is detected, but cannot be localized). In the case of the modified Hamming code, the occurrence of the second error in a given word is detected and cannot be localized either, but in this case, only the information in this word is lost, which confines the error and maintains the integrity of the rest of the information stored in the memory.
This approach is being automated through the development of the FT-PRO CAD tool; the main steps of its design flow are shown in fig. 2.
Fig. 3a shows the target block diagram generated by the FT-PRO tool after compiling the initial VHDL circuit description (“High-Reliability HW Part” block in fig. 2). This structure generates check bits and appends them to the information bits each time the application program writes data into a memory element. Similarly, each time data is read from a memory element, its integrity is checked by the Checker/Corrector Block; if a correctable error is found, this block writes the corrected data back into the memory element so that it can be read again by the application program. Fig. 3b and 3c present details of the Parity Generator and the Checker/Corrector Blocks shown in fig. 3a. The example shown in these figures targets an 8-bit word processor. Note that parities P1, P2, P4 and P8 used by the Hamming code are computed in parallel, while the computation of the whole-word parity (P0) is serial (see fig. 3b).
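A behavioral sketch of the Parity Generator and Checker/Corrector functions for an 8-bit word is given below in Python. It is not the circuit of fig. 3: the bit layout (parity bits P1, P2, P4, P8 at positions 1, 2, 4, 8 of a 12-bit Hamming word, with P0 stored at index 0) is the conventional textbook arrangement and is assumed here, as are all function names and status strings. Being software, it computes every parity sequentially.

```python
DATA_POS = (3, 5, 6, 7, 9, 10, 11, 12)   # data bit positions (assumed layout)
PARITY_POS = (1, 2, 4, 8)                 # P1, P2, P4, P8


def encode(data: int) -> list:
    """Parity generator: return 13 bits, index 0 = P0, indices 1..12 = Hamming word."""
    code = [0] * 13
    for i, pos in enumerate(DATA_POS):
        code[pos] = (data >> i) & 1
    for p in PARITY_POS:
        # Pp covers every position whose index has bit p set.
        code[p] = sum(code[k] for k in range(1, 13) if k & p) % 2
    code[0] = sum(code[1:]) % 2           # whole-word parity P0
    return code


def check(code: list):
    """Checker/corrector: return (status, data) for a 13-bit stored word."""
    syndrome = 0
    for p in PARITY_POS:
        if sum(code[k] for k in range(1, 13) if k & p) % 2:
            syndrome |= p
    overall = sum(code) % 2               # 0 when the whole-word parity holds
    fixed = list(code)
    if overall == 0:
        status = "ok" if syndrome == 0 else "double_error"
    elif syndrome <= 12:
        fixed[syndrome] ^= 1              # syndrome 0 means P0 itself flipped
        status = "corrected"
    else:
        status = "uncorrectable"          # only reachable with 3 or more errors
    data = sum(fixed[pos] << i for i, pos in enumerate(DATA_POS))
    return status, data


word = encode(0xA5)
word[6] ^= 1                              # emulate a single SEU
status, value = check(word)
```

With this layout, any single bit-flip (including one in a check bit) is corrected, and any double bit-flip is reported without being localized.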
Fig. 2. Block diagram of the FT-PRO tool being developed to automate the process of generating storage element transient-fault-tolerant complex circuits.
Fig. 3. (a) Target block diagram generated by the FT-PRO Tool. (b) Parity Generator Block. (c) Checker/Corrector Block.
3.2. Reliability Early-Estimation: injecting bit-flip faults (SEUs) in VHDL code
High-level description languages are now widely used to describe hardware systems as software programs. Consequently, a transient fault that affects the hardware operation can be considered as a fault affecting the software execution. In other words, a bit-flip fault affecting the hardware operation (e.g., an SEU-induced fault in a memory element) can have an equivalent representation at the software implementation level. In this section, we present the fault-injection technique we have developed to modify registers “on-the-fly” during VHDL simulation. The fault model assumed is not restricted to single faults; any combination of faults can occur in a memory element of the circuit.
The idea of our work is to verify the circuit reliability against SEU-induced faults at a high-level description, as early as possible in the design process. Of course, only the registers modified with EDAC codes are protected; the remaining, non-modified registers of the circuit are not reliable, and any fault affecting them may lead to a system failure. Therefore, the designer must be aware of this situation and balance the desired reliability level against the required hardware cost of the circuit under design. This is especially true for large memory portions of the circuit, such as embedded caches. In this particular case, the application may be able to tolerate a given number of errors (bit-flips) in the data cache block of the circuit. The designer can then decide not to protect this part of the circuit, in order to minimize the area overhead due to the insertion of the built-in reliability functions.
Note that the reliability is not only a function of which memory elements have been protected with EDAC codes, but also a function of the application itself. Since memory elements are checked only when they are used (i.e., read out) by the application, a memory element that remains unused for a long period may accumulate more errors than the EDAC code associated with it can handle.
The proposed approach works as follows: initially, we insert the (single or multiple) bit-flip fault in the VHDL code according to a predefined mean time between failures (MTBF). Then, we simulate the circuit by running a program (testbench) as close as possible to the application program. After simulation, we examine the primary outputs (POs) of the circuit to verify, for each injected bit-flip fault, whether it affected the functional operation of the circuit. One of three conclusions can then be drawn:
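The software view of a bit-flip fault can be sketched in a few lines of Python. This is an illustration of the fault model only, not the VHDL mechanism used by the authors; the function name and its parameters are hypothetical. A register is modelled as an integer, and any combination of bit positions can be inverted in one step with an XOR mask.

```python
def inject_bit_flips(value: int, positions, width: int) -> int:
    """Return the register value with the given bit positions inverted."""
    mask = 0
    for pos in positions:
        if not 0 <= pos < width:
            raise ValueError(f"bit position {pos} outside a {width}-bit register")
        mask |= 1 << pos               # a repeated position is flipped only once
    return (value ^ mask) & ((1 << width) - 1)


reg = 0b10100101
single = inject_bit_flips(reg, [0], width=8)       # single fault
multiple = inject_bit_flips(reg, [0, 7], width=8)  # multiple fault
```

Applying the same mask a second time restores the original value, which is convenient when a simulation run must be repeated without the fault.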
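The dependence on idle time can be made concrete with a simple probabilistic sketch. The model below is an assumption introduced for illustration and does not come from the text: upsets are taken to arrive independently as a Poisson process with a constant rate per bit, the element is protected by a code that corrects one error, and two upsets are assumed to hit distinct bits. The rate and time values are hypothetical.

```python
import math


def p_uncorrectable(bits: int, upset_rate_per_bit: float, idle_time: float) -> float:
    """Probability that two or more upsets accumulate between consecutive reads."""
    mu = bits * upset_rate_per_bit * idle_time   # expected number of upsets
    return 1.0 - math.exp(-mu) * (1.0 + mu)      # 1 - P(0 upsets) - P(1 upset)


# Hypothetical figures: a 39-bit protected register (32 data + 7 check bits)
# and an arbitrary upset rate of 1e-6 upsets per bit per hour.
for hours in (1, 100, 10_000):
    print(hours, p_uncorrectable(39, 1e-6, hours))
```

Under these assumptions the probability grows roughly with the square of the idle time while the expected number of upsets is small, which is why rarely read elements are the most exposed.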
a) the fault was not propagated to the POs; it is then considered redundant;
b) the fault was propagated to the POs of the circuit and was detected by the built-in reliability functions appended to the memory elements. (This can be verified by reading out the outputs of the comparators along with the VHDL code after simulation.) The reliability of the circuit is then maintained. (In cases (a) and (b), we have the generation of a codeword.)
c) finally, if the fault produced an erroneous PO and was not detected by the appended hardware (generation of a non-codeword), then the reliability of the circuit is reduced. This happens because either the reliability functions used in the program fail to detect such a fault, or the choice of the memory elements to be made fault-tolerant is not adequate (because important blocks of storage elements remain in their original form).
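The three conclusions above can be expressed as a small classification routine. The Python sketch below uses hypothetical names throughout. It assumes that each simulation run yields the primary outputs of the faulty circuit, the outputs of a fault-free reference run, and a flag raised by the built-in checkers. How to count a fault that raises the flag but leaves the POs unchanged is not stated in the text; this sketch counts it as detected.

```python
from enum import Enum


class Outcome(Enum):
    REDUNDANT = "a"    # fault never reached the primary outputs
    DETECTED = "b"     # fault was flagged by the built-in reliability functions
    UNDETECTED = "c"   # erroneous primary output with no flag raised


def classify(golden_po, faulty_po, error_flag: bool) -> Outcome:
    """Classify one injected fault from the results of its simulation run."""
    if error_flag:
        return Outcome.DETECTED
    if faulty_po == golden_po:
        return Outcome.REDUNDANT
    return Outcome.UNDETECTED


# Hypothetical primary-output traces from three simulation runs.
golden = [0x3C, 0x7F, 0x00]
results = [
    classify(golden, [0x3C, 0x7F, 0x00], error_flag=False),
    classify(golden, [0x3C, 0x7E, 0x00], error_flag=True),
    classify(golden, [0x3C, 0x7E, 0x00], error_flag=False),
]
```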
At the end of this process, once all input test vectors and faults have been applied to the circuit, we compute the overall bit-flip fault coverage as a function of the predefined MTBF for the target application as follows:
\[ \text{Bit-Flip\_Fault\_Coverage}(\mathrm{MTBF}) = \frac{K}{M - E} \]
where K is the number of detected bit-flip faults, M is the total number of injected bit-flip faults, and E is the number of redundant bit-flip faults in the VHDL code.
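As a worked illustration of the coverage figure, the following Python sketch divides the number of detected faults K by the number of non-redundant faults M - E. The function name and the counts used in the example are hypothetical and are not results from the paper.

```python
def bit_flip_fault_coverage(detected: int, injected: int, redundant: int) -> float:
    """Coverage = K / (M - E): detected faults over non-redundant injected faults."""
    effective = injected - redundant
    if effective <= 0:
        raise ValueError("no non-redundant faults were injected")
    if not 0 <= detected <= effective:
        raise ValueError("detected faults must lie between 0 and M - E")
    return detected / effective


# Hypothetical campaign: M = 250 injected, E = 50 redundant, K = 180 detected.
coverage = bit_flip_fault_coverage(detected=180, injected=250, redundant=50)
print(f"{coverage:.0%}")
```

Excluding the redundant faults from the denominator keeps faults that never reach the primary outputs from lowering the measured coverage.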
After this basic description of how the proposed approach interprets the results obtained from the fault injection procedure, in the following we describe the mechanism used to perform fault injection at the VHDL code level.
Fig. 4 presents the main structure used to inject faults in the VHDL code. In this approach, we use a Linear Feedback Shift Register (LFSR) to inject bit-flip faults into the selected memory element. The proposed approach presents three different operating modes:
(a) normal_mode. In this mode, the circuit is in normal operation and no fault injection is possible during the simulation process.