Common reliability analysis methods and procedures
TOC (for online reading)
0 Preface
1 A general introduction into reliability
2 Reliability metrics, simple
2.1 MTBF
2.2 Failure Rate
2.3 The bathtub curve
2.4 Availability
2.5 Reliability
3 Reliability metrics, detailed
3.1 Reliability
3.2 Hazard rate
3.3 Failure rate
3.4 Some important notes on the Weibull function
3.5 Weibull function: Examples
3.6 Mean time to failure (MTTF)
3.7 Availability
4 Distribution functions
4.1 Introduction
4.2 Normal
4.3 Lognormal
4.4 Weibull
4.5 Gumbel, Frechet, and again Weibull
4.6 Gamma
5 Common analysis methods and procedures – a selection
5.1 Reliability Prediction based upon standards
5.1.1 Introduction
5.1.2 How it basically works
5.1.3 Commonalities
5.1.4 Differences
5.1.5 The standards in detail
5.1.5.1 Mil-HDBK-217
5.1.5.2 Telcordia
5.1.5.3 IEC 62380
5.1.5.4 217 Plus
5.1.5.5 HRD 5 and 299B
5.1.5.6 NSWC
5.1.5.7 NPRD and EPRD
5.1.6 Relex and the standards
5.2 FMEA
5.2.1 Introduction
5.2.2 Automotive FMEA
5.2.2.1 Problems with (mainly automotive) FMEA
5.2.3 Military FMEA
5.2.4 Advantages
5.2.5 Disadvantages
5.2.6 Relex
5.3 Fault Tree
5.3.1 Relex
5.4 Reliability Block Diagram, RBD
5.4.1 Introduction
5.4.2 Implications
5.4.3 Relex RBD and Operation Simulation (OpSim)
5.5 Markov
5.6 Weibull
0 Preface
This document describes common reliability analysis methods, their scope, limitations, implications and areas of applicability, and gives some how-tos.
It focuses on the Reliawind project and therefore does not cover every method in the reliability world.
Apart from the math sections, this document does not require detailed reliability knowledge.
1 A general introduction into reliability
This is a high-level description of the basic aspects of reliability.
It does not assume any prior knowledge of or experience with reliability.
It is sufficient to understand the following simplified definition:
Reliability = the probability that an item performs a required function without failure.
Please note that this definition is very simplified and therefore incomplete.
However, it suggests "reliable = failure free", which is precise enough for understanding this chapter.
Reliability Standards
The first institution to tackle the wide field of reliability in a
- systematic,
- stringent and
- standardised
manner was the US Department of Defense (DOD) in the early 1950s.
Since then, the DOD and related institutions (RAC, RiAC, ...) have issued hundreds of documents dealing with various aspects of reliability at different levels of detail.
All these DOD-issued documents can be divided into
- Handbooks (Mil-HDBK-XXXXX)
- Standards (Mil-STD-YYYYY)
- Performance Specifications (Mil-PRF-ZZZZZ)
where XXXXX, YYYYY and ZZZZZ are 3- to 5-digit numbers.
The following list gives an impression of the vast scope and varying levels of detail of these documents. Please note that it is only a tiny sample of what in reality comprises many hundreds of items:
Document number / Title
MIL-HDBK-217 / Reliability prediction of electronic equipment
MIL-HDBK-470 / Designing and developing maintainable products and systems
MIL-STD-471 / Maintainability verification/demonstration/evaluation
MIL-HDBK-472 / Maintainability prediction
MIL-STD-1629 / Procedures for performing a failure mode, effects and criticality analysis
MIL-HDBK-189 / Reliability growth management
MIL-STD-1388 / DOD requirements for a logistic support analysis record
MIL-STD-105 / Sampling procedures and tables for inspection by attributes
MIL-PRF-38534 / General specification for hybrid microcircuits
MIL-STD-2074 / Failure classification for reliability testing
MIL-STD-414 / Sampling procedures and tables for inspection by variables for percent defective
MIL-STD-1472 / Human engineering design criteria for systems, equipment and facilities
MIL-HDBK-338 / Electronic reliability design handbook
As you can see, these documents cover many aspects of reliability, with an emphasis on electronic equipment.
Today, the DOD has declared many of these documents obsolete, but they still serve as guidelines, references and look-up material in civil industry.
Obsolete documents are available on the internet free of charge.
The main reasons for the obsolescence are:
- shrinking military budgets,
- "civil" reliability awareness is improving,
- the gap in reliability awareness between military and civil industries is narrowing.
Today, the DOD is increasingly willing to rely on "civil" reliability techniques.
Reliability Awareness
Apart from the military industries, today's civil industries with the highest level of reliability awareness include, but are not limited to:
- aviation
- space
- railway
- (nuclear) power
- medical
- automotive
- others, such as elevator manufacturers, ...
Significant indicators for "awareness" with respect to reliability are:
- company quality policy contains reliability goals,
- methods are established, understood and carried out by personnel,
- industry-specific standards exist,
- written and binding reliability specifications exist,
- suppliers invest significant effort in reliability:
o warranty database with corrective action process
o reliability engineers exist and have influence on R&D
o reliability tests are carried out and have influence on R&D
o ...
The main reason for this high awareness is that these industries are subject to requirements set by governmental or other authorities.
People may associate the above list of industries with safety rather than reliability.
Depending on the viewpoint, "reliability" and "safety" can be either the same or totally different things.
However, they are very often perceived as synonymous.
Basically, a safe system may be unreliable with respect to functions not directly related to safety. Conversely, and for the same reason, a reliable system may be unsafe.
From a technical viewpoint, a safety feature must be reliable in any case.
So when technicians and engineers talk about safety, they are actually dealing with reliability (at least in most cases).
Managing Reliability
It is quite easy and straightforward to explain reliability methods and metrics to engineers.
Furthermore, almost everybody would agree that supplying reliable products is a success factor.
And finally, almost nobody would deny that having reliability methods and techniques established would improve their company's business results.
However, practical experience shows that it is very difficult to build and maintain reliability awareness in companies.
Practical experience also shows that this is almost impossible without pressure from governmental or other authorities.
And even within regulated industries there is the occasional black sheep.
Uncertainty of Reliability
A further, essential characteristic of reliability appears when we look at results.
Reliability analysis results are typically highly imprecise and carry a great deal of uncertainty:
not only statistical uncertainty (which can be quantified), but also uncertainty regarding assumptions and conclusions (which can hardly be quantified).
Put provocatively, the accuracy of reliability analysis results is comparable to that of weather forecasts, nuclear physics, stock price predictions, ...
Even more provocatively: reliability analysis results are usually unreliable.
Reasons for the uncertainty of reliability results:
Typically there is not enough information and experience available, so the reliability analyst has to rely on expert judgement, plausibility and common sense.
This is all the more true because today's customers increasingly expect reliability to be determined in advance.
And, because most engineers perceive confidence intervals as "high math", even reliability analysts themselves may not be aware of the statistical uncertainty of their results.
As an example, Mil-HDBK-217 results appear to be "exact", like 1.05229 failures/million hours.
The little-known truth is that even the first digit is not certain.
The "real" (but unknown) result may typically lie in the range of 0.5 to 2.2.
In reliability workshops, the audience typically expects unambiguous statements from the trainer on how to perform a reliability analysis and which way is right or wrong.
Needless to say, participants take a big step forward as soon as they have understood the uncertainty of the reliability world.
After a successful reliability training, the participants are aware of the methods and standards available, their applicability, their pros and cons, and their uncertainty.
2 Reliability metrics, simple
2.1 MTBF
The most common (and, incidentally, the most misunderstood) reliability metric is the MTBF, mean time between failures. MTBF has no relationship with lifetime.
Some real-world MTBF examples (all from Telcordia Issue 2):
Assembly / Approx. MTBF
Personal computer / ~2,000 h
Computer hard disk drive / ~15,000 h
Computer mouse / ~100,000 h
Computer keyboard / ~25,000 h
Computer color monitor / ~8,000 h
Computer CD ROM drive / ~8,000 h
The mean time between failures (MTBF) is the average time between two failures of a system, assembly or component.
In most cases, MTBF implies a constant probability of failure over time; in other words, MTBF implies steady-state conditions.
From a mathematical viewpoint, this is an unnecessary limitation.
To better understand the meaning of MTBF, it is useful to look at some "extreme" examples (with MTBF = constant).
Example 1:
A rifleman shoots 1000 bullets at a target. The flight time to the target is 0.5 seconds.
Apparently, the lifetime of one bullet is 0.5 seconds.
Let us assume that 2 out of 1000 shots did not work properly due to failures directly related to the bullet.
MTBF = (1000 * 0.5 s) / 2 = 250 seconds;
Lifetime = 0.5 seconds
Example 2:
A complex system is running nonstop with 1 failure/month.
The system is designed to have a lifetime of 10 years.
MTBF = 1 month / 1 = 1 month;
Lifetime = 10 years
Conclusion:
- MTBF and lifetime are different metrics. They have no relationship.
- MTBF and lifetime can differ by many powers of ten.
- MTBF is a statistical value that applies to a population of items, and (apart from special cases) not to a specific item.
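The arithmetic of the two examples can be reproduced with a minimal Python sketch. The helper name `mtbf` and the choice of months as the unit for Example 2 are illustrative assumptions, not part of any standard.

```python
def mtbf(total_operating_time, failures):
    """Mean time between failures: accumulated operating time / number of failures."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to estimate an MTBF")
    return total_operating_time / failures


# Example 1: 1000 bullets, 0.5 s of "operation" each, 2 failures.
bullet_lifetime_s = 0.5
bullet_mtbf_s = mtbf(1000 * bullet_lifetime_s, 2)

# Example 2: one month of nonstop operation with 1 failure, 10 years design life.
system_mtbf_months = mtbf(1.0, 1)
system_lifetime_months = 10 * 12

print(bullet_mtbf_s)                                # 250.0 (seconds)
print(bullet_mtbf_s / bullet_lifetime_s)            # 500.0: MTBF far above lifetime
print(system_lifetime_months / system_mtbf_months)  # 120.0: lifetime far above MTBF
```

In the first example the MTBF is 500 times the lifetime, in the second the lifetime is 120 times the MTBF, which shows that the two metrics are independent of each other.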
Although there are further interesting aspects of MTBF, we will not cover them here.
Instead, we will discuss them in the following section, "Failure Rate".
Since the failure rate is just the reciprocal of the MTBF, all aspects discussed in the failure rate section also apply to the MTBF.
2.2 Failure Rate
As mentioned before, the failure rate is the reciprocal of the MTBF.
A rate is a number of occurrences per time unit.
→ Failure rate = number of failures per time unit.
Common time units are 1,000,000 hours and 1,000,000,000 hours.
With these time units, real-life failure rates can be written in a legible and manageable form, e.g. 123 instead of 0.000123.
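As a small illustration of the reciprocal relationship and the unit scaling, the following Python sketch converts an MTBF into a failure rate. The function names are hypothetical, and the 15,000 h input is the approximate hard disk drive MTBF quoted in section 2.1.

```python
def failure_rate_per_hour(mtbf_hours):
    """Failure rate as the reciprocal of the MTBF (failures per hour)."""
    return 1.0 / mtbf_hours


def per_time_unit(rate_per_hour, unit_hours):
    """Rescale a per-hour failure rate to failures per `unit_hours` hours."""
    return rate_per_hour * unit_hours


hdd_rate = failure_rate_per_hour(15_000)          # about 0.0000667 failures per hour
hdd_rate_per_million = per_time_unit(hdd_rate, 1_000_000)

print(round(hdd_rate_per_million, 1))             # 66.7 failures per million hours
print(round(per_time_unit(0.000123, 1_000_000), 3))  # 123.0, as in the text above
```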
2.3 The bathtub curve
Practical experience in many industries shows that, in many cases, the failure rate over time looks like a bathtub:
[Figure: bathtub curve, failure rate versus time]
In accordance with practical experience, the bathtub curve can be divided into 3 phases:
Phase / Failure rate
1. Early failure phase, infant mortality phase / Decreasing
2. Useful product life phase / Constant
3. Wear-out phase / Increasing
(The bathtub curve is actually a superposition of the 3 graphs)
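The superposition can be sketched numerically. The Python example below is only an illustration under stated assumptions: the early-failure and wear-out components are modelled with the Weibull hazard rate h(t) = (beta / eta) * (t / eta)^(beta - 1), and all parameter values are invented to produce a visible bathtub shape, not taken from any standard or data set.

```python
def weibull_hazard(t, beta, eta):
    """Weibull hazard rate for t > 0; decreasing if beta < 1, increasing if beta > 1."""
    return (beta / eta) * (t / eta) ** (beta - 1)


def bathtub_hazard(t):
    """Sum of three failure-rate components (all parameters are assumed values)."""
    early = weibull_hazard(t, beta=0.5, eta=2_000.0)    # decreasing: infant mortality
    constant = 1.0e-4                                    # constant: useful life
    wearout = weibull_hazard(t, beta=5.0, eta=20_000.0)  # increasing: wear-out
    return early + constant + wearout


for hours in (10, 100, 1_000, 10_000, 20_000, 30_000):
    print(f"{hours:>6} h  {bathtub_hazard(hours):.2e} failures/h")
```

With these assumed parameters the summed rate falls steeply at first, levels off in the middle and rises again towards the end.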
1. Infant mortality phase
This phase is usually much shorter than indicated in the bathtub sketch above.
"Weak" individuals, individuals with initial damage, or individuals that are poor for other reasons continuously die out during this phase.
The weaker the individual, the higher the likelihood of death per time interval.
The result is a decreasing failure rate over time.
From a customer viewpoint, it is very important that this phase be excluded from the useful product lifetime.
Burn-in is a common means of preventing weak individuals from being sold to customers. During burn-in, all individuals are exposed to a well-defined stress profile.
Ideally, the stress profile is severe enough to identify weak individuals but does not harm "sound" individuals.
In some cases, burn-in is just a limited period of normal operation.
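Why a population containing a few weak individuals shows a decreasing failure rate, and what burn-in achieves, can be illustrated with a minimal Python sketch. It assumes a population of just two groups, "weak" and "sound", each with its own constant failure rate. The 5 % weak fraction and both rates are invented for illustration.

```python
import math

WEAK_FRACTION = 0.05   # assumed share of weak individuals at t = 0
RATE_WEAK = 1.0e-2     # assumed failures per hour for a weak individual
RATE_SOUND = 1.0e-5    # assumed failures per hour for a sound individual


def surviving_shares(t):
    """Shares of the original population still alive at time t (weak, sound)."""
    weak = WEAK_FRACTION * math.exp(-RATE_WEAK * t)
    sound = (1.0 - WEAK_FRACTION) * math.exp(-RATE_SOUND * t)
    return weak, sound


def population_failure_rate(t):
    """Failure rate of the survivors: rate of each group weighted by its share."""
    weak, sound = surviving_shares(t)
    return (RATE_WEAK * weak + RATE_SOUND * sound) / (weak + sound)


def weak_share_of_survivors(t):
    """Fraction of the survivors at time t that are weak individuals."""
    weak, sound = surviving_shares(t)
    return weak / (weak + sound)


for hours in (0, 100, 500, 1_000):
    print(hours, f"{population_failure_rate(hours):.2e}",
          f"{weak_share_of_survivors(hours):.4%}")
```

The population failure rate decreases towards the rate of the sound individuals as the weak ones die out, so a burn-in period of a few hundred hours would, under these assumptions, remove most weak individuals before delivery.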
2. Useful product life phase
This phase, and only this phase, is exactly what customers want to see.
Most literature states that the failure rate is constant during this phase.
This statement is indeed true, but it is misleading.
The idea behind it is far from trivial.
To better understand "constant failure rate", let us first examine the consequences.
The mathematical statement "constant failure rate" is equivalent to the following plain-language statements:
- "failures occur randomly"
- "single failures are unpredictable"
- "the likelihood of failure is independent of the previous history"
- "there is no aging. The individuals are young forever, no matter how long they have been in operation"
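The "no aging" property can be checked numerically. The Python sketch below assumes the standard consequence of a constant failure rate, namely an exponential survival probability R(t) = exp(-rate * t). The rate of 1e-4 failures per hour and the function names are illustrative assumptions.

```python
import math

RATE = 1.0e-4  # assumed constant failure rate, failures per hour


def reliability(t, rate=RATE):
    """Probability of surviving from time 0 to time t at a constant failure rate."""
    return math.exp(-rate * t)


def conditional_survival(age, mission, rate=RATE):
    """Probability that an item which has survived `age` hours survives `mission` more."""
    return reliability(age + mission, rate) / reliability(age, rate)


mission_hours = 1_000
for age_hours in (0, 1_000, 10_000, 100_000):
    print(age_hours, round(conditional_survival(age_hours, mission_hours), 6))
```

Every line prints the same survival probability: under a constant failure rate, an item that has already run for 100,000 hours is as likely to survive the next 1,000 hours as a brand-new one.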