Reading: Determine and plan problem resolution
Determine and plan problem resolution
Inside this reading:
Typical system problems and causes
Boot-up time faults
Poor performance
Network faults
Software and hardware design flaws
Compatibility faults
System misconfiguration and corruption
Problem solving skills
Fault tree analysis
Hierarchical task analysis
Cause and effect diagrams
Formulate a solution or rectification
Implementing a solution
Summary
Typical system problems and causes
The first part of this reading will present a range of typical faults that are likely to occur in computer systems.
Boot-up time faults
Boot-up time faults are those faults that occur during the boot-up sequence. The boot-up sequence is the first major process that occurs when a computer system is turned on. This is a critical stage as the boot-up sequence is especially susceptible to faults, which might render the system unusable.
Generally boot-up faults can be caused by:
- POST failure: POST, or Power-On Self-Test, is an initial test that a computer system executes automatically when turned on. The system uses POST to test its integrity by ensuring that all basic functions and components are free of faults. POST generally tests components such as the CPU, mainboard, RAM, hard drives and input devices.
- Boot device failure: If the device that contains the Master Boot Record (MBR) fails, the boot-up sequence fails. Generally, a hard disk drive contains the MBR.
- Operating system failure: If there is a major fault with the operating system, the boot-up sequence stops. Generally, OS faults are caused by misconfiguration, corrupted system files, hardware or software faults that did not appear during POST, or compatibility problems.
- Minor faults: These faults are not severe enough to halt the boot-up process, but they can affect the functionality of the system. For instance, a peripheral device might not initialise correctly at boot time because the proper device drivers are not installed.
Poor performance
Usually, poor performance does not have a critical impact on a system. In some cases though, the lack of system resources can have severe enough consequences to stop certain functions. For example, a severely congested network might not allow a user to log on to the network (some would argue this is critical enough!), due to timeout errors.
The underlying cause of faults related to poor performance can be found in one or more of the following:
- Not enough Random Access Memory (RAM): This might prevent certain applications from launching or functioning normally.
- Not enough virtual memory: A system that does not have enough free disk space for a paging file (virtual memory) might not execute correctly, possibly halting or operating very slowly.
- Slow Central Processing Unit (CPU): A slow CPU cannot deal with processing requests in a timely fashion. Applications that require timely processing might not function correctly, or might time out, and the system as a whole will operate very slowly.
- Slow network: Network problems are difficult to solve because there are many possible sources of bottlenecks. Slow servers, slow hubs, switches or routers, slow WAN links, slow network cards, or simply congestion will cause slow networks. Slow networks can produce timeout errors or delays in applications. They also tend to produce a large number of errors that require retransmissions, which in turn congest the network even further.
- Input/Output (I/O) bottleneck: This generally relates to slow hard disk drives. Historically, hard disk drives have lagged behind CPUs and RAM in terms of bandwidth. A system with little RAM will depend heavily on virtual memory (disk space used as a substitute for RAM), increasing the demand for I/O from the hard drive and worsening an existing bottleneck.
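As a rough illustration of how the causes above might be told apart, the following Python sketch compares one set of measurements against thresholds. The thresholds, the Snapshot fields and the function name are assumptions made for this example, not values from any standard, and how the measurements are gathered is left to whatever monitoring tool is available. Network slowness is left out because it usually needs measurements from several devices.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds chosen for illustration only; tune them per system.
MIN_FREE_RAM_MB = 256
MIN_FREE_DISK_MB = 1024    # space available for the paging file
MAX_CPU_LOAD = 0.90        # fraction of CPU time in use
MAX_DISK_QUEUE = 2.0       # average number of outstanding I/O requests


@dataclass
class Snapshot:
    """One set of measurements taken from the system under investigation."""
    free_ram_mb: float
    free_disk_mb: float
    cpu_load: float
    disk_queue: float


def likely_causes(s: Snapshot) -> List[str]:
    """Return the performance causes that the measurements point to."""
    causes = []
    if s.free_ram_mb < MIN_FREE_RAM_MB:
        causes.append("not enough RAM")
    if s.free_disk_mb < MIN_FREE_DISK_MB:
        causes.append("not enough virtual memory (paging file space)")
    if s.cpu_load > MAX_CPU_LOAD:
        causes.append("slow or overloaded CPU")
    if s.disk_queue > MAX_DISK_QUEUE:
        causes.append("I/O bottleneck")
    return causes


if __name__ == "__main__":
    print(likely_causes(Snapshot(free_ram_mb=100, free_disk_mb=500,
                                 cpu_load=0.35, disk_queue=3.5)))
```

More than one cause can be reported at once, which matches the point above that low RAM and a slow disk tend to make each other worse.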
Network faults
Network Faults are complex, and their source can be varied. Typical network faults can be related to:
- Performance: Network congestion can be a significant problem. Generally, networks that are very heavily used might experience performance issues and congestion. Poor design could also be the cause.
- Errors: Network errors can be caused by faulty equipment (e.g. faulty cabling, a faulty switch or even a faulty Network Interface Card). Congestion can also be a source of errors, as retransmission requests increase. Finally, misconfiguration can lead to significant error rates.
- Security: Network security faults can be complex and varied. Their source can be found in misconfiguration, hardware and software design flaws, documented and undocumented bugs, vulnerabilities and so on.
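The link between errors, retransmissions and congestion can be made concrete with a small calculation. The Python sketch below assumes, purely for illustration, that every transmission attempt fails independently with the same probability; real networks are messier, and the function names are made up for this example. Under that assumption, a frame is sent on average 1 / (1 - error rate) times before it arrives intact.

```python
def expected_transmissions(error_rate: float) -> float:
    """Average number of times a frame is sent until it arrives intact.

    Assumes each attempt fails independently with probability error_rate.
    """
    if not 0.0 <= error_rate < 1.0:
        raise ValueError("error_rate must be at least 0 and less than 1")
    return 1.0 / (1.0 - error_rate)


def carried_load(useful_load: float, error_rate: float) -> float:
    """Traffic actually placed on the link, as a fraction of its capacity.

    useful_load is the traffic the users want to send (0.6 means 60% of
    capacity); retransmissions are added on top of it.
    """
    return useful_load * expected_transmissions(error_rate)


if __name__ == "__main__":
    for rate in (0.0, 0.1, 0.2, 0.4):
        print(f"error rate {rate:.0%}: link load {carried_load(0.6, rate):.0%}")
```

With a useful load of 60% of capacity, a 20% error rate raises the load on the link to 75%, and a 40% error rate is enough to saturate it.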
Software and hardware design flaws
Computer systems nowadays are incredibly complex. Take an operating system such as Windows or Linux: each is built from millions of lines of code. The chance of something going wrong can be expected to increase with the complexity of the system. It is an accepted fact that new releases of operating systems and applications will experience some ‘teething’ problems. Often, products will not become reliable until at least several months after release and after one or two service packs have been released.
Software applications can also be ‘buggy’ and usually benefit from the regular release of patches and ‘hot-fixes’. Needless to say, as with the OS, administrators are responsible for deploying these fixes.
Hardware design flaws are not uncommon either. Many hardware manufacturers (particularly network equipment manufacturers) release updates for their products in the form of ROM patches (usually called firmware), which can be ‘flashed’ onto a device.
Compatibility faults
Compatibility refers to the ability of components (software or hardware) to function and interact properly without faults. Compatible components are designed to a certain standard or guideline so that functionality (and hence compatibility) is assured.
It may seem unusual that someone would deploy or install incompatible components. In practice, incompatibility is sometimes not known straight away, not even to the developers themselves, until thorough field testing has been conducted.
Hardware compatibility issues arise less often, as hardware compatibility is easier to establish beforehand. In addition, hardware devices will initially either work or not. With software, it is not as clear-cut. Software versioning is a big problem: some versions introduce variations of system files and libraries whose compatibility cannot be established until they are deployed and fully tested. Most people have at some point encountered a compatibility problem that did not manifest itself during installation, but only appeared later under a specific set of circumstances.
Generally, compatibility can be ascertained by following certain guidelines:
- The manufacturer of the component vouches for or discloses whether the component is compatible (e.g. a ‘Built for Windows XP’ logo).
- The host system meets the software and hardware requirements (e.g. Windows 98 SE or higher, 300 MHz processor, 64 MB RAM, 200 MB disk space, DirectX 9).
- Components are known to support a common technology. For instance, the components support Fast Ethernet and TCP/IP.
- Careful scrutiny of the product specification indicates compatibility.
- The vendor, when contacted, provides further information.
- If compatibility is still uncertain, physical installation and configuration in a testing environment might establish it. This is not always advisable, as physical damage to hardware might result if devices are not compatible.
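The second guideline, checking the host against the stated requirements, lends itself to a simple automated check. The Python sketch below uses the hardware figures from the example above (300 MHz, 64 MB RAM, 200 MB disk space); the Spec class and function name are hypothetical, and software requirements such as the operating system version are deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Spec:
    """Hardware figures for a host, or the minimum a product requires."""
    cpu_mhz: int
    ram_mb: int
    free_disk_mb: int


# Minimum hardware requirements taken from the example in the list above.
MINIMUM = Spec(cpu_mhz=300, ram_mb=64, free_disk_mb=200)


def unmet_requirements(host: Spec, minimum: Spec = MINIMUM) -> List[str]:
    """Describe every requirement the host fails to meet (empty if none)."""
    problems = []
    if host.cpu_mhz < minimum.cpu_mhz:
        problems.append(
            f"CPU {host.cpu_mhz} MHz is below the required {minimum.cpu_mhz} MHz")
    if host.ram_mb < minimum.ram_mb:
        problems.append(
            f"RAM {host.ram_mb} MB is below the required {minimum.ram_mb} MB")
    if host.free_disk_mb < minimum.free_disk_mb:
        problems.append(
            f"Free disk space {host.free_disk_mb} MB is below the required "
            f"{minimum.free_disk_mb} MB")
    return problems


if __name__ == "__main__":
    for line in unmet_requirements(Spec(cpu_mhz=233, ram_mb=64, free_disk_mb=150)):
        print(line)
```

An empty result only shows that the stated minimums are met; it does not prove compatibility, which is why the other guidelines still apply.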
Compatibility issues can produce sporadic faults, particularly when compatibility cannot be clearly established. As usual, faults need to be assessed in line with normal IT management policy to determine their criticality and the appropriate action.
System misconfiguration and corruption
System misconfiguration can be a significant problem. Generally, misconfiguration can lead to a variety of faults:
- System Services not available
- Network function impaired or not available
- Applications might not work
- Devices might not function
- Poor Performance
- In the worst case, system is unusable.
Misconfiguration can be caused by a lack of knowledge or experience among technical personnel, human error, a failed process (for example, a system file becomes corrupt due to a disk failure), malicious software, or hacking/cracking.
Corruption refers to system files or configuration becoming unusable or unavailable due to a fault. Commonly, corruption can occur due to:
- Hardware failure: A failed hard disk, faulty memory or another faulty component.
- Buggy software: Software that has access to system files may, through poor software engineering, modify those files without reason.
- Security compromise: Malicious software or security breaches can deliberately modify system configuration to render a system unusable.
Problem solving skills
There are several problem solving skills that a technician should endeavour to develop. This part of the reading will enable the reader to develop an understanding of problem solving methods such as Fault Tree Analysis (FTA), Hierarchical Task Analysis (HTA) and Cause and Effect analysis. All of these methods are supported by the scientific method introduced in the first learning pack.
Fault tree analysis
Fault tree analysis is the process of analysing a fault by using a decision tree. Decision trees can be constructed in advance for common troubleshooting tasks, or they can be constructed ad hoc for new faults.
Generally, decision trees are based on prior knowledge of the expected behaviour of computer system components. For instance, a user may perform a specific task that should produce an expected result or outcome. A technician would then analyse the outcome and determine whether or not the result is what was expected. Either way, the technician can consult a decision tree, which indicates a suggested course of action. The following example is a simple decision tree that would help a technician troubleshoot a fault for a user who cannot access his/her e-mail. Take a minute to consider this decision tree.
Figure: Example of decision tree
Decision trees are very helpful for first level troubleshooting. First level troubleshooting is usually done by a help desk or support person who has good knowledge of IT systems but is generally not regarded as an expert. Decision trees are less helpful when faults are difficult and out of the ordinary; in this case an expert may be engaged.
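A decision tree of this kind can also be written down as a simple data structure. The Python sketch below is an illustration only: the questions and suggested actions are invented for this example and are not taken from the figure. Each inner node holds a yes/no question, and each leaf holds a suggested course of action.

```python
# Hypothetical decision tree for "user cannot access e-mail".
# Inner nodes are dicts holding a yes/no question; leaves are suggested actions.
EMAIL_TREE = {
    "question": "Can the computer reach the network?",
    "no": "Check cabling and network configuration",
    "yes": {
        "question": "Can the mail server be reached?",
        "no": "Report a possible mail server outage",
        "yes": {
            "question": "Are the user's account details accepted?",
            "no": "Reset or re-enter the account details",
            "yes": "Escalate to an expert",
        },
    },
}


def consult(tree, answers):
    """Follow the yes/no answers down the tree and return the suggested action.

    answers maps each question text to True (yes) or False (no); only the
    questions actually reached need to be present.
    """
    node = tree
    while isinstance(node, dict):
        branch = "yes" if answers[node["question"]] else "no"
        node = node[branch]
    return node


if __name__ == "__main__":
    observed = {
        "Can the computer reach the network?": True,
        "Can the mail server be reached?": False,
    }
    print(consult(EMAIL_TREE, observed))
```

The last leaf reflects the point made above: when the tree runs out of ordinary explanations, the fault is handed to an expert.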
Hierarchical task analysis
Hierarchical Task Analysis (HTA) is another valuable skill that can be employed for fault-finding purposes. An HTA is a logical representation of a process and of the steps that must occur for this process to begin and finish successfully. The following diagram is an example of an HTA for the boot-up sequence of a typical computer system.
Figure: Sample Hierarchical Task Analysis (HTA) diagram
This HTA shows how a boot-up sequence is expected to happen. An HTA is generally very simple: it only shows a series of small tasks, in sequential order, that make up the bigger task or process.
The following steps are included in this sample:
- Begin the boot up process
- System is powered up. If successful, continue to the next step; otherwise halt the system.
- POST (Power-On Self-Test) takes place. If successful, continue to the next step; otherwise halt the system.
- Locate the active drive (typically a hard disk drive) and the MBR (Master Boot Record). If successful, continue to the next step; otherwise halt the system.
- Execute the bootloader (or bootstrap) program. If successful, continue to the next step; otherwise halt the system.
- Find the operating system and begin loading. If successful, continue to the next step; otherwise halt the system.
- Finish loading the OS and present the user interface. If successful, continue to the next step; otherwise halt the system.
- Boot-up sequence completed
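The steps above can be sketched as an ordered list that is checked from the top until a step fails. In the Python sketch below, the step names follow the list above, while the function name and the idea of recording each step's observed outcome in a dictionary are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# The sub-tasks of the boot-up sequence, in the order they must succeed.
BOOT_STEPS = [
    "Power up system",
    "POST",
    "Locate active drive and MBR",
    "Execute bootloader",
    "Find operating system and begin loading",
    "Finish loading OS and present user interface",
]


def first_failed_step(steps: List[str], outcome: Dict[str, bool]) -> Optional[str]:
    """Return the first step that did not succeed, or None if all succeeded.

    outcome maps a step name to True (succeeded) or False (failed). A step
    with no recorded outcome is treated as not reached, and so as failed.
    """
    for step in steps:
        if not outcome.get(step, False):
            return step
    return None


if __name__ == "__main__":
    observed = {"Power up system": True, "POST": True,
                "Locate active drive and MBR": False}
    print("Boot halted at:", first_failed_step(BOOT_STEPS, observed))
```

Because the steps are strictly sequential, knowing the last step that succeeded narrows the fault to the step that follows it.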
HTA can be very helpful and may be used in conjunction with other tools, such as decision trees, during the fault finding process. A major advantage of HTA diagrams is that they are simple to construct, although a good knowledge of the system is required in order to understand which steps need to be included. Due to their usefulness, HTA diagrams are used not only in IT but across many fields of industry.
Cause and effect diagrams
Cause and effect is another method that can be used by troubleshooting technicians. Cause and effect is a method which allows a technician to analyse the possible causes of faults (the undesired negative effects). The Cause and Effect method is usually implemented by using Cause and Effect diagrams.
What is a Cause and Effect diagram? It is a graphic tool that helps identify, sort, and display possible causes of a problem or quality characteristic. These diagrams are sometimes known as fishbone diagrams due to their shape.
What are the benefits of Cause and Effect diagrams?
- Helps determine root causes
- Encourages team participation
- Uses an orderly, easy-to-read format
- Indicates possible causes of variation
- Increases process knowledge
- Identifies areas for collecting data
The following sample is a general layout for Cause and Effect Diagrams.
Figure: Cause and Effect "fishbone" diagram
All Cause and Effect diagrams begin with the effect (in our case the undesired [negative] effect or fault) being stated as the starting point. Through analysis and brainstorming, one can begin to add possible causes of that effect. In turn, each possible cause is analysed to work out the underlying circumstances that might have led to it. For example, if the [negative] effect is that a user lost a file kept on a disk drive, one possible cause could be that the disk drive’s file system experienced corruption. A further question must then be asked: why did the file system become corrupt? You will see that fishbone diagrams grow in complexity as each possible cause is further analysed. Have a look at the following diagram, where the [negative] effect is ‘computer downtime’, and at how each of the potential causes is analysed to gain further insight into the problem.
Figure: More detailed Cause and Effect "fishbone" diagram
Cause and Effect analysis through Cause and Effect diagrams is a very valuable skill that technicians should endeavour to develop. When used in conjunction with the other two methods presented in this reading (FTA & HTA), cause and effect analysis becomes a very powerful tool for fault finding.
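The way a fishbone diagram grows as each cause is questioned further can be mirrored in a nested data structure. The Python sketch below uses the lost-file example from the text; the effect and the ‘file system corruption’ cause come from that example, while every other cause and both function names are hypothetical additions for illustration.

```python
from typing import Dict, List

EFFECT = "User lost a file kept on a disk drive"

# Each cause maps to the deeper causes found by asking "why?" again.
# An empty dict means the cause has not been analysed any further.
CAUSES: Dict[str, dict] = {
    "File system corruption": {
        "Disk hardware failure": {},
        "Power lost during a write": {},
    },
    "File deleted by mistake": {},
}


def outline(causes: Dict[str, dict], depth: int = 0) -> List[str]:
    """Render the causes as an indented outline, one line per cause."""
    lines = []
    for cause, deeper in causes.items():
        lines.append("  " * depth + cause)
        lines.extend(outline(deeper, depth + 1))
    return lines


def deepest_causes(causes: Dict[str, dict]) -> List[str]:
    """Collect the causes with nothing beneath them: root cause candidates."""
    found = []
    for cause, deeper in causes.items():
        found.extend(deepest_causes(deeper) if deeper else [cause])
    return found


if __name__ == "__main__":
    print(EFFECT)
    print("\n".join(outline(CAUSES, depth=1)))
```

The causes at the ends of the branches are only candidates; each still has to be confirmed or ruled out by investigation.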
Formulate a solution or rectification
Once the source of a fault has been clearly identified, a solution must be formulated. Clearly, the solution will normally depend on the nature of the fault. In general the following approaches can be taken in order to reach a satisfactory outcome:
- Replace or fix the component, whether hardware or software, that is known to be the cause of the fault.
- Install software patches or hot fixes provided by the software manufacturer or developer.
- Install or update ROM (firmware) where possible; older equipment might not support ‘flashing’ ROM. Updates are normally provided by equipment manufacturers.
- Adjust the configuration, if misconfiguration is the cause of the fault.
- Implement a workaround. Generally, a workaround is an acceptable solution when a fault cannot be solved, or when the solution is uneconomical or would have an undesirable effect or negative impact on the system.
Implementing a solution
Depending on the fault and its cause, implementing a solution can be fairly straightforward. A technician might be able to provide an instant solution to a trivial fault or to a fault that is regarded as common. Experienced technicians are able to quickly arrive at a solution based on well-known symptoms. For instance, if a computer system has been infected by a well-known virus, the technician should be able to take remedial action on the spot by updating virus definitions, reconfiguring the operating system, and deleting infected files.
When faults are more significant and complex (and possibly critical), planning is required. Sometimes it is not possible or advisable for a technician to attempt to fix a problem without planning and without making sure that the implications of remedial action are well understood. The following questions should be answered:
- Are replacement software and hardware components on hand?
- Do software and hardware components need procuring? If yes, are funds available to procure needed components?
- Is a fix/patch available from manufacturer?
- Is the impact of remedial action well understood (e.g. is downtime acceptable, and what is the potential loss of revenue)?
- Will the impact on the client be minimal?
- Does the company have skilled personnel who can fix the problem?
- Should external help be sought?
- Would a workaround be the preferred solution?
- Has a rollback strategy been devised?
- Are any training, education or procedural changes required?
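One simple way to make sure none of these questions is skipped is to keep them as a checklist and record an answer against each. The Python sketch below is an illustration only; the short question keys and function names are hypothetical, and the answers shown are invented.

```python
from typing import Dict, List

# Short keys standing in for the planning questions listed above.
PLANNING_QUESTIONS = [
    "components_on_hand",
    "procurement_and_funds",
    "fix_available_from_manufacturer",
    "impact_understood",
    "client_impact_minimal",
    "skilled_personnel_available",
    "external_help_needed",
    "workaround_preferred",
    "rollback_strategy_devised",
    "training_or_procedure_changes",
]


def unanswered(answers: Dict[str, str]) -> List[str]:
    """Return the planning questions that have no recorded answer yet."""
    return [q for q in PLANNING_QUESTIONS if not answers.get(q, "").strip()]


def planning_complete(answers: Dict[str, str]) -> bool:
    """Planning is complete only when every question has an answer."""
    return not unanswered(answers)


if __name__ == "__main__":
    notes = {
        "components_on_hand": "Yes, spare disk in store",
        "rollback_strategy_devised": "Restore from last night's backup",
    }
    print("Still to answer:", unanswered(notes))
```

The sketch only checks that each question has been considered; whether the answers justify going ahead remains a judgement for the technician.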
The next step to be taken after applying a fault fix is to perform testing. In support circles, this process is named ‘acceptance testing’. An ‘Acceptance Test’ can be defined as a formal test conducted to determine whether or not a system satisfies its acceptance criteria and to enable the customer to determine whether or not to accept the system. Acceptance tests may also be known as Functional Tests. In other words, acceptance testing allows a technician to ascertain whether the fault has truly been fixed, and whether the client has recognised the fault as fixed.
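The two parts of that definition, the acceptance criteria and the client's decision, can be sketched as follows. This Python example is a minimal illustration under assumed names: the criteria shown are invented stand-ins for real checks, and in practice each would exercise the repaired system.

```python
from typing import Callable, Dict, List


def failed_criteria(criteria: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every acceptance check and return the names of those that fail."""
    return [name for name, check in criteria.items() if not check()]


def is_accepted(criteria: Dict[str, Callable[[], bool]],
                client_signed_off: bool) -> bool:
    """A fix is accepted only if every criterion passes and the client agrees."""
    return client_signed_off and not failed_criteria(criteria)


if __name__ == "__main__":
    # Hypothetical criteria for a repaired e-mail fault; the lambdas stand in
    # for real checks against the system.
    criteria = {
        "User can log on to the mail server": lambda: True,
        "User can send and receive a test message": lambda: True,
    }
    print("Failed:", failed_criteria(criteria))
    print("Accepted:", is_accepted(criteria, client_signed_off=True))
```

Keeping the client's sign-off as a separate input reflects the point above that passing the technical checks alone does not complete the acceptance test.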
Please note that ‘Planning for fault rectification’ and ‘Acceptance testing’ are covered comprehensively in Learning Pack ‘Rectify Fault and Test’.
Summary
This reading has described a variety of typical system faults which you are likely to encounter in your troubleshooting quest. These are typical faults that many technicians have found before; hence, the causes for these problems are well documented. Additionally, this reading has introduced some valuable troubleshooting methods such as the Fault Tree Analysis, the Hierarchical Task Analysis and the Cause and Effect methods. These methods are very useful for streamlining the fault finding process, and will provide you with a guideline and hopefully help you solve more faults in a shorter period.
This reading has also provided an overview of the fault rectification process and an introduction to the concept of acceptance testing. These last two concepts are further covered in Learning Pack ‘Rectify fault and test’.
© State of New South Wales, Department of Education and Training 2006