Reading: Rectify Fault and Test

Inside this reading:

Planning the rectification process

Planning for system rollback

Acceptance tests

Summary

Planning the rectification process

You have probably already learned about the fault finding process in general from previous learning packs. You learned about the scientific method for fault finding (cyclic method) and the necessary steps that need to be taken in order to rectify a fault.

The scientific method proposes to use logical and systematic steps (procedures), to analyse available information, such as symptoms, in the hope of finding information that is useful and relevant whilst discarding what is not. This procedure will enable you to draw conclusions and hopefully arrive at the source of the problem. Generally, the method is repeated (cyclic), until the source of the problem has been identified.

The principles of the scientific method are summarised in the following steps:

Gather Information
State the Problem
Form a hypothesis
Test the hypothesis
Draw conclusions
Repeat when necessary

This scientific method underpins cyclic fault finding. You might remember that in Learning Pack ‘Obtaining Fault Finding Tools’, we described cyclic fault finding as featuring eight steps:

Define Fault
Gather Details
Determine Probable Cause for Fault
Create an Action Plan
Implement Action Plan
Observe Result
Repeat if needed
Document
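
The cyclic nature of these steps can be illustrated with a short Python sketch. It is not part of the learning pack's method itself; the function and the workstation example (names such as cyclic_fault_finding, apply_fix and fault_resolved) are hypothetical and only show how the loop repeats until the observed result is satisfactory.

```python
from typing import Callable, List, Tuple


def cyclic_fault_finding(
    probable_causes: List[str],
    apply_fix: Callable[[str], None],
    fault_resolved: Callable[[], bool],
) -> List[Tuple[str, bool]]:
    """Work through probable causes until the fault is resolved.

    Returns a history of (cause, resolved) pairs, which stands in for
    the 'Document' step.
    """
    history = []
    for cause in probable_causes:          # Determine probable cause
        apply_fix(cause)                   # Create and implement action plan
        resolved = fault_resolved()        # Observe result
        history.append((cause, resolved))  # Document
        if resolved:
            break                          # No need to repeat
    return history


# Hypothetical example: a workstation that cannot reach the network.
workstation = {"cable_connected": False, "ip_configured": True}


def apply_fix(cause: str) -> None:
    # Each fix simply sets the matching setting to a healthy state.
    workstation[cause] = True


def fault_resolved() -> bool:
    # The fault counts as resolved when every setting is healthy.
    return all(workstation.values())


history = cyclic_fault_finding(
    ["ip_configured", "cable_connected"], apply_fix, fault_resolved
)
print(history)
```

In this sketch the first probable cause does not resolve the fault, so the cycle repeats with the second one, and the returned history records both attempts.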

This learning pack deals with the last five steps of the cyclic fault finding method, particularly with creating and implementing action plans. Ultimately, action plans are the instruments that enable us to solve faults.

Developing an Action Plan

In terms of fault resolution and rectification, an action plan is a summary of the steps that will be taken to try to solve a fault. Action plans needn’t be complicated or lengthy.

In many circumstances, the action plan is simply suggested by a fault finding tool such as a decision tree. Decision trees are essentially aimed at helping the troubleshooter make decisions and implement actions, depending on possible scenarios.

An action plan will generally have the following characteristics:

Acknowledges the presence of a fault, providing justification for action to be taken
Identifies the systems or components affected or impacted
Identifies the objectives of the plan (for example, restore optimum functionality)
Identifies resources needed, including hardware, software, human resources and procedures
Identifies severity and criticality, hence priority
Identifies a timeframe for implementation, according to priority
Identifies any support contracts that might exist and be applicable to the system in question
Indicates actual remedial steps to be taken. This might include system reconfiguration, re-installation, software patches, component replacement, or consultation with vendors as needed
Indicates risks, including possible disruptions as a result of remedial action
Identifies a workaround solution in case the previous steps failed to rectify the fault
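
As an illustration only, the characteristics listed above could be captured in a simple record so that a plan can be checked for completeness before it is implemented. The following Python sketch assumes a hypothetical ActionPlan class; the field names and the missing_items helper are invented for this example and do not come from any particular change management system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionPlan:
    """Hypothetical record mirroring the characteristics of an action plan."""
    fault_summary: str                  # acknowledges the fault
    affected_systems: List[str]         # systems or components impacted
    objective: str                      # what the plan should achieve
    resources: List[str]                # hardware, software, people, procedures
    severity: int                       # assumed scale: 1 = most severe
    timeframe_hours: int                # time allowed, according to priority
    support_contracts: List[str] = field(default_factory=list)
    remedial_steps: List[str] = field(default_factory=list)
    risks: List[str] = field(default_factory=list)
    workaround: str = ""

    def missing_items(self) -> List[str]:
        """Return the names of items that have not been filled in.

        Support contracts are skipped because none may exist.
        """
        return [
            name
            for name, value in vars(self).items()
            if name != "support_contracts" and value in ("", [], None)
        ]


plan = ActionPlan(
    fault_summary="Users on floor 2 cannot print",
    affected_systems=["print server"],
    objective="Restore printing for floor 2",
    resources=["technician", "spare network cable"],
    severity=2,
    timeframe_hours=4,
    remedial_steps=["Restart print spooler", "Replace network cable"],
    risks=["Queued print jobs may be lost"],
)
print(plan.missing_items())
```

Here the plan has no workaround recorded, so missing_items reports that one item still needs attention before the plan is complete.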

The characteristics listed above describe a very comprehensive plan, with all the items that should be included in an action plan. Keep in mind that in many cases, electronic change management systems will automate many of these steps. Understandably, this is a good thing; otherwise technicians would spend a great deal of their time formulating action plans.

Action plans are particularly important for faults that have a significant impact on a business as a whole. Trivial faults, such as those considered routine, do not warrant a formal action plan. Routine fault finding is generally performed ad hoc; that is, technicians are able to solve common problems assisted by historical data, knowledge bases and well documented procedures, without having to resort to special action plans.

Minimum disruption to clients

Regardless of the nature and severity (impact) of a fault, technicians will strive to resolve problems with minimum disruption to clients. Sometimes, disruption cannot be helped, as the fault itself is disruptive enough; however, the remedial steps should be such that disruption is kept to an absolute minimum.

Some of the strategies that could be adopted are summarised below:

Identify the extent and impact of the fault. You need to know what has been affected by the fault itself and not by the steps you have taken. You must know whether you have made things worse, or whether the symptoms are from a fault you inadvertently caused.
When formulating an action plan, identify the most effective steps; that is, those actions that would fix the fault and cause minimal disruption
If systems are unusable, you might isolate them from the rest of the network for testing and troubleshooting. This would avoid troubleshooting activity affecting working systems.
Liaise with clients to find times that are convenient to them
If you are dealing with critical infrastructure components such as servers and routers, perform your testing and changes outside business hours
If components were isolated from the network, be sure to fully test them in a lab or workshop before reintegrating them. Systems with changed configurations might cause unexpected results and new faults
If running tests on a live network, understand the impact of these tests. Some tests can have very negative effects on performance.
Have a rollback or back-out plan

Planning for system rollback

Rolling back or backing out is fundamental to effective and efficient troubleshooting. Rollback and back-out plans are the strategies that you might need to implement if things do not work out. If the steps that you took as per your action plan weren’t effective and the fault is not resolved, you need to take a step back, or roll back. You must be able to restore the system to its previous state. If you are not able to roll back, the situation could, in fact, get worse.

If the modifications you introduced have not met the objectives stated by the action plan, then a decision needs to be made about what, if anything, should be done. If the fault is affecting users or parts of the IT infrastructure adversely in new and different ways, a decision might be made to back out the change and remove it from the production environment.

You also must consider some of the issues involved with rolling back a change:

The amount of effort (time, resources etc) required to perform the rollback
The effect it might have on other (either planned or already deployed) changes.
The possibility that users are already using the changed system, although not to the best effect, and removing some functionality that the users have become accustomed to may be worse than leaving it as is.

If you are in a position where you need to implement a rollback, and possibly some emergency measures, you must think about the following questions:

Has the problem been correctly analysed?
Has the proposed remedy been adequately tested?
Has the solution been correctly implemented?

When faced with a possible rollback, it might be better to provide a partial service, which allows the system to be thoroughly tested, than to suspend the service temporarily and then implement the change.

Once the system has been successfully rolled back, you must return to the first step of the Cyclic Fault Finding Method.
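
The core idea of this section, being able to restore the previous state when a change does not work out, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the system is represented by a plain dictionary, and the names apply_change_with_rollback, change_mtu and links_still_up are hypothetical. Real systems would rely on backups, snapshots or configuration archives instead.

```python
import copy
from typing import Callable, Dict


def apply_change_with_rollback(
    config: Dict[str, object],
    change: Callable[[Dict[str, object]], None],
    verify: Callable[[Dict[str, object]], bool],
) -> bool:
    """Apply a change and keep it only if verification passes.

    Returns True if the change was kept, False if it was rolled back.
    """
    snapshot = copy.deepcopy(config)  # record the previous state first
    try:
        change(config)                # implement the action plan
        ok = verify(config)           # observe the result
    except Exception:
        ok = False                    # a failed change also triggers rollback
    if ok:
        return True
    config.clear()                    # roll back: restore the previous state
    config.update(snapshot)
    return False


# Hypothetical example: a router setting that turns out to break links.
router_config = {"dns": "10.0.0.1", "mtu": 1500}


def change_mtu(cfg: Dict[str, object]) -> None:
    cfg["mtu"] = 9000


def links_still_up(cfg: Dict[str, object]) -> bool:
    # Assumption for this example: links only work with an MTU up to 1500.
    return cfg["mtu"] <= 1500


kept = apply_change_with_rollback(router_config, change_mtu, links_still_up)
print(kept, router_config)
```

Because the snapshot is taken before anything is modified, the configuration ends up exactly as it started when verification fails, and the technician can return to the first step of the cycle from a known state.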

Acceptance tests

An ‘Acceptance Test’ can be defined as a formal test conducted to determine whether or not a system satisfies its acceptance criteria and to enable the customer to determine whether or not to accept the system. Acceptance tests may also be known as Functional Tests.

Acceptance tests are common when commissioning new systems, system upgrades or when significant changes and enhancements are implemented.

Acceptance testing in relation to fault finding and rectification is not as detailed and comprehensive as when, for instance, implementing a new corporate application. However, acceptance testing is still necessary in order to close fault finding cases. If an acceptance test fails, then the problem still exists as it did before, or some functionality has not been fully restored.

In general, acceptance testing aims at:

Designing and building an accurate test environment that models the conditions in production
Performing user acceptance tests
Performing controlled pilot testing in the production environment where necessary (if applicable)
Evaluating acceptance testing results to make a valid decision to move toward declaring a fault resolved

The development of an Acceptance Test involves a number of iterative steps:

Assess the type of testing required
Develop the procedures and instructions for testing
Develop the necessary test scripts
Execute the test scripts
Report any defects
Retest any fixes

Note: test script refers to the series of steps to be taken during testing, and not to program scripts, such as Perl or JavaScript, although these could well form part of the testing process.

Acceptance testing addresses step 6 – ‘Observe Result’ (test) of the cyclic fault finding method. You may formulate a test plan by carefully following the 6 steps outlined in Planning the Rectification Process.

Acceptance Test Criteria

Acceptance test criteria define what should be considered when determining fault resolution, correct operation or expected functionality. In relation to fault finding, the criteria might be very simple: the system works as expected! However, acceptance testing goes beyond the basics by formalising the process and getting the user to acknowledge that the fault has been fully fixed. Alternatively, the user acknowledges that satisfactory action has been taken to provide an optimum alternative or workaround.

Criteria are often referred to as metrics. Metrics are statistical information or values that are used as evidence for evaluating the performance of a system.

Imagine that a user could not access the network and logged a call with the company’s help desk. Remedial action is taken and the following criteria are used to ascertain acceptance:

User able to log in to network
User able to access e-mail
User able to access shared files and printers
User able to access Internet
User able to access corporate applications such as corporate database reporting systems
Performance when accessing network as expected
System logs do not report errors
Monitoring software does not indicate faults or network errors
All of the above are true all the time
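
Criteria such as those listed above lend themselves to a simple checklist that is run after remedial action. The Python sketch below is only an illustration: run_acceptance_test is a hypothetical helper, and the checks are stand-in functions that return fixed values, where a real help desk would substitute actual tests. Each criterion is checked several times, as a rough stand-in for the requirement that the criteria hold all the time.

```python
from typing import Callable, Dict, List


def run_acceptance_test(
    criteria: Dict[str, Callable[[], bool]],
    rounds: int = 3,
) -> List[str]:
    """Run every check 'rounds' times and return the criteria that failed.

    An empty list means all acceptance criteria were met.
    """
    failed = []
    for name, check in criteria.items():
        for _ in range(rounds):
            try:
                passed = check()
            except Exception:
                passed = False      # a check that crashes counts as a failure
            if not passed:
                failed.append(name)
                break               # one failure is enough for this criterion
    return failed


# Stand-in checks with fixed results, for illustration only.
criteria = {
    "User able to log in to network": lambda: True,
    "User able to access e-mail": lambda: True,
    "System logs do not report errors": lambda: False,
}

failed = run_acceptance_test(criteria)
if failed:
    print("Fault not yet resolved. Failed criteria:", failed)
else:
    print("All acceptance criteria met")
```

In this example one criterion fails, so the fault would not be declared resolved and the case would stay open.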

Quite often, users will be satisfied as soon as they realise that functionality has been re-established. Nonetheless, acceptance criteria are important so that faults are fully resolved the first time and do not re-occur.

Help desk operations might go as far as developing acceptance criteria as standard procedures for dealing with the rectification of routine (common) faults.

The implementation of acceptance testing will ultimately enhance the efficiency and effectiveness of a support operation.

Summary

This reading has enabled you to learn about rectifying system faults and the processes for confirming fault rectification. You have learned about acceptance testing and its importance in confirming that a system has been restored to full functionality, enabling its use without restrictions or limitations. You have learned about implementing strategies for rolling back a system change in case a solution is not forthcoming.
