Reading: Rectify fault and test
Inside this reading:
Planning the rectification process
Planning for system rollback
Acceptance tests
Summary
Planning the rectification process
You have probably already learned about the general fault finding process in previous learning packs, including the scientific (cyclic) method for fault finding and the steps needed to rectify a fault.
The scientific method uses logical and systematic steps (procedures) to analyse available information, such as symptoms, keeping what is useful and relevant and discarding what is not. This procedure enables you to draw conclusions and, ideally, arrive at the source of the problem. Generally, the method is repeated (hence cyclic) until the source of the problem has been identified.
The principles of the scientific method are summarised in the following steps:
- Gather Information
- State the Problem
- Form a hypothesis
- Test the hypothesis
- Draw conclusions
- Repeat when necessary
This scientific method underpins cyclic fault finding. You might remember that in Learning Pack ‘Obtaining Fault Finding Tools’, we described cyclic fault finding as featuring eight steps:
- Define Fault
- Gather Details
- Determine Probable Cause for Fault
- Create an Action Plan
- Implement Action Plan
- Observe Result
- Repeat if needed
- Document
This learning pack deals with the last five steps of the cyclic fault finding method, particularly with creating and implementing action plans. Ultimately, action plans are the instruments that enable us to solve faults.
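The eight steps listed above can be pictured as a loop. The following is a minimal Python sketch, not a prescribed tool: the function name cyclic_fault_finding, the try_fix callback and the example causes are all hypothetical, and a real "fix" would involve actual remedial work rather than a simple comparison.

```python
from typing import Callable, List, Optional


def cyclic_fault_finding(
    fault: str,
    candidate_causes: List[str],
    try_fix: Callable[[str], bool],
    max_cycles: int = 5,
) -> dict:
    """Work through probable causes until one fix resolves the fault."""
    log = [f"Define fault: {fault}"]      # step 1: Define Fault
    log.append("Gather details")           # step 2: Gather Details
    resolved_by: Optional[str] = None

    for cycle, cause in enumerate(candidate_causes[:max_cycles], start=1):
        log.append(f"Cycle {cycle}: probable cause = {cause}")    # step 3
        log.append(f"Cycle {cycle}: create and implement plan")   # steps 4 and 5
        fixed = try_fix(cause)                                     # step 6: Observe Result
        log.append(f"Cycle {cycle}: {'resolved' if fixed else 'not resolved'}")
        if fixed:
            resolved_by = cause
            break                          # step 7: repeat only if needed

    log.append("Document outcome")         # step 8: Document
    return {"resolved_by": resolved_by, "log": log}


# Hypothetical example: the second probable cause turns out to be the real one.
result = cyclic_fault_finding(
    "User cannot access network",
    ["unplugged cable", "expired DHCP lease", "faulty switch port"],
    try_fix=lambda cause: cause == "expired DHCP lease",
)
print(result["resolved_by"])
for entry in result["log"]:
    print(entry)
```

Note that the outcome is documented whether or not the fault was resolved, so an unsuccessful cycle still leaves a record for the next attempt.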
Developing an Action Plan
In fault resolution and rectification, an action plan is a summary of the steps to be taken to solve a fault. Action plans needn’t be complicated or lengthy; they simply outline the steps that will be taken to try to solve the fault.
In many circumstances, the action plan is simply suggested by a fault finding tool such as a decision tree. Decision trees are essentially aimed at helping the troubleshooter make decisions and implement actions, depending on possible scenarios.
An action plan will generally have the following characteristics:
- Acknowledges the presence of a fault, providing justification for action to be taken
- Identifies the systems or components affected or impacted
- Identifies the objectives of the plan (i.e. restore optimum functionality)
- Identifies resources needed, including hardware, software, human resources and procedures
- Identifies severity and criticality, hence priority
- Identifies a timeframe for implementation, according to priority
- Identifies any support contracts that might exist and be applicable to the system in question
- Indicates actual remedial steps to be taken. This might include system reconfiguration, re-installation, software patches, component replacement, or consultation with vendors as needed
- Indicates risks, including possible disruptions as a result of remedial action
- Identifies a workaround solution in case the previous steps failed to rectify the fault
The list above describes a very comprehensive plan, covering all the items that should be included in an action plan. Keep in mind that in many cases electronic change management systems will automate many of these steps. This is a good thing; otherwise technicians would spend a great deal of their time formulating action plans.
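As an illustration of how a change management system might hold these items, the characteristics of an action plan could be captured in a simple record. This is only a sketch under stated assumptions: the ActionPlan class, its field names, the 1 to 3 scales and the rule that priority is severity multiplied by criticality are all hypothetical and are not taken from any particular product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionPlan:
    fault_description: str          # acknowledges the fault and justifies action
    affected_systems: List[str]     # systems or components impacted
    objective: str                  # e.g. restore optimum functionality
    resources: List[str]            # hardware, software, people, procedures
    severity: int                   # assumed scale: 1 (low) to 3 (high)
    criticality: int                # assumed scale: 1 (low) to 3 (high)
    timeframe_hours: int            # target time for implementation
    remedial_steps: List[str]       # the actual steps to be taken
    support_contracts: List[str] = field(default_factory=list)
    risks: List[str] = field(default_factory=list)
    workaround: str = ""            # fallback if the steps do not fix the fault

    def priority(self) -> int:
        """Assumed rule: a higher severity x criticality means more urgent."""
        return self.severity * self.criticality

    def missing_items(self) -> List[str]:
        """Names of text or list fields that have not been filled in yet."""
        return [name for name, value in vars(self).items() if value in ("", [])]


# Hypothetical example plan for a mail server fault.
plan = ActionPlan(
    fault_description="Mail server rejects logins",
    affected_systems=["mail01"],
    objective="Restore optimum functionality",
    resources=["technician", "vendor patch"],
    severity=3,
    criticality=2,
    timeframe_hours=4,
    remedial_steps=["Apply patch", "Restart mail service"],
    risks=["Brief mail outage during restart"],
)
print(plan.priority())
print(plan.missing_items())
```

A check such as missing_items is one way to notice that a plan has no workaround or support contract recorded before work begins.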
Action plans are particularly important for faults that have a significant impact on a business as a whole. Trivial faults, such as those considered routine, do not warrant a formal action plan. Routine fault finding is generally performed ad hoc; that is, technicians are able to solve common problems with the help of historical data, knowledge bases and well-documented procedures, without having to resort to special action plans.
Minimum disruption to clients
Regardless of the nature and severity (impact) of a fault, technicians will strive to resolve problems with minimum disruption to clients. Sometimes, disruption cannot be helped, as the fault itself is disruptive enough; however, the remedial steps should be such that disruption is kept to an absolute minimum.
Some of the strategies that could be adopted are summarised below:
- Identify the extent and impact of the fault. You need to know what has been affected by the fault itself and not by the steps you have taken. You must know whether you have made things worse, or whether the symptoms are from a fault you inadvertently caused.
- When formulating an action plan, identify the most effective steps; that is, those actions that would fix the fault and cause minimal disruption
- If systems are unusable, you might isolate them from the rest of the network for testing and troubleshooting. This would avoid troubleshooting activity affecting working systems.
- Liaise with clients to find times that are convenient to them
- If you are dealing with critical infrastructure components such as servers and routers, perform your testing and changes outside business hours
- If components were isolated from the network, be sure to test them fully in a lab or workshop before reintegrating them. Systems with changed configurations might cause unexpected results and new faults
- If running tests on a live network, understand the impact of these tests. Some tests can have very negative effects on performance.
- Have a rollback or back-out plan
Planning for system rollback
Rolling back or backing out is fundamental to effective and efficient troubleshooting. Rollback and back-out plans are the strategies that you might need to implement if things do not work out. If the steps that you took as per your action plan weren’t effective and the fault is not resolved, you need to take a step back, or roll back. You must be able to restore the system to its previous state. If you are not able to roll back, the situation could, in fact, get worse.
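The idea of restoring the previous state can be sketched in a few lines of Python. This is a simplified illustration, assuming the system's configuration can be represented as a dictionary; the function apply_with_rollback and the example router settings are hypothetical, and real systems would rely on backups, snapshots or vendor tools instead.

```python
import copy
from typing import Any, Callable, Dict

Config = Dict[str, Any]


def apply_with_rollback(
    config: Config,
    change: Callable[[Config], None],
    verify: Callable[[Config], bool],
) -> bool:
    """Apply a change; restore the earlier state if it fails verification."""
    snapshot = copy.deepcopy(config)   # record the previous state first
    try:
        change(config)
        ok = verify(config)
    except Exception:
        ok = False                     # a failed change counts as not verified
    if not ok:
        config.clear()                 # roll back to the recorded state
        config.update(snapshot)
    return ok


# Hypothetical example: a change that does not pass verification.
router = {"dns": "10.0.0.1", "mtu": 1500}


def raise_mtu(cfg: Config) -> None:
    cfg["mtu"] = 9000


def link_still_works(cfg: Config) -> bool:
    return cfg["mtu"] <= 1500


resolved = apply_with_rollback(router, raise_mtu, link_still_works)
print(resolved, router)
```

The key design point is that the snapshot is taken before any change is made; without it there is nothing to roll back to.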
If the modifications you introduced have not met the objectives stated by the action plan, then a decision needs to be made about what, if anything, should be done. If the fault is affecting users or parts of the IT infrastructure adversely in new and different ways, a decision might be made to back out the change and remove it from the production environment.
You also must consider some of the issues involved with rolling back a change:
- The amount of effort (time, resources etc) required to perform the rollback
- The effect it might have on other (either planned or already deployed) changes.
- The possibility that users are already using the changed system, although not to the best effect, and removing some functionality that the users have become accustomed to may be worse than leaving it as is.
If you are in a position where you need to implement a rollback, and possibly some emergency measures, you must think about the following questions:
- Has the problem been correctly analysed?
- Has the proposed remedy been adequately tested?
- Has the solution been correctly implemented?
When faced with a possible rollback, it might be better to provide a partial service, which allows the system to be thoroughly tested, than to suspend the service temporarily and then implement the change.
Once the system has been successfully rolled back, you must return to the first step of the Cyclic Fault Finding Method.
Acceptance tests
An ‘Acceptance Test’ can be defined as a formal test conducted to determine whether or not a system satisfies its acceptance criteria and to enable the customer to determine whether or not to accept the system. Acceptance tests may also be known as Functional Tests.
Acceptance tests are common when commissioning new systems, system upgrades or when significant changes and enhancements are implemented.
Acceptance testing in relation to fault finding and rectification is not as detailed and comprehensive as when, for instance, implementing a new corporate application. However, acceptance testing is still necessary in order to close fault finding cases. If an acceptance test fails, then the problem still exists as it did before, or some functionality has not been fully restored.
In general, acceptance testing aims at:
- Designing and building an accurate test environment that models the conditions in production
- Performing user acceptance tests
- Performing controlled pilot testing in the production environment where necessary (if applicable)
- Evaluating acceptance testing results to make a valid decision to move toward declaring a fault resolved
The development of an Acceptance Test involves a number of iterative steps:
- Assess the type of testing required
- Develop the procedures and instructions for testing
- Develop the necessary test scripts
- Execute the test scripts
- Report any defects
- Retest any fixes
Note: test script refers to the series of steps to be taken during testing, and not program scripts, such as Perl or JavaScript, although these could well form part of the testing process.
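Although a test script is a series of steps rather than a program, the execute, report and retest cycle described above can be illustrated with a short Python sketch. Everything here is hypothetical: the execute_script function, the step descriptions and the simulated workstation state are assumptions made for the example only.

```python
from typing import Callable, List, Tuple

# A test step is a description plus a check that returns True when it passes.
TestStep = Tuple[str, Callable[[], bool]]


def execute_script(steps: List[TestStep]) -> List[str]:
    """Run every step in order and return the descriptions of failed steps."""
    return [description for description, check in steps if not check()]


# Simulated state of a workstation after a repair (hypothetical values).
state = {"login_ok": True, "printer_mapped": False}

script: List[TestStep] = [
    ("Log in to the network", lambda: state["login_ok"]),
    ("Print a test page", lambda: state["printer_mapped"]),
]

defects = execute_script(script)       # execute the test script
print("Defects:", defects)             # report any defects

state["printer_mapped"] = True         # a fix is applied
retest = execute_script(script)        # retest any fixes
print("Defects after retest:", retest)
```

The whole script is run again after the fix, not only the failed step, so that a fix which breaks an earlier step is also noticed.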
Acceptance testing addresses step 6 – ‘Observe Result’ (test) of the cyclic fault finding method. You may formulate a test plan by carefully following the 6 steps outlined in Planning the Rectification Process.
Acceptance Test Criteria
Acceptance test criteria define what should be considered when determining fault resolution, correct operation or expected functionality. In relation to fault finding, the criteria might be very simple: the system works as expected! However, acceptance testing goes beyond the basics by formalising the process and getting the user to acknowledge that the fault has been fully fixed. Alternatively, the user acknowledges that satisfactory action has been taken to provide an optimum alternative or workaround.
Criteria are often referred to as metrics. Metrics are statistical information or values that are used as evidence for evaluating the performance of a system.
Imagine that a user could not access the network and logged a call with the company’s help desk. Remedial action is taken and the following criteria are used to ascertain acceptance:
- User able to log in to network
- User able to access e-mail
- User able to access shared files and printers
- User able to access Internet
- User able to access corporate applications such as corporate database reporting systems
- Performance when accessing network as expected
- System logs do not report errors
- Monitoring software does not indicate faults or network errors
- All of the above are true all the time
Quite often, users will be satisfied as soon as they realise that functionality has been re-established. Nonetheless, acceptance criteria are important so that faults are fully resolved the first time and do not re-occur.
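The final criterion in the list, that everything is true all the time, suggests checking the criteria more than once. The sketch below shows one way this could be expressed in Python; the criterion names, the accepted and failed_criteria functions and the sample observations are hypothetical and would in practice come from user feedback, system logs or monitoring software.

```python
from typing import Dict, List

CRITERIA = ["login", "email", "shared_files", "internet", "no_log_errors"]


def failed_criteria(observations: List[Dict[str, bool]], criteria: List[str]) -> List[str]:
    """Criteria that failed, or were not recorded, in at least one observation."""
    return [
        name
        for name in criteria
        if not all(obs.get(name, False) for obs in observations)
    ]


def accepted(observations: List[Dict[str, bool]], criteria: List[str]) -> bool:
    """Accept only if there are observations and every criterion always held."""
    return bool(observations) and not failed_criteria(observations, criteria)


# Two hypothetical observation rounds taken at different times of day.
morning = {name: True for name in CRITERIA}
afternoon = dict(morning, internet=False)   # Internet access dropped later

print(accepted([morning], CRITERIA))
print(accepted([morning, afternoon], CRITERIA))
print(failed_criteria([morning, afternoon], CRITERIA))
```

A single successful check would have closed this case, while the second observation shows that the fault has not been fully resolved.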
Help desk operations might go as far as developing acceptance criteria as standard procedures for dealing with the rectification of routine (common) faults.
The implementation of acceptance testing will ultimately enhance the efficiency and effectiveness of a support operation.
Summary
This reading has enabled you to learn about rectifying system faults and the processes for confirming fault rectification. You have learned about acceptance testing and its importance in confirming that a system has been restored to full functionality, enabling its use without restrictions or limitations. You have learned about implementing strategies for rolling back a system change in case a solution is not forthcoming.
© State of New South Wales, Department of Education and Training 2006