Reading: Plan for disaster recovery

Inside this reading:

Security requirements review

Organisational guidelines and policies

Identifying critical systems and data

Risk analysis and management

Risk analysis

Risk management

Backup and restoration

Backup procedures

Restore procedures

Disaster recovery plans

Plan contents

Implementing or testing a DRP

Summary

Security requirements review

Networks can range from two linked computers in a room to computers connected all over the world. Computer networks are critical to business; many firms can cease to function if their network fails for any length of time. Networks primarily share resources such as data, hardware peripherals and information. The loss of one computer or server in a network can have disastrous consequences.

When things go wrong, it is important to have a fallback plan that takes account of all shared resources. The plan should anticipate typical events that can affect the average computer network. An organisation needs to be able to recover from disasters quickly, whether a hard drive failing or a major event that knocks out all systems, and to recover with minimal loss.

Organisational guidelines and policies

Most organisations have policies and procedures, in print or online, that detail network security requirements. The network or system administrator must understand these and be able to ensure the network can be protected and recovered (within an acceptable period of time) from a disaster.

Your organisation may also have a service level agreement (SLA) between the network group and users. It will contain information about the availability and recovery of the system or network that you need to know in order to develop restoration services. The SLA should state the level of recovery the organisation expects (to be viable again) and is prepared to pay for.

Areas to broadly consider for data recovery are:

  • the security for data stored on the network
  • confidence in the backup/restore process
  • the need for a risk analysis
  • how long the organisation can (and is prepared to) wait for recovery
  • the need for a disaster recovery plan
  • the system for the recovery and restoring of data
  • the recovery or replacement of the infrastructure (which can include buildings, hardware and networks).

Identifying critical systems and data

To review security requirements you need to identify how critical the network and systems are, and which data or systems are the most critical.

The failure of an air traffic control system, as a first example, could cause loss of life, yet a system is generally considered critical if a major financial loss results from having it fail. Examples include the loss of data needed for day-to-day operations, such as accounts receivable or invoicing information; users being unable to do the tasks from which income is earned, losing both productivity and revenue; or a network, such as the TAB, being down and unable to do business. The shorter the period of time before losses occur, the more critical the system is.

Before undertaking a risk analysis, critical systems need to be formally identified (even if they are already known) and plans made to ensure they can be recovered as soon as possible.

Ideally, each time a new system is proposed, the business case for it should have identified its importance or criticality. A risk analysis should have been done early in the project, to be included in documentation. Project documentation therefore can help identify risk issues that have been raised. If not, then there may be a need for managers to be surveyed to consider the critical nature of their systems.

Systems are made up of software and the data and the network on which they run. Software applications can be replaced, yet data that may have accrued over many years can be unique and irreplaceable. Systems may become more critical at different times. For example, many businesses work to a monthly accounting cycle; losing a financial system at the month’s end may do more damage than losing it in the middle of the month.

Risk analysis and management

Risk analysis

Risk analysis is the formal way of determining the risks to which systems or data are exposed. To undertake a risk analysis you evaluate assets and examine their susceptibility to threats. Of concern are possible commercial losses from asset loss. Computer networks are critical to the operation of a business or organisation, so it is common to also conduct a specific network or system risk analysis.

Disaster recovery planning is one product of risk analysis. The key steps for risk analysis are to:

1. Identify assets to include (hardware or software, buildings, even key staff).

2. Identify threats by determining the events that may affect assets.

3. Consider the probability of each event occurring.

4. Estimate the possible loss that could occur.

5. Consider safeguards to prevent or recover from the event.

6. Carry out a cost benefit analysis of loss versus the cost of a safeguard.

7. Implement safeguards and a recovery plan.

Risk analysis should start after evaluating what security is required for users and data.
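Steps 3, 4 and 6 above lend themselves to simple arithmetic. The following is a minimal sketch in Python, assuming that each event can be given a yearly probability and a dollar loss, and that a safeguard avoids a known fraction of that loss. The class name, field names and all figures are hypothetical and only illustrate the calculation.

```python
from dataclasses import dataclass


@dataclass
class RiskEvent:
    name: str
    annual_probability: float  # chance of the event in a given year (0..1)
    loss_if_occurs: float      # estimated dollar loss from one occurrence
    safeguard_cost: float      # yearly cost of the proposed safeguard
    loss_reduction: float      # fraction of the loss the safeguard avoids (0..1)


def expected_annual_loss(event: RiskEvent) -> float:
    """Steps 3 and 4: probability multiplied by the possible loss."""
    return event.annual_probability * event.loss_if_occurs


def safeguard_net_benefit(event: RiskEvent) -> float:
    """Step 6: expected loss avoided minus the cost of the safeguard."""
    avoided = expected_annual_loss(event) * event.loss_reduction
    return avoided - event.safeguard_cost


# Hypothetical figures, for illustration only.
events = [
    RiskEvent("Server disk failure", 0.20, 50_000, 2_000, 0.9),
    RiskEvent("Fire in server room", 0.01, 400_000, 6_000, 0.5),
]

for event in events:
    benefit = safeguard_net_benefit(event)
    verdict = "saves more than it costs" if benefit > 0 else "costs more than it saves"
    print(f"{event.name}: net benefit ${benefit:,.0f} per year ({verdict})")
```

A negative result does not settle the matter on its own. It is one input to the management decision, alongside less tangible losses that are harder to cost.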

Some organisations are more dependent on IT infrastructure than others and so are more susceptible to loss, should systems fail. Most organisations, no matter their size, could suffer if hardware is stolen or destroyed, since the value of any data lost may be greater than the hardware cost they would recoup from insurance. An organisation will therefore do risk analysis in order to identify:

  • how dependent they are on the network
  • what could go wrong with the network
  • what they may lose
  • what can be done about it.

In most cases the risk analysis leads to a series of recommendations to change procedures and systems. Yet organisations can choose to:

  • ignore the risk event and hope it never happens
  • prevent the event from having an impact
  • allow for the event but speedily recover from it.

All options are valid. What is important in business is that the decision is made after all the facts and options have been carefully analysed.

Network risks

Networks, as assets of a business, comprise:

  • computer hardware
  • network hardware
  • software
  • data
  • people.

Systems that will have the greatest impact on an organisation and will need more protection are described as ‘mission critical’. A major part of risk analysis is the role of the network in business continuity, which is, after all, a primary goal of a business.

Network infrastructure

The network infrastructure is a major area to be assessed for risk; hardware can be stolen, can break down or can be lost to a major disaster, such as fire.

The disaster recovery plan should allow for rapid replacement of critical components, even if they have to be hired or rented. The beauty of networks built from PCs is that they are readily available and easy to replace. Specialised or less common equipment can be cause for concern. The risk analysis should examine all parts of the infrastructure and identify points of failure and areas where the organisation depends on third parties. Wide area networks, for instance, are harder to replace, especially if the disaster has occurred with the telecommunications supplier.

Building security and safeguards are also prime factors for risk analysis. Do staff members need keys to get into the office? Are all visitors accompanied by staff? Are badges always worn? Are sprinkler systems fitted? Is the office area wide open (increasing the possibility of theft)? Are there alarms (in case of fire or theft)?

Business data on the network

Business data can be as critical as infrastructure (though more subject to control since it will not usually depend on third parties).

Two factors of business data that need to be part of risk analysis are:

1. The processing of transactions

2. The storage of transactions or data.

Risks with each need to be considered; while most people in IT know about backup procedures and their importance, risks associated with processing are not always so obvious (sending bills for millions of dollars to pensioners or cheques for two cents, for instance).

During the early stages in selecting and/or designing systems for transaction processing, a risk analysis should have gauged the likely impact of such errors (a systems developer rather than a network administrator would do this). Once we know data is correct, we apply traditional safeguards such as backup and offsite data storage to recover from disasters.

Identifying and costing risk events

Risk analysis may be done at the start of a new project, when selecting new software or hardware, or when changes are planned. It should not be thought of as one major activity but as a regular management process carried out when needed.

Once we know what we are protecting, we can turn to what we are protecting it from. Usually done in a group, this can be a time for imaginative brainstorming to think of everything that might go wrong. Everyone is a villain; storms and tempests are out to get you; computers will break down; cables are there to be cut by diggers…

More formally, events that could badly affect or destroy a network or system can then be classed variously as:

  • Internal (that relate to staff and management) or external (and outside organisational control)
  • Natural (as in natural disasters and ‘acts of God’) or human made (and if so malicious or accidental)
  • Major (usually a complete loss) or minor (an annoyance or temporary loss).

It is normal to concentrate on events that will have a major impact. If you can solve these, then the solution will normally also overcome minor irritations. Impacts may include:

  • loss of money such as theft of equipment
  • loss of commercial data such as theft of data or hacking
  • delay in processing which can impact cash flow
  • inaccurate data that directly affects cash flow or causes problems with customers, suppliers or both
  • low staff morale if they suffer from customer complaints
  • failure to meet legal obligations such as contract clauses or tax requirements.

Major events can cost a company millions and even put it out of business. The risk analysis team needs to put a dollar value on a variety of events (usually with the assistance of accountants, to determine sales, inventory, supplies, profit, etc.), even for less tangible losses, such as loss of customer confidence.

Event probability

Risk analyses will often use probability theory to cost events. Actuarial tables list the probability of being burgled or catching fire and relatively scientific calculations can be done. A simpler approach is to consider whether the event is very likely, possible or unlikely.

For example, the probability of a virus infection in three sites might be assessed as in Table 1.

Table 1: Simple probability scale for virus infection

Situation / Probability
University or college with PCs / Very likely
Standalone PC in professional office / Possible
Research lab with Cray computers / Unlikely

However, even a Cray computer could have a virus written for it, which is why the scale has no ‘impossible’ classification. If something is possible, then ideally it should be safeguarded against.
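One way to put the simple probability scale to work is to combine it with the major or minor classification of events and rank the results. The sketch below is an assumption rather than a standard method: it multiplies the two ratings to get a priority score, and the events and their ratings are hypothetical.

```python
from enum import IntEnum


class Probability(IntEnum):
    # There is deliberately no IMPOSSIBLE level.
    UNLIKELY = 1
    POSSIBLE = 2
    VERY_LIKELY = 3


class Impact(IntEnum):
    MINOR = 1
    MAJOR = 2


def priority(probability: Probability, impact: Impact) -> int:
    """Higher score means the event deserves attention sooner."""
    return int(probability) * int(impact)


# Hypothetical register; the ratings are for illustration only.
register = [
    ("Virus infection on student PCs", Probability.VERY_LIKELY, Impact.MINOR),
    ("Fire in the server room", Probability.UNLIKELY, Impact.MAJOR),
    ("Backup tape stolen", Probability.POSSIBLE, Impact.MAJOR),
]

ranked = sorted(register, key=lambda row: priority(row[1], row[2]), reverse=True)
for name, probability, impact in ranked:
    print(f"{priority(probability, impact)}  {name} ({probability.name}, {impact.name})")
```

With these ratings the stolen tape ranks first, then the virus infection, then the fire. A different scoring rule, such as always placing major events first, would be just as defensible.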

Risk management

When an organisation can identify high-cost and highly likely events, it must decide whether to:

  • ignore them
  • prevent them
  • minimise impact and make plans to recover from them.

With preventative action you attempt to stop the event occurring or to reduce its probability, or you minimise the risk by limiting the likely damage. For example, an extensive sprinkler system will ensure that fire does less damage to premises (at some cost in water damage).

Recovery procedures allow the system to be restored quickly after the event. A ‘hot site’ with a mirror network ready, for instance, allows speedy recovery after a fire has gutted a building. A disaster plan will comprise a range of contingency plans, and the recovery and prevention options will vary depending on the threat analysed.

While risk management strategies help ensure business continuity, they can also affect insurance costs, especially for larger companies.

Some more common solutions to problems are outlined in Table 2.

Table 2: Recovery and prevention options, as part of risk management

Problem / Option / Type
Need to get data or software back when it has been destroyed or corrupted / Backup / Recovery
Need to minimise the impact of software bugs and errors / Testing / Prevention
Need to stop unauthorised access and data theft or destruction / User security / Prevention
Need to stop errors in the data / System controls / Prevention
Need to minimise the impact of a major disaster at the main site / Hot sites / Recovery
Need to stop unauthorised access to data / Encryption / Prevention
Need to stop virus attacks / Virus checking software / Prevention
Need to minimise user errors / User training / Prevention
Need to stop software being copied and breaking license agreements / Software keys / Prevention
Need to allow access to data to continue even if a disk fails / Mirrored disks or redundant array of inexpensive disks (RAID) systems / Prevention
Need to stop unauthorised access to data and data destruction / Access rights / Prevention
Need to minimise impact of power loss or spikes and surges / Uninterruptible power supplies (UPS) / Prevention

Management will make business decisions based on the analysis of risk against the cost of safeguards, although most of the items in the table above are basic and affordable ways of managing risk for systems and networks.

Backup and restoration

While a system administrator may not actually carry out backups, they must ensure that procedures and schedules, or automated (usually overnight) processes, exist for regular backup. Restore procedures, on the other hand, are used only when needed, namely when a disaster happens.

Backup procedures

For most organisations it is essential that backup be performed automatically, overnight. This will normally require the use of removable media that has a greater capacity than the volume of data being backed up. A very small site may be able to use recordable CD or DVD.

System files should be regularly backed up, so any backup software used must have this capability (not all backup software provided with operating systems can do this).

The backup process must allow speedy recovery. The easiest way to facilitate recovery is to back up all files every day, but this can take a long time. A method that limits the time taken, while still ensuring quick recovery, combines full backups with differential or incremental backups. For example, if a full backup is done on Friday night, only files that have changed since Friday are backed up on Monday night, and on Tuesday night the files that have changed since Friday are again backed up, and so on. This is a differential backup: each night's backup holds everything changed since the last full backup. (An incremental backup, by contrast, holds only the files changed since the previous backup of any kind, so it is smaller but a restore needs every backup made since the last full one.) While the number of files in a differential backup grows each day, the recovery procedure becomes much easier, since a file will be either on the latest differential backup or on the previous full backup.
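The difference between backing up everything changed since the last full backup (a differential backup) and only what has changed since the previous night's backup (an incremental backup) can be shown with a short Python sketch. It assumes each file's last-modified time is known; the file names, dates and the function name are hypothetical.

```python
from datetime import datetime


def files_to_back_up(modified_times, since):
    """Return the names of files changed after `since`, sorted by name."""
    return sorted(name for name, changed in modified_times.items() if changed > since)


# Hypothetical file names and modification times.
modified_times = {
    "ledger.dat": datetime(2024, 6, 10, 9, 30),    # changed on Monday
    "invoices.dat": datetime(2024, 6, 11, 14, 0),  # changed on Tuesday
    "archive.dat": datetime(2024, 6, 5, 11, 0),    # not changed since before Friday
}

full_backup = datetime(2024, 6, 7, 23, 0)     # Friday night
monday_backup = datetime(2024, 6, 10, 23, 0)  # Monday night

# Tuesday night, differential: everything changed since the last full backup.
differential = files_to_back_up(modified_times, since=full_backup)
# Tuesday night, incremental: only what changed since Monday night's backup.
incremental = files_to_back_up(modified_times, since=monday_backup)

print(differential)  # ['invoices.dat', 'ledger.dat']
print(incremental)   # ['invoices.dat']
```

The differential set is larger, but Tuesday's tape plus Friday's full backup is enough to restore every file. With incremental backups, Monday's tape would be needed as well.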

It is important to keep a record of the different media used, as several generations of full backups will be retained. For instance, the organisation may specify which media are to be retained, such as:

  • the last eight full backups
  • the full backup at the end of each financial month for the previous 12 months
  • the full backup at the end of the financial year for the last two years
  • the last 10 differential backups, after which the tape (or other portable media) may be reused.

All media will need to be identified by a code number and a log of tapes (or other media) will ensure that an item is not reused before its time. Any reports produced by the backup software should also be saved with the backup log, as in the example in Table 3.

Table 3: Backup log

Date of backup / Tape no. used / Type of backup (F/D) / Files backed-up / Errors or comments / Initials
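A backup log of this kind can also be checked by a short program before a tape is reused. The following Python sketch assumes a simplified version of the retention rules: it keeps the newest full backups, the newest differential backups and any full backup the operator has marked as a month-end copy, and it leaves out the year-end rule. The `LogEntry` fields, the `month_end` flag and the tape numbers are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class LogEntry:
    backup_date: date
    tape_no: str
    backup_type: str         # "F" for full, "D" for differential
    month_end: bool = False  # set by the operator for month-end full backups


def tapes_to_retain(log, keep_full=8, keep_diff=10, keep_month_ends=12):
    """Return the set of tape numbers that must not be reused yet."""
    newest_first = sorted(log, key=lambda entry: entry.backup_date, reverse=True)
    fulls = [entry for entry in newest_first if entry.backup_type == "F"]
    diffs = [entry for entry in newest_first if entry.backup_type == "D"]
    month_ends = [entry for entry in fulls if entry.month_end]
    keep = fulls[:keep_full] + diffs[:keep_diff] + month_ends[:keep_month_ends]
    return {entry.tape_no for entry in keep}


# Hypothetical log entries.
log = [
    LogEntry(date(2024, 5, 31), "T01", "F", month_end=True),
    LogEntry(date(2024, 6, 7), "T02", "F"),
    LogEntry(date(2024, 6, 10), "T03", "D"),
    LogEntry(date(2024, 6, 11), "T04", "D"),
    LogEntry(date(2024, 6, 14), "T05", "F"),
]

# Small limits are used here so that the effect is visible with five entries.
retained = tapes_to_retain(log, keep_full=1, keep_diff=1)
reusable = sorted({entry.tape_no for entry in log} - retained)
print(sorted(retained))  # ['T01', 'T04', 'T05']
print(reusable)          # ['T02', 'T03']
```

With the default limits every tape in this short log would be retained, which is the safe behaviour while the library of backups is still being built up.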

The media used for backup must also be secured against disaster. If media are simply placed by the computer on which backups are done, or stored on a shelf in the same room, then they are also subject to theft or damage in a physical disaster. An automated overnight backup can suffer the same fate and be stolen or destroyed along with the computer. Wide area networks can counter this by backing up across the network to a remote site.

Restore procedures

Restore procedures are used when data or a system is no longer available. They may range from restoring a single file from a backup set to completely restoring the system and its data on another computer, at another site.

In the latter case, documentation of the previous backup steps will help ensure that the system restored is the most recent; restoring old versions could in itself lead to a further disaster. Procedures for this should have been documented in the disaster recovery plan (DRP) and need to be followed in such a way that the system is restored in the least possible time and with minimal impact on the users and the organisation.
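Choosing the most recent backups to restore from can be sketched in Python, assuming the backup log records each tape as full (F) or differential (D) as in Table 3. The function name, tape numbers and dates below are hypothetical, and the sketch covers only the choice of tapes, not the restore itself.

```python
from datetime import date


def restore_set(log):
    """Pick the tapes to restore: the latest full, then the latest differential after it.

    `log` is a list of (backup_date, tape_no, backup_type) tuples, where
    backup_type is "F" (full) or "D" (differential).
    """
    fulls = [entry for entry in log if entry[2] == "F"]
    if not fulls:
        raise ValueError("no full backup in the log; nothing to restore from")
    latest_full = max(fulls, key=lambda entry: entry[0])
    later_diffs = [e for e in log if e[2] == "D" and e[0] > latest_full[0]]
    chain = [latest_full]
    if later_diffs:
        chain.append(max(later_diffs, key=lambda entry: entry[0]))
    return [tape_no for _, tape_no, _ in chain]


# Hypothetical log entries.
log = [
    (date(2024, 6, 7), "T02", "F"),
    (date(2024, 6, 10), "T03", "D"),
    (date(2024, 6, 11), "T04", "D"),
]

print(restore_set(log))  # ['T02', 'T04']
```

Monday's differential tape (T03) is not needed, because Tuesday's tape already holds everything changed since the full backup. Working from the log in this way guards against restoring an older version by mistake.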