IBM® System i™ System Storage™ DS8000™ Recovery Handbook
This document can be found on the web,
Search for document number WPxxxxxx under the category of “White papers”.
Version 1.0
September 5th, 2007
IBM ATS System Storage Europe
Ingo Dimmer
Purpose
The purpose of this recovery handbook is to provide handy reference information to storage administrators for troubleshooting the IBM® System Storage™ DS8000™ in a System i™ CopyServices environment.
This white paper focuses on troubleshooting volume access problems and FlashCopy®, MetroMirror and GlobalMirror problems, including failover/failback. It is meant to provide guidance in addition to the official IBM product documentation to help quickly diagnose and recover from failure situations.
To gain the most benefit from this recovery handbook, it is suggested that this document be taken as a template for developing a customized version specific to the customer’s current System i, SAN and DS8000 storage configuration.
In addition to the provided technical procedures for failure isolation and recovery, it is strongly recommended that customers using a disaster recovery or high availability setup develop their own decision criteria for switching to the disaster recovery or backup site. Augmenting the technical procedures with unambiguous site swap decision criteria, defined responsibilities and duration targets is important to help minimize the overall recovery time. Defined duration targets that support the decision for a site swap should include the effort for checking recovery site data consistency and for failure analysis, so that the expected recovery time for the production site can be compared with the known recovery time for unplanned site outages.
Disclaimer Notice & Trademarks
THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT.
IBM shall have no responsibility to update this information.
While IBM has reviewed each item for accuracy in a specific situation, there is no guarantee that the same or similar procedure will work elsewhere.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents or copyrights. Inquiries regarding patent or copyright licenses should be made, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.
IBM, the IBM logo, System Storage, FlashCopy, System i, System i5 and i5/OS are trademarks of International Business Machines Corporation in the United States, other countries, or both.
Other company, product and service names may be trademarks or service marks of others.
Table of Contents
Purpose
Disclaimer Notice & Trademarks
1 Basic Error Determination and Trouble Shooting
1.1 Display DS8000 Serviceable Events
1.2 Reviewing DS8000 SNMP Traps for CopyServices Events
1.3 Troubleshooting Volume Access Problems
1.3.1 From System i Side
1.3.2 From DS8000/HMC Side
1.3.3 From SAN Fabric Side
1.4 Troubleshooting FlashCopy Problems
1.5 Troubleshooting MetroMirror Problems
1.5.1 Volume Suspends or Consistency Group Freezes
1.5.2 Path Failures
1.5.3 Primary Site Disaster Scenarios
1.5.4 MetroMirror Failover/Failback Procedures
1.6 Trouble Shooting GlobalMirror Problems
1.6.1 GM Session, PPRC Path and GlobalCopy Failures
1.6.2 GlobalMirror Session Failover/Failback Procedures
1.6.3 Failing over a Subset of Volumes via Pausing the GM Session
1.7 Problem Data Collection
1.7.1 From System i Side
1.7.2 From DS8000 Side
1.7.3 From SAN Side
1.8 References
1 Basic Error Determination and Trouble Shooting
1.1 Display DS8000 Serviceable Events
A serviceable event is created on the DS8000 HMC for each storage unit problem.
1) Log on to the DS8000 HMC with userID customer and password cust0mer
2) Display serviceable events on the DS8000 HMC via Service Applications → Service Focal Point → Manage Serviceable Events.
An example is shown in Figure 1 below:
Figure 1: DS8000 Serviceable Events
3) Refer to the DS8000 Service Information Center → Messages and codes → Entry table for all messages and codes for further information:
Note: If the DS8000 HMC has been configured for call-home outbound communication, it will automatically open a Problem Management Hardware (PMH) record for a DS8000 serviceable event that needs further attention from IBM support. The Serviceable Events window (see Figure 1) will show the problem reference number (PMH #), which the IBM remote support service representative will take care of.
1.2 Reviewing DS8000 SNMP Traps for CopyServices Events
For DS8000 CopyServices events specifically, it is useful to review the information provided with the following SNMP traps:
- Trap 1xx for PPRC link events
  - Trap 100: Remote mirror and copy links degraded*
  - Trap 101: Remote mirror and copy links are inoperable*
  - Trap 102: Remote mirror and copy links are operational
- Trap 20x for PPRC volume events
  - Trap 200: LSS pair consistency group remote mirror and copy pair error*
  - Trap 202: Primary remote mirror and copy devices on the LSS were suspended because of an error*
- Trap 210-220 for GlobalMirror events
  - Trap 210: Global Mirror initial consistency group successfully formed
  - Trap 211: Global Mirror session is in a fatal state
  - Trap 212: Global Mirror consistency group failure - Retry will be attempted
  - Trap 213: Global Mirror consistency group successful recovery
  - Trap 214: Global Mirror master terminated
  - Trap 215: Global Mirror FlashCopy at remote site unsuccessful
  - Trap 216: Global Mirror slave termination unsuccessful
  - Trap 217: Global Mirror paused
  - Trap 218: Global Mirror number of consistency group failures exceed threshold
  - Trap 219: Global Mirror first successful consistency group after prior failures
  - Trap 220: Global Mirror number of FlashCopy commit failures exceed threshold
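When SNMP traps are forwarded to a customer-provided SNMP manager, the trap list above can be kept as a small lookup table for annotating incoming trap numbers. The following Python lines are a minimal sketch only: the trap numbers and descriptions are copied from the list above, while the function names and the grouping by number range are illustrative assumptions and not part of any IBM tool.

```python
# Trap numbers and descriptions are copied from the list in this section.
# The function names and the range-based grouping are illustrative assumptions.
SNMP_TRAPS = {
    100: "Remote mirror and copy links degraded",
    101: "Remote mirror and copy links are inoperable",
    102: "Remote mirror and copy links are operational",
    200: "LSS pair consistency group remote mirror and copy pair error",
    202: "Primary remote mirror and copy devices on the LSS were suspended because of an error",
    210: "Global Mirror initial consistency group successfully formed",
    211: "Global Mirror session is in a fatal state",
    212: "Global Mirror consistency group failure - Retry will be attempted",
    213: "Global Mirror consistency group successful recovery",
    214: "Global Mirror master terminated",
    215: "Global Mirror FlashCopy at remote site unsuccessful",
    216: "Global Mirror slave termination unsuccessful",
    217: "Global Mirror paused",
    218: "Global Mirror number of consistency group failures exceed threshold",
    219: "Global Mirror first successful consistency group after prior failures",
    220: "Global Mirror number of FlashCopy commit failures exceed threshold",
}


def trap_group(trap_id: int) -> str:
    """Map a trap number to the event group used in this section."""
    if 100 <= trap_id <= 199:
        return "PPRC link event"
    if 200 <= trap_id <= 209:
        return "PPRC volume event"
    if 210 <= trap_id <= 220:
        return "GlobalMirror event"
    return "unknown"


def describe_trap(trap_id: int) -> str:
    """Return a one-line description for a received trap number."""
    text = SNMP_TRAPS.get(trap_id, "not listed in this handbook")
    return f"Trap {trap_id} [{trap_group(trap_id)}]: {text}"


print(describe_trap(217))
```

Reason codes delivered with a trap are not decoded by this sketch; they still need to be looked up in the product documentation.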
Refer to the DS8000 Information Center → Troubleshooting → Generic and specific alert traps for further information such as CopyServices event reason codes:
Note: Setting up the DS8000 HMC for SNMP notification to a customer-provided SNMP manager software application is highly recommended, especially in System i DS8000 CopyServices environments, because SNMP is the only way in which DS8000 CopyServices events can be reported in an OpenSystem host environment.
1.3 Troubleshooting Volume Access Problems
Storage access loss problems can originate from any component in the I/O chain, which consists of the System i server, the SAN environment and the DS8000 storage subsystem. For a thorough analysis, all of these components should be checked independently as described below.
1.3.1 From System i Side
Refer to the following subsections for failure isolation:
- Access loss to a SYSBAS disk unit → see section 1.3.1.1
- Access loss to an IASP disk unit → see section 1.3.1.2
- Loss of a redundant path to a multi-path disk unit → see section 1.3.1.3
System i storage access loss problems are logged in the System i Product Activity Log (PAL) and/or QSYSOPR message queue.
Note: Use iSeries Navigator → My Connections → systemname → Basic Operations → Printer Output to easily transfer a spool file to a PC for problem data collection.
1.3.1.1 Access Loss to a SYSBAS Disk Unit
Loss of access to one or more SYSBAS disk units is indicated by SRC A6xx0255 or A6xx0266 being posted and System i entering a freeze state. Regaining access to the missing disk unit(s) is critical for System i to become operational again. Once access has been restored, System i automatically resumes operation from the point where it lost access. Otherwise it remains in its freeze state indefinitely, and the only recovery is to power down and restore the whole system from backup. The following steps describe how to get information about the missing SYSBAS disk unit(s) when System i has entered a freeze state:
1) Log on to the System i5 HMC
2) Select the menu Server and Partition → Server Management, right-click the i5/OS partition which has SRC A6xx0255/0266 posted and select Properties
3) In the “Partition Properties” window select the tab Reference Code, select the current A6xx0255/0266 reference code from the list and click on Details
4) Word 8 of the “Reference Code Details” shows the volume S/N of one of the missing disk unit(s), as shown for DS8000 LUN ID 0x1000 in Figure 2 below. Word 9 indicates whether the last operational FC path was lost (SRC 21073002) or whether access was lost to the volume itself (SRC 21073100). This information can be used for further failure isolation from the SAN and DS8000 side.
Figure 2: System i Reference Code Details
1.3.1.2 Access Loss to an IASP Disk Unit
An access loss to an IASP disk unit causes an SRC B6000266 PAL entry and an automatic vary-off of the IASP after 20 min., indicated by message CPIB711 “ASP device xxxx failed.” in the QSYSOPR message queue.
1) Review the System i Product Activity Log (see section 1.3.1.5) and display details for the SRC B6000266 entry, which shows the DS8000 volume S/N to which access was lost.
2) Refer to sections 1.3.2 and 1.3.3 for further failure isolation from the DS8000 and SAN side.
3) After recovery from the IASP disk unit access loss, try to vary on the IASP again via VRYCFG CFGOBJ(IASP_name) CFGTYPE(*DEV) STATUS(*ON)
1.3.1.3 Loss of a Redundant Path to a Multi-Path Disk Unit
A lost path to a multi-path disk unit is indicated by message ID CPPEA33 “Warning - An external storage subsystem disk unit connection has failed.” and/or CPI096E "Disk unit connection is missing", posted for every device of the lost path (re-posted every hour), and by an SRC 21073002 PAL entry. Perform the following steps to help isolate a lost path problem:
1) Review the QSYSOPR message queue (see section 1.3.1.4) for message CPPEA33 to get the resource name DMPxxx for a disk unit with a failing path.
2) Access System Service Tools to get the System i IOA’s physical location and WWPN of the failing FC path by issuing the i5/OS command STRSST and selecting 1. Start a service tool → 7. Hardware service manager → 3. Locate resource by resource name. Enter the DMPxxx resource name, select option 8=Associated packaging resource(s), then option 5=Display detail to get the System i IOA physical location displayed by the Unit ID and Card fields and the WWPN displayed by the Worldwide Port Name field.
3) Locate the System i IOA of the failed FC path in System Service Tools’ Hardware Service Manager → 1. Packaging hardware resources and ensure that the IOP/IOA is in "operational" state. If not, try an IOP reset/re-IPL and, if needed, engage your service provider for further assistance.
4) Ensure the status LEDs of the System i FC IOA are solid green and flashing yellow (link up). Flashing green indicates that the link is down, and other states typically indicate a HW problem.
5) Ensure the affected System i IOA is logged into the SAN and DS8000
(DSCLI command lshostconnect -login may not represent the current login status unless the port is reset by switching to and back from another topology using setioport -topology [fc-al | scsi-fcp] port_ID; Brocade command switchShow; Cisco command show flogi database). If not, ensure the switch FC port and the DS8000 FC port are in "online" status (DSCLI command lsioport).
6) For any recovered lost path, verify it has been recognized by i5/OS via message CPPEA35 “Informational only. A connection to an external storage subsystem disk unit has been restored.” or an SRC 27873140 PAL entry.
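As a quick reference, the reference codes named in sections 1.3.1.1 to 1.3.1.3 can be collected in a small lookup. The following Python lines are a sketch under stated assumptions: the codes and their meanings are taken from the text above, the "xx" in A6xx0255/A6xx0266 is assumed to stand for any two hexadecimal digits, and the function name is hypothetical.

```python
import re

# (pattern, meaning) pairs; codes and meanings come from sections 1.3.1.1 to 1.3.1.3.
# Assumption: "xx" in A6xx0255 / A6xx0266 stands for any two hexadecimal digits.
SRC_MEANINGS = [
    (r"A6[0-9A-F]{2}0255", "SYSBAS disk unit access lost, System i in freeze state"),
    (r"A6[0-9A-F]{2}0266", "SYSBAS disk unit access lost, System i in freeze state"),
    (r"B6000266", "IASP disk unit access lost, IASP is varied off after 20 min."),
    (r"21073002", "FC path lost"),
    (r"21073100", "Access lost to the volume itself"),
    (r"27873140", "Connection to an external storage subsystem disk unit restored"),
]


def explain_src(src: str) -> str:
    """Return the meaning of a System i reference code as used in this handbook."""
    code = src.strip().upper()
    for pattern, meaning in SRC_MEANINGS:
        if re.fullmatch(pattern, code):
            return meaning
    return "not covered in this section"


for code in ("A6010266", "21073002", "27873140"):
    print(code, "->", explain_src(code))
```

The lookup only restates what this handbook says about each code; the PAL entry details remain the authoritative source for the affected volume S/N and resource names.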
1.3.1.4 Reviewing the i5/OS System Operator Message Queue
1) Log on to the i5/OS system and display the system operator message queue by issuing the command DSPMSG QSYSOPR
2) Use option 5=Display details to display details for a selected message
3) To print out the QSYSOPR message queue for problem data collection, issue the command DSPMSG MSGQ(QSYSOPR) OUTPUT(*PRINT)
Note: This print-out doesn’t include the message details.
1.3.1.5 Reviewing the System i Product Activity Log
1) Access System Service Tools (SST) by using the command STRSST
2) Select 1. Start a service tool from the "System Service Tools (SST)" screen
3) Select 1. Product activity log from the "Start a Service Tool" screen
4) Select option 1. Analyze log from the "Product Activity Log" screen
5) Enter "3" for Log (3 = Magnetic media log) and the time frame of the log in the "Select Subsystem Data" screen
6) Enter "3" for Report type (3 = Print options) and "Y" for including optional statistical entries in the "Select Analysis Report Options" screen
7) Enter "4" for Report type (4 = Print full report) and "Y" for including hexadecimal data in the "Select Options for Printed Report" screen
8) Press F3 repeatedly and ENTER to exit from SST
9) Display the generated PAL spool file by running the command DSPSPLF FILE(QPCSMPRT) SPLNBR(*LAST)
1.3.1.6 Creating a System i HSM System Configuration List Printout
Having a current System i Hardware Service Manager (HSM) system configuration list printout available on paper is highly recommended as reference information to easily associate System i disk resource names with the corresponding DS8000 volume S/N and the System i IOA WWPN.
Use the following steps to create an HSM configuration list (see Figure 3):
1) Access System Service Tools by using the command STRSST
2) Select option 1. Start a service tool from the "System Service Tools (SST)" screen
3) Select option 7. Hardware Service Manager from the "Start a Service Tool" screen
4) Press F6=Print configuration from the "Hardware Service Manager" screen
5) Select the default "Format" option 1=132 characters wide and "Information printed" option 1=Packaging resources sorted by location and press ENTER
6) Press F3 repeatedly and ENTER to exit from SST
7) To ease problem determination for System i access loss problems, store a soft copy of the HSM system configuration list on another system and keep a paper printout of it together with this recovery handbook.
Figure 3: Example Excerpt from HSM System Configuration List
1.3.2 From DS8000/HMC Side
The following items should be verified from the DS8000 side to help isolate a System i access loss problem:
1) Verify that all Fibre Channel IOAs from the affected System i LPAR are logged into the DS8000
(DSCLI command lshostconnect -login may not represent the current login status unless the port is reset by switching to and back from another topology using setioport -topology [fc-al | scsi-fcp] port_ID). If not, ensure the DS8000 FC ports are in "online" status (DSCLI commands lsioport, lshba storage_image_ID) and there is no SAN connectivity problem (see section 1.3.3)
2) Verify all System i DS8000 volumes are in “online/normal” state
(DSCLI command lsfbvol)
3) Verify relevant DS8000 resources are in online or normal state, as applicable
(DSCLI commands: lsrank, lsarray, lsddm storage_image_ID, lsda storage_image_ID;
an IBM storage CE may check the DS8000 resource states on the HMC by selecting Service Applications → Service Focal Point → Service Utilities, highlighting SF, then Selected → View Storage Facility State (end of call),
selecting any "FAILED" test and clicking on Details for further information)
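The volume state check in step 2 can be scripted once the lsfbvol output has been captured to text. The following Python lines are a minimal sketch, assuming the output was produced in a comma-delimited format with a header row containing the columns ID, accstate, datastate and configstate; these column names, the function name and the sample data are assumptions for illustration and should be verified against the output of the DSCLI version in use.

```python
import csv
import io

# Assumed healthy combination of access, data and configuration state.
EXPECTED = ("online", "normal", "normal")


def volumes_needing_attention(lsfbvol_text, delim=","):
    """Return (volume ID, states) for every volume not in Online/Normal/Normal state.

    Assumes delimited text with a header row naming the columns
    ID, accstate, datastate and configstate.
    """
    reader = csv.DictReader(io.StringIO(lsfbvol_text.strip()), delimiter=delim)
    flagged = []
    for row in reader:
        states = tuple(
            row[col].strip() for col in ("accstate", "datastate", "configstate")
        )
        if tuple(s.lower() for s in states) != EXPECTED:
            flagged.append((row["ID"].strip(), states))
    return flagged


# Invented sample data, not captured from a real DS8000.
SAMPLE = """\
Name,ID,accstate,datastate,configstate,deviceMTM
vol_a,1000,Online,Normal,Normal,2107-A05
vol_b,1001,Fenced,Normal,Normal,2107-A05
"""

print(volumes_needing_attention(SAMPLE))
```

An empty result means every listed volume reported the expected states; any flagged volume ID can then be matched against the volume S/N reported on the System i side.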
1.3.3 From SAN Fabric Side
Perform the following steps as a sanity check to isolate System i access loss from the SAN perspective:
1) Ensure that both the System i IOA and its corresponding DS8000 host adapter are logged into the SAN fabric
(Brocade command switchShow; Cisco command show flogi database; use DSCLI command lsioport to get the DS8000 adapter WWPN)
2) Ensure the switch zoning is correct so that System i host initiator and DS8000 storage target can “see” each other
(Brocade command cfgShow; Cisco command show zoneset active)
1.4 Troubleshooting FlashCopy Problems
Because FlashCopy is a DS8000 internal CopyServices function, the possibilities for troubleshooting from a user perspective are limited.
Nonetheless, check the following items, mainly to help exclude a user error with FlashCopy in the context of System i:
1) For a FlashCopy establish failure with DS8000 message ID “CMUN03049E mkflash: source_volID:target_volID: Copy Services operation failure: incompatible volumes”, ensure that both the FlashCopy source and target volumes are the same i5/OS volume model, i.e. they have the same capacity and protection mode, which is either “protected” (models A0x) or “unprotected” (models A8x)
(DSCLI command lsfbvol; volume model information is shown by “DeviceMTM” column output)
2) For a FlashCopy establish failure with DS8000 message ID “CMUN03035E mkflash: source_volID:target_volID: Copy Services operation failure: feature not installed”, ensure that the FlashCopy license key (PTCs feature #72xx) is installed
(check via DSCLI command lskey).
3) For a System i backup host IPL or an IASP vary-on failure from FlashCopy target volumes, verify that the following holds true:
- The FlashCopy from SYSBAS or the entire System i disk space was taken while the System i production host accessing the FlashCopy source volumes was powered off. For a FlashCopy from a System i independent auxiliary storage pool (IASP), the IASP was varied off before establishing or re-synchronizing its FlashCopy relationships.
This is the only way to ensure that all System i modified data in memory is flushed to disk storage for a consistent state and a clean IPL or IASP vary-on from the backup host accessing the FlashCopy target volumes.
- Unless it is a GlobalMirror B to C volume relationship, the FlashCopy relationship was NOT created using the target write inhibit mode
(DSCLI command lsflash source_volID; “TargetWriteEnabled” column output should show “Enabled”)
4) For a FlashCopy establish failure ensure the following:
- Both FlashCopy source and target volumes are in “online/normal” state
(DSCLI command lsfbvol)
- There is no violation of the rule that a FlashCopy target volume can be in only one FlashCopy relationship.
If the specified source is already used as a FlashCopy target volume, message ID “CMUN03008E mkflash: source_volID:target_volID: Copy Services operation failure: cascading FlashCopy prohibited” is posted. If the specified target is already used as target volume in an existing FlashCopy relationship, message ID “CMUN03042E mkflash: source_volID:target_volID: Copy Services operation failure: already a FlashCopy target” is posted.
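The message IDs discussed in this section can be turned into a small lookup that points from a failed mkflash message to the check described above. The following Python lines are a sketch only: the four message IDs and the suggested checks are taken from this section, while the function name and the regular expression used to find the message ID are illustrative assumptions.

```python
import re

# Message IDs and checks are taken from section 1.4 of this handbook.
FLASHCOPY_CHECKS = {
    "CMUN03049E": "incompatible volumes: check that source and target are the "
                  "same i5/OS volume model (lsfbvol, DeviceMTM column)",
    "CMUN03035E": "feature not installed: check the FlashCopy license key (lskey)",
    "CMUN03008E": "cascading FlashCopy prohibited: the source is already a "
                  "FlashCopy target",
    "CMUN03042E": "already a FlashCopy target: the target is already used in "
                  "another FlashCopy relationship",
}


def flashcopy_hint(message: str) -> str:
    """Extract the DS8000 message ID from an error line and return the related check."""
    # Assumed ID shape: "CMUN", five digits, one severity letter.
    match = re.search(r"CMUN\d{5}[A-Z]", message)
    if match is None:
        return "no DS8000 message ID found"
    msg_id = match.group(0)
    return FLASHCOPY_CHECKS.get(msg_id, f"{msg_id}: not covered in section 1.4")


print(flashcopy_hint(
    "CMUN03042E mkflash: 1000:1100: Copy Services operation failure: "
    "already a FlashCopy target"
))
```

Message IDs that are not listed here are reported as such, so the lookup never suggests a check for a message this handbook does not discuss.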
1.5 Troubleshooting MetroMirror Problems
Refer to the corresponding subsection below to troubleshoot PPRC volume, path or primary site disaster failures.
1.5.1 Volume Suspends or Consistency Group Freezes
Perform the following steps to troubleshoot DS8000 PPRC suspend or consistency group freeze problems:
1) Review the SNMP trap 202 (or 200) message to find out the suspend reason code*
2) Check the current PPRC volume pair states on the primary and secondary DS8000
(DSCLI command lspprc source_vol_ID:target_vol_ID)