Microsoft Windows 2000
Windows Cluster Service Troubleshooting and Maintenance
Microsoft Product Support Services White Paper
Written by Martin Lucas, Microsoft Alliance Support
Published on July 14, 2003
Abstract
This white paper discusses troubleshooting and maintenance techniques for the Cluster service that is included with Microsoft® Windows 2000 Advanced Server and Microsoft Windows 2000 Datacenter Server. Because cluster configurations vary, this document discusses techniques generally. Many of these techniques can be applied to different configurations and conditions.
The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.
This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, AS TO THE INFORMATION IN THIS DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.
© 2003 Microsoft Corporation. All rights reserved.
Active Directory, Microsoft, Windows, Windows Server, and Windows NT are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks of their respective owners.
Contents
Introduction
Server Clustering
Chapter 1: Pre-installation
Cluster Hardware Compatibility List
Configuring the Hardware
Installing the Operating System
Configuring Network Adapters
Name Resolution Settings
Configuring the Shared Storage
Configuring SCSI Host Adapters
SCSI Cables
SCSI Termination
Fibre Channel Storage
Drives, Partitions, and File Systems
CD-ROM Drives and Tape Drives
Pre-Installation Checklist
Installation on Systems with Custom Disk Hardware
Chapter 2: Installation Problems
Installation Problems with the First Node
Is the Hardware Compatible?
Is the Shared SCSI Bus Connected and Configured Correctly?
Does the Server Have a Correctly Sized System Paging File?
Do All Servers Belong to the Same Domain?
Is the Primary Domain Controller Accessible?
Are You Installing While Logged On as an Administrator?
Do the Drives on the Shared Bus Appear to Function Correctly?
Are Any Errors Listed in the Event Log?
Is the Network Configured and Functioning Correctly?
Problems Configuring Other Nodes
Are You Specifying the Same Cluster Name to Join?
Is the RPC Service Running on Each System?
Can Each Node Communicate Over Configured Networks?
Are Both Nodes Connected to the Same Network or Subnet?
Why Can You Not Configure a Node After It Was Evicted?
Chapter 3: Post-Installation Problems
The Whole Cluster Is Down
One Node Is Down
Applying Service Packs and Hotfixes
One or More Servers Stop Responding
Cluster Service Does Not Start
Cluster Service Starts but Cluster Administrator Does Not Connect
Group/Resource Failover Problems
Physical Disk Resource Problems
File Share Does Not Go Online
Problems Accessing a Drive
Network Name Resource Does Not Go Online
Chapter 4: Problems Administering the Cluster
Cannot Connect to the Cluster Through Cluster Administrator
Cluster Administrator Stops Responding on Failover
Cannot Move a Group
Cannot Delete a Group
Problems Adding, Deleting, or Moving Resources
Adding Resources
Using the Generic Resource for Non-Microsoft Applications
Deleting Resources
Moving Resources from One Group to Another
Chkdsk and Autochk
Failover and Failback
Failover
Failback
Move Group
Factors That Influence Failover Time
Chapter 5: SCSI Storage
Verifying Configuration
Adding Devices to the Shared SCSI Bus
Verifying Cables and Termination
Chapter 6: Fibre Channel Storage
Verifying Configuration
Adding Devices
Testing Devices
Snapshot Volumes
Chapter 7: Client Connectivity Problems
Clients Have Intermittent Connectivity Based on Group Ownership
Clients Do Not Have Any Connectivity with the Cluster
Clients Have Problems Accessing Data Through a File Share
Clients Cannot Access Cluster Resources Immediately After You Change an IP Address
Clients Experience Intermittent Access
Chapter 8: Print Spooler Problems
About Printer Drivers
Driver Problems
Point and Print
Driver Synchronization
Rolling Upgrade from Windows NT 4.0 to Windows 2000
Migration from Windows NT 4.0 (as Opposed to Upgrade)
Chapter 9: Kerberos Features Available with Windows 2000 Service Pack 3
Publishing Printers to Active Directory
Effects on Microsoft Exchange 2000 Server or Microsoft SQL Server
Effects on File Shares
More Information About the Cluster Service and Kerberos
Chapter 10: DHCP and WINS
WINS
DHCP
Jetpack
DHCP and WINS Reference Material
Chapter 11: File Shares
Basic File Shares
Dynamic File Shares (Share Subdirectories)
DFS Root
Chapter 12: Chkdsk and NTFS
The Importance of Chkdsk
Delaying Chkdsk
Handling Corrupted Volumes
Duration of Chkdsk
Faster Chkdsk
Proactive Ways to Test a Volume
Chkdsk and the Cluster Service
Chkdsk Performance: Windows NT 4.0 vs. Windows 2000
SAN Solutions with Data Snapshot Capability
Things You Can Do to Increase Speed
Related Microsoft Knowledge Base Articles
Chapter 13: Maintenance
Installing Service Packs
Service Packs and Interoperability Problems
Replacing Adapters
Shared Disk Subsystem Replacement
System Backups and Recovery
Administrative Suggestions
Use the Resource Description Field
Remove Empty Groups
Avoid Unnecessary or Redundant Dependencies
Creating Redundancy for Network Adapters
Check the Size of the Quorum Log File
Monitor Performance
Load Balance
Practices to Avoid on a Cluster
Appendix A: Event Messages
Cluster Service Events
Related Event Messages
Appendix B: The Cluster Log File
CLUSTERLOG Environment Variable
Annotated Cluster Log
Version and Service Pack Information
Initialization
Determining the Node ID
Determining the Cluster Service Account That the Node Uses
Trying to Join First
No Join Sponsors Are Available
Reading from the Quorum Disk
Enumerating Drives
Reading from Quolog.log File
Checking for a Quorum Tombstone Entry
Checking the Time to Load a New CLUSDB
Enumerated Networks
Disabling Mixed Operation
Forming Cluster Membership
Identifying Resource Types
Cluster Membership Limit
Cluster Service Started
Log File Entries for Common Failures
Example 1: Disk Failure
Example 2: Duplicate Cluster IP Address
Appendix C: Command-Line Administration
Using Cluster.exe
Basic Syntax
Cluster Commands
Node Commands
Group Commands
Resource Commands
Example of a Batch Job
Index
Introduction
This white paper discusses troubleshooting and maintenance techniques for the Cluster service that is included with Microsoft Windows 2000 Advanced Server and Microsoft Windows 2000 Datacenter Server. Although there is great similarity between these versions and the first implementation of the service, Microsoft Cluster Service (MSCS) version 1.0, this document is specific to Windows 2000. (MSCS was included with Microsoft Windows NT® 4.0.) MSCS and Windows 2000 Advanced Server support a maximum of two servers in a cluster. These are frequently referred to as nodes. Windows 2000 Datacenter Server supports up to four nodes. Because there are many different types of resources that can be managed in a cluster, it may be difficult sometimes for an administrator to determine what component or resource is causing failures. In many cases, the Cluster service can automatically detect and recover from server or application failures. However, sometimes you may have to troubleshoot attached resources, devices, or applications.
This document is based on the original cluster troubleshooting white paper, Microsoft Cluster Server Troubleshooting and Maintenance. That paper was specific to Windows NT Server, Enterprise Edition, version 4.0.
Server Clustering
Clustering is an old term in the computing industry. Many readers think that clustering is a complicated subject because early implementations were large, complex, and sometimes difficult to configure. These early clusters were difficult to maintain unless you had an extensively trained and experienced administrator.
Microsoft extended the capabilities of the Windows NT Server operating system through the Enterprise Edition. Microsoft Windows NT Server, Enterprise Edition, includes MSCS. MSCS adds clustering capabilities to Windows NT to achieve high availability, easier manageability, and greater scalability through server clustering.
Windows 2000 Advanced Server and Windows 2000 Datacenter Server also include these high-availability features through the Cluster service (ClusSvc). In these versions of Windows, the core functions of the Cluster service have not changed from MSCS, although the Windows 2000 implementation includes improvements and new features.
Chapter 1: Pre-installation
Cluster Hardware Compatibility List
Most hardware configurations on the Microsoft hardware compatibility list (HCL) use industry-standard hardware and avoid proprietary solutions, so that you can easily add or replace hardware. Supported configurations will use only hardware that was validated ("certified") by the Cluster Server Hardware Compatibility Test (HCT). These tests go beyond the standard compatibility testing for Microsoft Windows, and they are quite intensive. Microsoft supports the Windows Cluster service only when this feature is used on a validated cluster configuration. Validation is available only for complete configurations tested as a unit. Visit the following Microsoft Web sites to view the cluster hardware compatibility list:
http://www.microsoft.com/windows2000/advancedserver/howtobuy/upgrading/compat/search/computers.asp
http://www.microsoft.com/windows2000/datacenter/hcl/
Note Storage area network (SAN) solutions for simultaneous use with multiple clusters require a separate certification.
Configuring the Hardware
The Cluster service installation process relies on correctly configured hardware. Therefore, make sure that you configure and test each device before you try to configure the Cluster service. A typical cluster configuration includes:
· Two servers.
· At least two Peripheral Component Interconnect (PCI) network adapters for each server.
· Local storage.
· One or more storage systems for external storage, with both servers remaining connected to this storage. This may be a common small computer system interface (SCSI) bus that is separate from any local storage, or it may be Fibre Channel storage.
Although you can configure a cluster that uses only one network adapter in each server, Microsoft strongly recommends that you have a second isolated network for cluster communications. Validated cluster solutions contain at least one isolated network for cluster communications. This is referred to as the private interconnect. You can also configure the cluster to use the primary nonisolated network for cluster communications if the isolated network fails.
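After you configure the network adapters, a quick connectivity check from each node can confirm that the private interconnect works before you install the Cluster service. This is only a sketch; the addresses shown here (10.0.0.1 and 10.0.0.2) are sample values for the private network adapters, so substitute the addresses that you actually assigned:
   ipconfig /all
   ping -n 10 10.0.0.2
Run ipconfig /all on each node to confirm that the public and private adapters have the expected addresses, and then ping the private address of the other node from each node.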
The cluster nodes must communicate with each other on a time-critical basis. Communication between nodes is sometimes referred to as the heartbeat. It is important that the heartbeat packets are sent and received on schedule, so Microsoft recommends that you use only PCI-based network adapters, because the PCI bus has the highest priority. Fault-tolerant network adapters are not supported on the private interconnect. During port failure recovery, fault-tolerant network adapters can delay heartbeat packets significantly and actually cause cluster node failure. For redundancy of the private interconnect, it is more effective to form a second isolated network to function if the primary private interconnect fails.
Figure 1
Cluster storage connects to each server through a compatible PCI-based storage adapter that is separate from the adapter used for local storage. Each server in the cluster is connected to the storage that is allocated specifically for cluster use. When you use SCSI technology for cluster storage, each cluster uses at least one SCSI adapter that is dedicated for use with the intended external cluster storage. Because both servers in the cluster connect to the same bus at the same time, one SCSI host adapter uses the default ID, 7, and the other adapter uses ID 6. This configuration makes sure that the host adapters have the highest priority on the bus. The bus is referred to as the shared SCSI bus, because both systems share connectivity on this bus but arbitrate (negotiate) for exclusive access to one or more attached disk devices. The Cluster service controls exclusive access to the disk devices through the SCSI reserve and release commands.
Fibre Channel storage is frequently used as the storage medium for highly available clusters. Common implementations use Fibre Channel Arbitrated Loop (FC-AL) or Switched Fibre Channel Fabrics (FC-SW). Although the word fibre suggests that the technology uses fiber optic cables, the Fibre Channel specifications allow use of fiber optic or copper cable interconnects.
Before you configure the Cluster service, see the documentation that is included with the validated cluster hardware for installation guidelines and instructions. Some configurations may require a different sequence of installation steps than the steps that are described in the Cluster service documentation.
Installing the Operating System
Before you install Windows 2000, you must decide what role each computer will have in the domain. Will the servers participate in a Windows NT 4.0 domain, or will they be in a domain that uses Microsoft Windows 2000 Active Directory® directory service? A cluster node can be a member server or a domain controller.
The member server role for each cluster node is a possible solution but has several potential disadvantages. Node-to-node communication and various registry operations in the cluster require authentication from the domain, so even during ordinary operations a node may need to be authenticated. Member servers rely on domain controllers on the network for this authentication. Lack of connectivity with a domain controller may severely affect performance and may also cause one or more cluster nodes to stop responding until a connection with a domain controller is reestablished. In a worst-case scenario, loss of network connectivity with domain controllers may cause complete failure of the cluster. Make sure that one or more domain controllers are always available.
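One quick way to confirm that a node can locate a domain controller is the Nltest tool, which is included with the Windows 2000 Support Tools. The domain name in this example (CONTOSO) is only a placeholder; substitute your own domain name:
   nltest /dsgetdc:CONTOSO
If the command returns the name and address of a domain controller, the node can locate a domain controller for authentication. This is only a spot check; it does not replace ongoing monitoring of domain controller availability.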