Microsoft Windows 2000

Windows Cluster Service Troubleshooting and Maintenance

Microsoft Product Support Services White Paper

Written by Martin Lucas, Microsoft Alliance Support

Published on July 14, 2003

Abstract

This white paper discusses troubleshooting and maintenance techniques for the Cluster service that is included with Microsoft® Windows 2000 Advanced Server and Microsoft Windows 2000 Datacenter Server. Because cluster configurations vary, this document discusses techniques generally. Many of these techniques can be applied to different configurations and conditions.


The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

© 2003 Microsoft Corporation. All rights reserved.

Active Directory, Microsoft, Windows, Windows Server, and Windows NT are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

The names of actual companies and products mentioned herein may be the trademarks of their respective owners.

Contents

Introduction

Server Clustering

Chapter 1: Pre-installation

Cluster Hardware Compatibility List

Configuring the Hardware

Installing the Operating System

Configuring Network Adapters

Name Resolution Settings

Configuring the Shared Storage

Configuring SCSI Host Adapters

SCSI Cables

SCSI Termination

Fibre Channel Storage

Drives, Partitions, and File Systems

CD-ROM Drives and Tape Drives

Pre-Installation Checklist

Installation on Systems with Custom Disk Hardware

Chapter 2: Installation Problems

Installation Problems with the First Node

Is the Hardware Compatible?

Is the Shared SCSI Bus Connected and Configured Correctly?

Does the Server Have a Correctly Sized System Paging File?

Do All Servers Belong to the Same Domain?

Is the Primary Domain Controller Accessible?

Are You Installing While Logged On as an Administrator?

Do the Drives on the Shared Bus Appear to Function Correctly?

Are Any Errors Listed in the Event Log?

Is the Network Configured and Functioning Correctly?

Problems Configuring Other Nodes

Are You Specifying the Same Cluster Name to Join?

Is the RPC Service Running on Each System?

Can Each Node Communicate Over Configured Networks?

Are Both Nodes Connected to the Same Network or Subnet?

Why Can You Not Configure a Node After It Was Evicted?

Chapter 3: Post-Installation Problems

The Whole Cluster Is Down

One Node Is Down

Applying Service Packs and Hotfixes

One or More Servers Stop Responding

Cluster Service Does Not Start

Cluster Service Starts but Cluster Administrator Does Not Connect

Group/Resource Failover Problems

Physical Disk Resource Problems

File Share Does Not Go Online

Problems Accessing a Drive

Network Name Resource Does Not Go Online

Chapter 4: Problems Administering the Cluster

Cannot Connect to the Cluster Through Cluster Administrator

Cluster Administrator Stops Responding on Failover

Cannot Move a Group

Cannot Delete a Group

Problems Adding, Deleting, or Moving Resources

Adding Resources

Using the Generic Resource for Non-Microsoft Applications

Deleting Resources

Moving Resources from One Group to Another

Chkdsk and Autochk

Failover and Failback

Failover

Failback

Move Group

Factors That Influence Failover Time

Chapter 5: SCSI Storage

Verifying Configuration

Adding Devices to the Shared SCSI Bus

Verifying Cables and Termination

Chapter 6: Fibre Channel Storage

Verifying Configuration

Adding Devices

Testing Devices

Snapshot Volumes

Chapter 7: Client Connectivity Problems

Clients Have Intermittent Connectivity Based on Group Ownership

Clients Do Not Have Any Connectivity with the Cluster

Clients Have Problems Accessing Data Through a File Share

Clients Cannot Access Cluster Resources Immediately After You Change an IP Address

Clients Experience Intermittent Access

Chapter 8: Print Spooler Problems

About Printer Drivers

Driver Problems

Point and Print

Driver Synchronization

Rolling Upgrade from Windows NT 4.0 to Windows 2000

Migration from Windows NT 4.0 (as Opposed to Upgrade)

Chapter 9: Kerberos Features Available with Windows 2000 Service Pack 3

Publishing Printers to Active Directory

Effects on Microsoft Exchange 2000 Server or Microsoft SQL Server

Effects on File Shares

More Information About the Cluster Service and Kerberos

Chapter 10: DHCP and WINS

WINS

DHCP

Jetpack

DHCP and WINS Reference Material

Chapter 11: File Shares

Basic File Shares

Dynamic File Shares (Share Subdirectories)

DFS Root

Chapter 12: Chkdsk and NTFS

The Importance of Chkdsk

Delaying Chkdsk

Handling Corrupted Volumes

Duration of Chkdsk

Faster Chkdsk

Proactive Ways to Test a Volume

Chkdsk and the Cluster Service

Chkdsk Performance: Windows NT 4.0 vs. Windows 2000

SAN Solutions with Data Snapshot Capability

Things You Can Do to Increase Speed

Related Microsoft Knowledge Base Articles

Chapter 13: Maintenance

Installing Service Packs

Service Packs and Interoperability Problems

Replacing Adapters

Shared Disk Subsystem Replacement

System Backups and Recovery

Administrative Suggestions

Use the Resource Description Field

Remove Empty Groups

Avoid Unnecessary or Redundant Dependencies

Creating Redundancy for Network Adapters

Check the Size of the Quorum Log File

Monitor Performance

Load Balance

Practices to Avoid on a Cluster

Appendix A: Event messages

Cluster Service Events

Related Event Messages

Appendix B: The Cluster Log File

CLUSTERLOG Environment Variable

Annotated Cluster Log

Version and Service Pack Information

Initialization

Determining the Node ID

Determining the Cluster Service Account That the Node Uses

Trying to Join First

No Join Sponsors Are Available

Reading from the Quorum Disk

Enumerating Drives

Reading from Quolog.log File

Checking for a Quorum Tombstone Entry

Checking the Time to Load a New CLUSDB

Enumerated Networks

Disabling Mixed Operation

Forming Cluster Membership

Identifying Resource Types

Cluster Membership Limit

Cluster Service Started

Log File Entries for Common Failures

Example 1: Disk Failure

Example 2: Duplicate Cluster IP Address

Appendix C: Command-Line Administration

Using Cluster.exe

Basic Syntax

Cluster Commands

Node Commands

Group Commands

Resource Commands

Example of a Batch Job

Index

Introduction

This white paper discusses troubleshooting and maintenance techniques for the Cluster service that is included with Microsoft Windows 2000 Advanced Server and Microsoft Windows 2000 Datacenter Server. Although these versions are very similar to the first implementation of the service, Microsoft Cluster Service (MSCS) version 1.0, this document is specific to Windows 2000. (MSCS was included with Microsoft Windows NT® 4.0.) MSCS and Windows 2000 Advanced Server support a maximum of two servers in a cluster. These servers are frequently referred to as nodes. Windows 2000 Datacenter Server supports up to four nodes.

Because many different types of resources can be managed in a cluster, it may sometimes be difficult for an administrator to determine which component or resource is causing failures. In many cases, the Cluster service can automatically detect and recover from server or application failures. However, sometimes you may have to troubleshoot attached resources, devices, or applications.

This document is based on the original cluster troubleshooting white paper, Microsoft Cluster Server Troubleshooting and Maintenance. That paper was specific to Windows NT Server, Enterprise Edition, version 4.0.

Server Clustering

Clustering is an old term in the computing industry. Many readers think that clustering is a complicated subject because early implementations were large, complex, and sometimes difficult to configure. These early clusters were difficult to maintain unless you had an extensively trained and experienced administrator.

Microsoft extended the capabilities of the Windows NT Server operating system through the Enterprise Edition. Microsoft Windows NT Server, Enterprise Edition, includes MSCS. MSCS adds clustering capabilities to Windows NT to achieve high availability, easier manageability, and greater scalability through server clustering.

Windows 2000 Advanced Server and Windows 2000 Datacenter Server also include these high-availability features through the Cluster service (ClusSvc). In these versions of Windows, the core functions and features of the Cluster service are unchanged, although the service includes improvements and new features.

Chapter 1: Pre-installation

Cluster Hardware Compatibility List

Most hardware configurations on the Microsoft hardware compatibility list (HCL) use industry-standard hardware and avoid proprietary solutions, so that you can easily add or replace hardware. Supported configurations use only hardware that was validated ("certified") by the Cluster Server Hardware Compatibility Test (HCT). These tests are intensive and go beyond the standard compatibility testing for Microsoft Windows. Microsoft supports the Windows Cluster service only when this feature is used on a validated cluster configuration. Validation is available only for complete configurations tested as a unit. Visit the following Microsoft Web sites to view the cluster hardware compatibility list:

http://www.microsoft.com/windows2000/advancedserver/howtobuy/upgrading/compat/search/computers.asp

http://www.microsoft.com/windows2000/datacenter/hcl/

Note: Storage area network (SAN) solutions for simultaneous use with multiple clusters require a separate certification.

Configuring the Hardware

The Cluster service installation process relies on correctly configured hardware. Therefore, make sure that you configure and test each device before you try to configure the Cluster service. A typical cluster configuration includes:

·  Two servers.

·  At least two Peripheral Component Interconnect (PCI) network adapters for each server.

·  Local storage.

·  One or more storage systems for external storage, with both servers remaining connected to this storage. This may be a common small computer system interface (SCSI) bus that is separate from any local storage, or it may be Fibre Channel storage.

Although you can configure a cluster that uses only one network adapter in each server, Microsoft strongly recommends that you have a second isolated network for cluster communications. Validated cluster solutions contain at least one isolated network for cluster communications. This is referred to as the private interconnect. You can also configure the cluster to use the primary nonisolated network for cluster communications if the isolated network fails.

The cluster nodes must communicate with each other on a time-critical basis. Communication between nodes is sometimes referred to as the heartbeat. It is important that the heartbeat packets are sent and received on schedule, so Microsoft recommends that you use only PCI-based network adapters, because the PCI bus has the highest priority. Fault-tolerant network adapters are not supported on the private interconnect. During port failure recovery, fault-tolerant network adapters can delay heartbeat packets significantly and actually cause cluster node failure. For redundancy of the private interconnect, it is more effective to form a second isolated network to function if the primary private interconnect fails.
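The network adapter guidance above can be expressed as a small pre-installation check. The following Python sketch is illustrative only: the Adapter record, its field names, and the check_node_adapters function are hypothetical and are not part of the Cluster service or any Microsoft tool. The sketch assumes that you describe each node's adapters by hand and want a list of warnings before you start Setup.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Adapter:
    """Hand-written description of one network adapter (hypothetical model)."""
    name: str
    bus: str                      # for example "PCI"
    private: bool                 # True if attached to an isolated cluster-only network
    fault_tolerant: bool = False  # True if the adapter is part of a fault-tolerant team


def check_node_adapters(adapters: List[Adapter]) -> List[str]:
    """Return warnings for one node; an empty list means no issue was found."""
    warnings = []
    if len(adapters) < 2:
        warnings.append("fewer than two network adapters")
    if not any(a.private for a in adapters):
        warnings.append("no isolated network for the private interconnect")
    for a in adapters:
        if a.bus != "PCI":
            warnings.append(f"{a.name}: not a PCI-based adapter")
        if a.private and a.fault_tolerant:
            warnings.append(f"{a.name}: fault-tolerant adapter on the private interconnect")
    return warnings


if __name__ == "__main__":
    node = [Adapter("public", "PCI", private=False),
            Adapter("heartbeat", "PCI", private=True)]
    print(check_node_adapters(node))
```

A check of this kind only restates the recommendations in this section. It does not replace testing each adapter and network before you configure the Cluster service.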


Figure 1

Cluster storage is made up of a compatible PCI-based storage adapter in each server that is separate from local storage. Each server in the cluster is connected to the storage that is allocated specifically for cluster use. When you use SCSI technology for cluster storage, each cluster uses at least one SCSI adapter that is dedicated for use with the intended external cluster storage. Because both servers in the cluster connect to the same bus at the same time, one SCSI host adapter uses the default ID, 7, and the other adapter uses ID 6. This configuration makes sure that the host adapters have the highest priority on the bus. The bus is referred to as the shared SCSI bus, because both systems share connectivity on this bus but arbitrate (negotiate) for exclusive access to one or more attached disk devices. The Cluster service controls exclusive access to the disk devices through the SCSI reserve and release commands.
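The ID assignment described above can be checked with a short script. The following Python sketch is an illustration, not a Microsoft utility: the function names are hypothetical, and the priority order it uses (7 down to 0, followed by 15 down to 8 on a wide bus) is the standard SCSI arbitration order rather than something stated in this paper. The sketch assumes that you supply the IDs of the host adapters and of the disks on the shared bus.

```python
def scsi_priority_order(wide=False):
    """Return SCSI IDs from highest to lowest arbitration priority."""
    narrow = list(range(7, -1, -1))           # 7, 6, ..., 0
    if not wide:
        return narrow
    return narrow + list(range(15, 7, -1))    # then 15, 14, ..., 8


def check_shared_bus_ids(host_ids, disk_ids, wide=False):
    """Return a list of problems with the ID assignment on one shared bus."""
    order = scsi_priority_order(wide)
    rank = {scsi_id: position for position, scsi_id in enumerate(order)}
    all_ids = list(host_ids) + list(disk_ids)
    problems = []

    invalid = [i for i in all_ids if i not in rank]
    if invalid:
        problems.append(f"IDs not valid on this bus: {invalid}")
        return problems

    if len(set(all_ids)) != len(all_ids):
        problems.append("duplicate SCSI ID on the shared bus")

    if host_ids and disk_ids:
        lowest_host = max(rank[h] for h in host_ids)    # weakest host adapter
        highest_disk = min(rank[d] for d in disk_ids)   # strongest disk
        if highest_disk < lowest_host:
            problems.append("a disk has higher priority than a host adapter")
    return problems


if __name__ == "__main__":
    # Host adapters at IDs 7 and 6, disks at IDs 0 through 2.
    print(check_shared_bus_ids([7, 6], [0, 1, 2]))
```

With host adapters at IDs 7 and 6, no disk on the bus can outrank either adapter, which is the reason for the assignment that this section describes.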

Fibre Channel storage is frequently used as the storage medium for highly available clusters. Common implementations use Fibre Channel Arbitrated Loop (FC-AL) or Switched Fibre Channel Fabrics (FC-SW). Although the word fibre suggests that the technology uses fiber optic cables, the Fibre Channel specifications allow use of fiber optic or copper cable interconnects.

Before you configure the Cluster service, see the documentation that is included with the validated cluster hardware for installation guidelines and instructions. Some configurations may require a different sequence of installation steps than the steps that are described in the Cluster service documentation.

Installing the Operating System

Before you install Windows 2000, you must decide what role each computer will have in the domain. Will the servers participate in a Windows NT 4.0 domain, or will they be in a domain that uses Microsoft Windows 2000 Active Directory® directory service? A cluster node can be a member server or a domain controller.

The member server role for each cluster node is a possible solution but may have several disadvantages. Node-to-node communication and various registry operations in the cluster require authentication from the domain. Sometimes during ordinary operations, a server may need to receive authentication. Member servers rely on domain controllers on the network for this type of authentication. Lack of connectivity with a domain controller may severely affect performance and may also cause one or more cluster nodes to stop responding until connection with a domain controller is reestablished. In a worst-case scenario, loss of network connectivity with domain controllers may cause complete failure of the cluster. You must make sure that one or more domain controllers are always available.
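Because member-server nodes depend on domain controllers, it can help to verify reachability from each node on a schedule. The following Python sketch is a rough illustration under stated assumptions: the function names and host names are hypothetical, and the default probe opens a TCP connection to port 389 (LDAP), which applies to Active Directory domain controllers and not to Windows NT 4.0 domain controllers. A successful connection shows only that the server answers on that port. It does not prove that authentication will succeed.

```python
import socket
from typing import Callable, Iterable, List


def tcp_probe(host: str, port: int = 389, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (assumed LDAP port)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def reachable_domain_controllers(
    hosts: Iterable[str],
    probe: Callable[[str], bool] = tcp_probe,
) -> List[str]:
    """Return the subset of hosts for which the probe succeeds."""
    return [host for host in hosts if probe(host)]


if __name__ == "__main__":
    # Hypothetical host names; replace them with your own domain controllers.
    candidates = ["dc1.example.com", "dc2.example.com"]
    available = reachable_domain_controllers(candidates)
    if not available:
        print("No domain controller is reachable from this node")
    else:
        print("Reachable:", ", ".join(available))
```

The probe is passed in as a parameter so that a different check, such as a NetBIOS session test for a Windows NT 4.0 domain, can be substituted without changing the rest of the script.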