Troubleshooting server clusters

The most valuable cluster troubleshooting information is provided in the cluster logs. The configuration log, ClCfgSrv.log, contains entries describing the actions performed when running the New Server Cluster and Add Nodes Wizards. It is located in the %Windir%\System32\LogFiles\Cluster folder on the computer from which the wizards are run. A diagnostic log stores information about all other cluster activities. Its location is controlled by the value of the ClusterLog system environment variable, which is set by default to %Windir%\Cluster\Cluster.log. The ClusterLogSize system environment variable specifies the maximum size of the log in megabytes (the default is 8 MB).
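As a minimal sketch of how these two variables and their defaults combine, the following Python function resolves the diagnostic log path and size limit from a dictionary of environment variables. The function name, the case-insensitive lookup, and the fallback Windows directory are assumptions made for illustration only.

```python
import re

# Defaults as described in the text above.
DEFAULT_LOG = r"%Windir%\Cluster\Cluster.log"
DEFAULT_SIZE_MB = 8


def cluster_log_settings(env):
    """Return (log_path, max_size_mb) resolved from a dict of environment variables.

    Hypothetical helper: variable names are matched case-insensitively, and
    C:\\WINDOWS is assumed when no Windir variable is present.
    """
    env_ci = {key.lower(): value for key, value in env.items()}
    windir = env_ci.get("windir", r"C:\WINDOWS")
    raw_path = env_ci.get("clusterlog", DEFAULT_LOG)
    # Expand %Windir% by hand so the sketch behaves the same on any platform.
    path = re.sub(r"%windir%", lambda m: windir, raw_path, flags=re.IGNORECASE)
    size_mb = int(env_ci.get("clusterlogsize", DEFAULT_SIZE_MB))
    return path, size_mb
```

On a cluster node, the function could be called as cluster_log_settings(dict(os.environ)) after importing os, to confirm which file to open and how much history it is likely to hold.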

Although it takes some time to become familiar with the cryptic notation of the log entries, they contain a wealth of information about the cluster's activities and usually enable you to pinpoint the cause of cluster-related problems. Each entry in the log is timestamped and identifies the process and thread that generated it. Timestamps are recorded in Greenwich Mean Time, so account for any time difference when analyzing the log. Most of the resource references use their GUIDs instead of names, but you can look up the corresponding names by checking the keys in the HKEY_LOCAL_MACHINE\Cluster portion of the registry. A detailed tutorial on analysis of the cluster log can be found in the Microsoft Resource Kit.
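The sketch below shows one way to automate the two chores just mentioned: shifting a GMT timestamp to local time and substituting resource names for GUIDs. The entry layout (hexadecimal process ID, a dot, hexadecimal thread ID, two colons, then a date-time stamp) is an assumption about the log format and should be checked against your own Cluster.log. The function name and the names dictionary are hypothetical, and the dictionary would have to be filled in by hand from the registry.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout: "<pid>.<tid>::YYYY/MM/DD-HH:MM:SS.mmm <message>"
ENTRY = re.compile(
    r"^(?P<pid>[0-9a-fA-F]+)\.(?P<tid>[0-9a-fA-F]+)::"
    r"(?P<stamp>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+(?P<text>.*)$"
)
GUID = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def parse_entry(line, utc_offset_hours=0, names=None):
    """Parse one log line; return a dict, or None if the line does not match."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    # The log records GMT; attach UTC and then shift to the requested offset.
    gmt = datetime.strptime(match["stamp"], "%Y/%m/%d-%H:%M:%S.%f")
    gmt = gmt.replace(tzinfo=timezone.utc)
    local = gmt.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    # Swap any known GUID for its friendly name; unknown GUIDs stay as-is.
    names = names or {}
    text = GUID.sub(lambda g: names.get(g.group(0).lower(), g.group(0)), match["text"])
    return {
        "pid": int(match["pid"], 16),
        "tid": int(match["tid"], 16),
        "local_time": local,
        "text": text,
    }
```

Keys in the names dictionary are expected in lowercase, so GUIDs copied from the registry should be lowercased before use.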

More significant cluster events are also written to the Windows System Event Log. Using the CLUSDIAG Resource Kit Utility, you can compare cluster-related events from multiple nodes of the same cluster.

If problems are related to a lack of stability for a particular type of resource, you can assign to it a separate Resource Monitor. This will isolate failures to the resource only, without affecting the rest of the cluster. This option is available from the General tab of the resource properties window.

The most severe problems are caused by quorum-related failures. If the disk containing the quorum fails, or its file system, checkpoint file, or log becomes corrupted, the cluster will not be able to function. You also cannot use Cluster Administrator or the CLUSTER.EXE utility as long as the Cluster Service is stopped. To start recovery, shut down all the cluster nodes, leaving only the one on which you will perform the restore and the shared storage device running.

Next, launch the Services console from the Administrative Tools menu on one of the cluster nodes, locate Cluster Service, open its Properties window, and type in the -fixquorum parameter (with the leading dash) in the Start Parameters text box. After you click the Start button, the Cluster Service will start without a quorum resource. Even though all of the resources (including quorum) will remain offline, you can take one of the following actions to fix the problem (which one you choose will depend on the cause of the failure):

♦ Change the quorum location (using the steps described in the section "Server cluster management"). A local quorum can be used if no other shared disk is available. In order to connect to the cluster using Cluster Administrator, you have to use either an individual node name or a single dot, directly from the node's console. You can't connect to the cluster via its IP address or name, because both resources are offline.

♦ Run Chkdsk with /f and /r switches to fix the file system corruption. Prior to running it, you will have to manually bring the quorum disk online (otherwise, you will not be able to access it). Running chkdsk against a clustered disk creates a log that you can review in order to identify the level of corruption and final outcome.

♦ Replace the quorum disk and restore the original signature (and quorum database, if desired) with ASR, described in the preceding section. A mismatch in the disk signature for a quorum disk manifests itself as event ID 1034 in the Windows System Event Log. You can also use ClusterRecovery from the Resource Kit to reset the disk signature without running ASR.

After you successfully complete one of these three actions, verify whether you have a recent backup of the quorum disk. If so, restore it and restart the Cluster Service normally, without any parameters. Otherwise, stop the Cluster Service and restart it using the resetquorumlog parameter. This will create new quorum files by copying the content of the local cluster database of the node from which recovery is run.

If the local cluster database on one of the nodes is corrupted (indicated by the System Event Log entries), you can either perform a system-state restore or copy the current checkpoint file from the shared quorum disk to the %WinDir%\Cluster folder and rename it CLUSDB, overwriting the corrupted file. You can also copy the CLUSDB from another node, which requires stopping the Cluster Service on that node and unloading the Cluster registry hive using REGEDIT.EXE.

When operating with a Majority Node Set server cluster, a quorum might fail if a majority of the servers becomes unavailable. Most commonly, this happens as a result of network problems (this is not an unlikely event, considering that the Majority Node Set cluster uses the same network for public and private network traffic). If this happens and there is no chance of quick recovery, it is possible to create a quorum with the remaining nodes only. Create a ForceQuorum REG_SZ entry in the registry at HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters. This entry needs to contain a comma-separated list of names of the servers that are supposed to form the quorum set. The same result can be accomplished by using the CLUSTER.EXE utility with the /forcequorum option.
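To make the majority rule and the ForceQuorum value format concrete, here is a small sketch. It assumes the usual majority definition (more than half of the configured nodes must be available), and all function names are hypothetical. The code only computes values and does not touch the registry.

```python
def votes_needed(total_nodes):
    """Smallest number of nodes that is more than half of the cluster."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes, available_nodes):
    """True if the available nodes still form a majority."""
    return available_nodes >= votes_needed(total_nodes)


def force_quorum_value(node_names):
    """Build the comma-separated node list for the ForceQuorum REG_SZ entry."""
    cleaned = [name.strip() for name in node_names if name.strip()]
    if not cleaned:
        raise ValueError("at least one node name is required")
    return ",".join(cleaned)
```

For example, a five-node set needs three nodes, so has_quorum(5, 2) is False, and force_quorum_value(["NODE1", "NODE2"]) yields the string to store for the two surviving nodes. On a Windows node, that string could then be written with the standard-library winreg module, although entering it by hand in the registry editor works equally well.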

Problems that occur during the transition of a group membership during failover or failback and the joining of a cluster are frequently related to failing network connectivity. Network problems might prevent authentication requests from reaching the domain controllers. Healthy WINS and DNS infrastructures are required for the proper name resolution and registration. Verify that the WINS and DNS parameters are properly set in the TCP/IP configuration of your cluster network adapters. If your servers point to different DNS servers for name resolution, DNS replication delay can cause problems when adding nodes to a cluster.

Remote Procedure Call (RPC) connectivity problems typically affect failover. An RPCPing utility (from the Exchange 2003 installation CD) or a network analyzer (such as Network Monitor, available on Windows Server 2003) can be used to verify whether this is the case.
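When neither tool is at hand, a quick first check is whether the RPC endpoint mapper port (TCP 135) on the other node accepts connections at all. The sketch below does only that. It is not a substitute for RPCPing, because it does not issue an RPC call or test the dynamically assigned ports, and the function name is hypothetical.

```python
import socket

RPC_ENDPOINT_MAPPER_PORT = 135  # well-known TCP port of the RPC endpoint mapper


def tcp_port_open(host, port=RPC_ENDPOINT_MAPPER_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Running tcp_port_open("NODE2") from NODE1 (node names here are placeholders) and getting False suggests a firewall, routing, or name-resolution problem worth investigating before looking at the cluster itself.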

When troubleshooting network connectivity issues, keep in mind that the cluster nodes are capable of detecting network failures. The state of the network interfaces participating in cluster communication (both public and private) is displayed in the Network Interfaces node in the Cluster Administrator. Windows Server 2003 servers check for hardware failures by employing the Network Driver Interface Specification (NDIS) network driver features. The server cluster also periodically pings external resources to confirm that connectivity with the rest of the public network has not been lost. The state of the public network connections is also taken into consideration when determining a quorum owner in the case of a private network failure.
