OpenVMS Cluster Systems

Document revision date: 19 July 1999

OpenVMS Cluster Systems

Contents

Index

C.4 Computer Fails to Join the Cluster

If a computer fails to join the cluster, follow the procedures in this section to determine the cause.

C.4.1 Verifying OpenVMS Cluster Software Load

To verify that OpenVMS Cluster software has been loaded, follow these instructions:

Step Action

1 Look for connection manager (%CNXMAN) messages like those shown in Section C.1.2.

2 If no such messages are displayed, OpenVMS Cluster software probably was not loaded at boot time. Reboot the computer in conversational mode. At the SYSBOOT> prompt, set the VAXCLUSTER parameter to 2.

3 For OpenVMS Cluster systems communicating over the LAN or mixed interconnects, set NISCS_LOAD_PEA0 to 1 and VAXCLUSTER to 2. These parameters should also be set in the computer's MODPARAMS.DAT file. (For more information about booting a computer in conversational mode, consult your installation and operations guide).

4 For OpenVMS Cluster systems on the LAN, verify that the cluster security database file (SYS$COMMON:CLUSTER_AUTHORIZE.DAT) exists and that you have specified the correct group number for this cluster (see Section 10.9.1).

Step	Action
1	Look for connection manager (%CNXMAN) messages like those shown in Section C.1.2.
2	If no such messages are displayed, OpenVMS Cluster software probably was not loaded at boot time. Reboot the computer in conversational mode. At the SYSBOOT> prompt, set the VAXCLUSTER parameter to 2.
3	For OpenVMS Cluster systems communicating over the LAN or mixed interconnects, set NISCS_LOAD_PEA0 to 1 and VAXCLUSTER to 2. These parameters should also be set in the computer's MODPARAMS.DAT file. (For more information about booting a computer in conversational mode, consult your installation and operations guide).
4	For OpenVMS Cluster systems on the LAN, verify that the cluster security database file (SYS$COMMON:CLUSTER_AUTHORIZE.DAT) exists and that you have specified the correct group number for this cluster (see Section 10.9.1).

C.4.2 Verifying Boot Disk and Root

To verify that the computer has booted from the correct disk and system root, follow these instructions:

Step Action

1 If %CNXMAN messages are displayed, and if, after the conversational reboot, the computer still does not join the cluster, check the console output on all active computers and look for messages indicating that one or more computers found a remote computer that conflicted with a known or local computer. Such messages suggest that two computers have booted from the same system root.

2 Review the boot command files for all CI computers and ensure that all are booting from the correct disks and from unique system roots.

3 If you find it necessary to modify the computer's bootstrap command procedure (console media), you may be able to do so on another processor that is already running in the cluster.
Replace the running processor's console media with the media to be modified, and use the Exchange utility and a text editor to make the required changes. Consult the appropriate processor-specific installation and operations guide for information about examining and editing boot command files.

Step	Action
1	If %CNXMAN messages are displayed, and if, after the conversational reboot, the computer still does not join the cluster, check the console output on all active computers and look for messages indicating that one or more computers found a remote computer that conflicted with a known or local computer. Such messages suggest that two computers have booted from the same system root.
2	Review the boot command files for all CI computers and ensure that all are booting from the correct disks and from unique system roots.
3	If you find it necessary to modify the computer's bootstrap command procedure (console media), you may be able to do so on another processor that is already running in the cluster. Replace the running processor's console media with the media to be modified, and use the Exchange utility and a text editor to make the required changes. Consult the appropriate processor-specific installation and operations guide for information about examining and editing boot command files.

C.4.3 Verifying SCSNODE and SCSSYSTEMID Parameters

To be eligible to join a cluster, a computer must have unique SCSNODE and SCSSYSTEMID parameter values.

Step Action

1 Check that the current values do not duplicate any values set for existing OpenVMS Cluster computers. To check values, you can perform a conversational bootstrap operation.

2 If the values of SCSNODE or SCSSYSTEMID are not unique, do either of the following:

Alter both values.
Reboot all other computers.

Note: To modify values, you can perform a conversational bootstrap operation. However, for reliable future bootstrap operations, specify appropriate values for these parameters in the computer's MODPARAMS.DAT file.

WHEN you change... THEN...

The SCSNODE parameter Change the DECnet node name too, because both names must be the same.

Either the SCSNODE parameter or the SCSSYSTEMID parameter on a node that was previously an OpenVMS Cluster member Change the DECnet node number, too, because both numbers must be the same. Reboot the entire cluster.

C.4.4 Verifying Cluster Security Information

To verify the cluster group code and password, follow these instructions:

Step Action

1 Verify that the database file SYS$COMMON:CLUSTER_AUTHORIZE.DAT exists.

2 For clusters with multiple system disks, ensure that the correct (same) group number and password were specified for each.
Reference: See Section 10.9 to view the group number and to reset the password in the CLUSTER_AUTHORIZE.DAT file using the SYSMAN utility.

Step	Action
1	Verify that the database file SYS$COMMON:CLUSTER_AUTHORIZE.DAT exists.
2	For clusters with multiple system disks, ensure that the correct (same) group number and password were specified for each. Reference: See Section 10.9 to view the group number and to reset the password in the CLUSTER_AUTHORIZE.DAT file using the SYSMAN utility.

C.5 Startup Procedures Fail to Complete

If a computer boots and joins the cluster but appears to hang before startup procedures complete---that is, before you are able to log in to the system---be sure that you have allowed sufficient time for the startup procedures to execute.

IF... THEN...

The startup procedures fail to complete after a period that is normal for your site. Try to access the procedures from another OpenVMS Cluster computer and make appropriate adjustments. For example, verify that all required devices are configured and available. One cause of such a failure could be the lack of some system resource, such as NPAGEDYN or page file space.

You suspect that the value for the NPAGEDYN parameter is set too low. Perform a conversational bootstrap operation to increase it. Use SYSBOOT to check the current value, and then double the value.

You suspect a shortage of page file space, and another OpenVMS Cluster computer is available. Log in on that computer and use the System Generation utility (SYSGEN) to provide adequate page file space for the problem computer.
Note: Insufficent page-file space on the booting computer might cause other computers to hang.

The computer still cannot complete the startup procedures. Contact your Compaq support representative.

IF...	THEN...
The startup procedures fail to complete after a period that is normal for your site.	Try to access the procedures from another OpenVMS Cluster computer and make appropriate adjustments. For example, verify that all required devices are configured and available. One cause of such a failure could be the lack of some system resource, such as NPAGEDYN or page file space.
You suspect that the value for the NPAGEDYN parameter is set too low.	Perform a conversational bootstrap operation to increase it. Use SYSBOOT to check the current value, and then double the value.
You suspect a shortage of page file space, and another OpenVMS Cluster computer is available.	Log in on that computer and use the System Generation utility (SYSGEN) to provide adequate page file space for the problem computer. Note: Insufficent page-file space on the booting computer might cause other computers to hang.
The computer still cannot complete the startup procedures.	Contact your Compaq support representative.

C.6 Diagnosing LAN Component Failures

Section D.5 provides troubleshooting techniques for LAN component failures (for example, broken LAN bridges). That appendix also describes techniques for using the Local Area OpenVMS Cluster Network Failure Analysis Program.

Intermittent LAN component failures (for example, packet loss) can cause problems in the NISCA transport protocol that delivers System Communications Services (SCS) messages to other nodes in the OpenVMS Cluster. Appendix F describes troubleshooting techniques and requirements for LAN analyzer tools.

C.7 Diagnosing Cluster Hangs

Conditions like the following can cause a OpenVMS Cluster computer to suspend process or system activity (that is, to hang):

Condition Reference

Cluster quorum is lost. Section C.7.1

A shared cluster resource is inaccessible. Section C.7.2

Condition	Reference
Cluster quorum is lost.	Section C.7.1
A shared cluster resource is inaccessible.	Section C.7.2

C.7.1 Cluster Quorum is Lost

The OpenVMS Cluster quorum algorithm coordinates activity among OpenVMS Cluster computers and ensures the integrity of shared cluster resources. (The quorum algorithm is described fully in Chapter 2.) Quorum is checked after any change to the cluster configuration---for example, when a voting computer leaves or joins the cluster. If quorum is lost, process and I/O activity on all computers in the cluster are blocked.

Information about the loss of quorum and about clusterwide events that cause loss of quorum are sent to the OPCOM process, which broadcasts messages to designated operator terminals. The information is also broadcast to each computer's operator console (OPA0), unless broadcast activity is explicitly disabled on that terminal. However, because quorum may be lost before OPCOM has been able to inform the operator terminals, the messages sent to OPA0 are the most reliable source of information about events that cause loss of quorum.

If quorum is lost, you might add or reboot a node with additional votes.

Reference: See also the information about cluster quorum in Section 10.12.

C.7.2 Inaccessible Cluster Resource

Access to shared cluster resources is coordinated by the distributed lock manager. If a particular process is granted a lock on a resource (for example, a shared data file), other processes in the cluster that request incompatible locks on that resource must wait until the original lock is released. If the original process retains its lock for an extended period, other processes waiting for the lock to be released may appear to hang.

Occasionally, a system activity must acquire a restrictive lock on a resource for an extended period. For example, to perform a volume rebuild, system software takes out an exclusive lock on the volume being rebuilt. While this lock is held, no processes can allocate space on the disk volume. If they attempt to do so, they may appear to hang.

Access to files that contain data necessary for the operation of the system itself is coordinated by the distributed lock manager. For this reason, a process that acquires a lock on one of these resources and is then unable to proceed may cause the cluster to appear to hang.

For example, this condition may occur if a process locks a portion of the system authorization file (SYS$SYSTEM:SYSUAF.DAT) for write access. Any activity that requires access to that portion of the file, such as logging in to an account with the same or similar user name or sending mail to that user name, is blocked until the original lock is released. Normally, this lock is released quickly, and users do not notice the locking operation.

However, if the process holding the lock is unable to proceed, other processes could enter a wait state. Because the authorization file is used during login and for most process creation operations (for example, batch and network jobs), blocked processes could rapidly accumulate in the cluster. Because the distributed lock manager is functioning normally under these conditions, users are not notified by broadcast messages or other means that a problem has occurred.

C.8 Diagnosing CLUEXIT Bugchecks

The operating system performs bugcheck operations only when it detects conditions that could compromise normal system activity or endanger data integrity. A CLUEXIT bugcheck is a type of bugcheck initiated by the connection manager, the OpenVMS Cluster software component that manages the interaction of cooperating OpenVMS Cluster computers. Most such bugchecks are triggered by conditions resulting from hardware failures (particularly failures in communications paths), configuration errors, or system management errors.

C.8.1 Conditions Causing Bugchecks

The most common conditions that result in CLUEXIT bugchecks are as follows:

Possible Bugcheck Causes Recommendations

The cluster connection between two computers is broken for longer than RECNXINTERVAL seconds. Thereafter, the connection is declared irrevocably broken. If the connection is later reestablished, one of the computers shut down with a CLUEXIT bugcheck.
This condition can occur:

Upon recovery with battery backup after a power failure
After the repair of an SCS communication link
After the computer was halted for a period longer than the number of seconds specified for the RECNXINTERVAL parameter and was restarted with a CONTINUE command entered at the operator console
Determine the cause of the interrupted connection and correct the problem. For example, if recovery from a power failure is longer than RECNXINTERVAL seconds, you may want to increase the value of the RECNXINTERVAL parameter on all computers.

Cluster partitioning occurs. A member of a cluster discovers or establishes connection to a member of another cluster, or a foreign cluster is detected in the quorum file. Review the setting of EXPECTED_VOTES on all computers.

The value specified for the SCSMAXMSG system parameter on a computer is too small. Verify that the value of SCSMAXMSG on all OpenVMS Cluster computers is set to a value that is at the least the default value.

Possible Bugcheck Causes	Recommendations
The cluster connection between two computers is broken for longer than RECNXINTERVAL seconds. Thereafter, the connection is declared irrevocably broken. If the connection is later reestablished, one of the computers shut down with a CLUEXIT bugcheck. This condition can occur: Upon recovery with battery backup after a power failure After the repair of an SCS communication link After the computer was halted for a period longer than the number of seconds specified for the RECNXINTERVAL parameter and was restarted with a CONTINUE command entered at the operator console	Determine the cause of the interrupted connection and correct the problem. For example, if recovery from a power failure is longer than RECNXINTERVAL seconds, you may want to increase the value of the RECNXINTERVAL parameter on all computers.
Cluster partitioning occurs. A member of a cluster discovers or establishes connection to a member of another cluster, or a foreign cluster is detected in the quorum file.	Review the setting of EXPECTED_VOTES on all computers.
The value specified for the SCSMAXMSG system parameter on a computer is too small.	Verify that the value of SCSMAXMSG on all OpenVMS Cluster computers is set to a value that is at the least the default value.

C.9 Diagnosing Port Communication Problems

C.9.1 Port Polling

Shortly after a CI computer boots, the CI port driver (PADRIVER) begins configuration polling to discover other active ports on the CI. Normally, the poller runs every 5 seconds (the default value of the PAPOLLINTERVAL system parameter). In the first polling pass, all addresses are probed over cable path A; on the second pass, all addresses are probed over path B; on the third pass, path A is probed again; and so on.

The poller probes by sending Request ID (REQID) packets to all possible port numbers, including itself. Active ports receiving the REQIDs return ID Received packet (IDREC) to the port issuing the REQID. A port might respond to a REQID even if the computer attached to the port is not running.

For OpenVMS Cluster systems communicating over the CI, DSSI, or a combination of these interconnects, the port drivers perform a start handshake when a pair of ports and port drivers has successfully exchanged ID packets. The port drivers exchange datagrams containing information about the computers, such as the type of computer and the operating system version. If this exchange is successful, each computer declares a virtual circuit open. An open virtual circuit is prerequisite to all other activity.

C.9.2 LAN Communications

For clusters that include Ethernet or FDDI interconnects, a multicast scheme is used to locate computers on the LAN. Approximately every 3 seconds, the port emulator driver (PEDRIVER) sends a HELLO datagram message through each LAN adapter to a cluster-specific multicast address that is derived from the cluster group number. The driver also enables the reception of these messages from other computers. When the driver receives a HELLO datagram message from a computer with which it does not currently share an open virtual circuit, it attempts to create a circuit. HELLO datagram messages received from a computer with a currently open virtual circuit indicate that the remote computer is operational.

A standard, three-message exchange handshake is used to create a virtual circuit. The handshake messages contain information about the transmitting computer and its record of the cluster password. These parameters are verified at the receiving computer, which continues the handshake only if its verification is successful. Thus, each computer authenticates the other. After the final message, the virtual circuit is opened for use by both computers.

C.9.3 System Communications Services (SCS) Connections

System services such as the disk class driver, connection manager, and the MSCP and TMSCP servers communicate between computers with a protocol called System Communications Services (SCS). SCS is responsible primarily for forming and breaking intersystem process connections and for controlling flow of message traffic over those connections. SCS is implemented in the port driver (for example, PADRIVER, PBDRIVER, PEDRIVER, PIDRIVER), and in a loadable piece of the operating system called SCSLOA.EXE (loaded automatically during system initialization).

When a virtual circuit has been opened, a computer periodically probes a remote computer for system services that the remote computer may be offering. The SCS directory service, which makes known services that a computer is offering, is always present both on computers and HSC subsystems. As system services discover their counterparts on other computers and HSC subsystems, they establish SCS connections to each other. These connections are full duplex and are associated with a particular virtual circuit. Multiple connections are typically associated with a virtual circuit.

C.10 Port Failures

C.10.1 Hierarchy of Communication Paths

Taken together, SCS, the port drivers, and the port itself support a hierarchy of communication paths. Starting with the most fundamental level, these are as follows:

The physical wires. The Ethernet is a single coaxial cable. FDDI typically has a pair of fiber-optic cables for redundancy. The CI has two pairs of transmitting and receiving cables (path A transmit and receive and path B transmit and receive). For the CI, the operating system software normally sends traffic in automatic path-select mode. The port chooses the free path or, if both are free, an arbitrary path (implemented in the cables and star coupler and managed by the port).
The virtual circuit (implemented partly in the CI port or LAN port emulator driver (PEDRIVER) and partly in SCS software).
The SCS connections (implemented in system software).

C.10.2 Where Failures Occur

Failures can occur at each communication level and in each component. Failures at one level translate into failures, as described in Table C-3.

Table C-3 Port Failures
Communication Level Failures

Wires If the LAN fails or is disconnected, LAN traffic stops or is interrupted, depending on the nature of the failure. For the CI, either path A or B can fail while the virtual circuit remains intact. All traffic is directed over the remaining good path. When the wire is repaired, the repair is detected automatically by port polling, and normal operations resume on all ports.

Virtual circuit If no path works between a pair of ports, the virtual circuit fails and is closed. A path failure is discovered as follows:

For the CI, when polling fails or when attempts are made to send normal traffic, and the port reports that neither path yielded transmit success.
For the LAN, when no multicast HELLO datagram message or incoming traffic is received from another computer.

When a virtual circuit fails, every SCS connection on it is closed. The software automatically reestablishes connections when the virtual circuit is reestablished. Normally, reestablishing a virtual circuit takes several seconds after the problem is corrected.

CI port If a port fails, all virtual circuits to that port fail, and all SCS connections on those virtual circuits are closed. If the port is successfully reinitialized, virtual circuits and connections are reestablished automatically. Normally, port reinitialization and reestablishment of connections take several seconds.

LAN adapter If a LAN adapter device fails, attempts are made to restart it. If repeated attempts fail, all channels using that adapter are broken. A channel is a pair of LAN addresses, one local and one remote. If the last open channel for a virtual circuit fails, the virtual circuit is closed and the connections are broken.

SCS connection When the software protocols fail or, in some instances, when the software detects a hardware malfunction, a connection is terminated. Other connections are usually unaffected, as is the virtual circuit. Breaking of connections is also used under certain conditions as an error recovery mechanism---most commonly when there is insufficient nonpaged pool available on the computer.

Computer If a computer fails because of operator shutdown, bugcheck, or halt, all other computers in the cluster record the shutdown as failures of their virtual circuits to the port on the shut down computer.

**Table C-3 Port Failures**
Communication Level	Failures
Wires	If the LAN fails or is disconnected, LAN traffic stops or is interrupted, depending on the nature of the failure. For the CI, either path A or B can fail while the virtual circuit remains intact. All traffic is directed over the remaining good path. When the wire is repaired, the repair is detected automatically by port polling, and normal operations resume on all ports.
Virtual circuit	If no path works between a pair of ports, the virtual circuit fails and is closed. A path failure is discovered as follows: For the CI, when polling fails or when attempts are made to send normal traffic, and the port reports that neither path yielded transmit success. For the LAN, when no multicast HELLO datagram message or incoming traffic is received from another computer. When a virtual circuit fails, every SCS connection on it is closed. The software automatically reestablishes connections when the virtual circuit is reestablished. Normally, reestablishing a virtual circuit takes several seconds after the problem is corrected.
CI port	If a port fails, all virtual circuits to that port fail, and all SCS connections on those virtual circuits are closed. If the port is successfully reinitialized, virtual circuits and connections are reestablished automatically. Normally, port reinitialization and reestablishment of connections take several seconds.
LAN adapter	If a LAN adapter device fails, attempts are made to restart it. If repeated attempts fail, all channels using that adapter are broken. A channel is a pair of LAN addresses, one local and one remote. If the last open channel for a virtual circuit fails, the virtual circuit is closed and the connections are broken.
SCS connection	When the software protocols fail or, in some instances, when the software detects a hardware malfunction, a connection is terminated. Other connections are usually unaffected, as is the virtual circuit. Breaking of connections is also used under certain conditions as an error recovery mechanism---most commonly when there is insufficient nonpaged pool available on the computer.
Computer	If a computer fails because of operator shutdown, bugcheck, or halt, all other computers in the cluster record the shutdown as failures of their virtual circuits to the port on the shut down computer.

C.10.3 Verifying CI Port Functions

Before you boot in a cluster a CI connected computer that is new, just repaired, or suspected of having a problem, you should have Compaq services verify that the computer runs correctly on its own.

Contents

Index

privacy and legal statement

4477PRO_025.HTML