OpenVMS Cluster Systems

Updated: 11 December 1998

C.10.4 Verifying Virtual Circuits

To diagnose communication problems, you can invoke the Show Cluster utility using the instructions in Table C-4.

Table C-4 How to Verify Virtual Circuit States
Step 1
Action: Tailor the SHOW CLUSTER report by entering the SHOW CLUSTER command ADD CIRCUIT,CABLE_STATUS. This command adds a class of information about all the virtual circuits as seen from the computer on which you are running SHOW CLUSTER. CABLE_STATUS indicates the status of the path for the circuit from the CI interface on the local system to the CI interface on the remote system.
What to look for: Primarily, you are checking whether there is a virtual circuit in the OPEN state to the failing computer. Common causes of failure to open a virtual circuit and keep it open are the following:

Port errors on one side or the other
Cabling errors
A port set off line because of software problems
Insufficient nonpaged pool on both sides
Failure to set correct values for the SCSNODE, SCSSYSTEMID, PAMAXPORT, PANOPOLL, PASTIMOUT, and PAPOLLINTERVAL system parameters

Step 2
Action: Run SHOW CLUSTER from each active computer in the cluster to verify whether each computer's view of the failing computer is consistent with every other computer's view.
What to look for:

WHEN...	THEN...
All the active computers have a consistent view of the failing computer	The problem may be in the failing computer.
Only one of several active computers detects that the newcomer is failing	That particular computer may have a problem.

If no virtual circuit is open to the failing computer, check the bottom of the SHOW CLUSTER display:

For information about circuits to the port of the failing computer. Virtual circuits in partially open states are shown at the bottom of the display. If the circuit is shown in a state other than OPEN, communications between the local and remote ports are taking place, and the failure is probably at a higher level than in port or cable hardware.
To see whether both path A and path B to the failing port are good. The loss of one path should not prevent a computer from participating in a cluster.
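
The WHEN/THEN comparison in step 2 can be sketched in a few lines of Python. This is only an illustration: the node names and the CLOSED state string are hypothetical placeholders, the input is assumed to have been collected by hand from each computer's SHOW CLUSTER display, and the "mixed views" branch is an assumption that the table does not cover.

```python
from typing import Dict


def diagnose_views(views: Dict[str, str]) -> str:
    """Compare each active computer's view of the failing computer.

    `views` maps an observing node name to the circuit state that node
    reports for the failing computer (for example "OPEN").
    """
    not_open = sorted(node for node, state in views.items() if state != "OPEN")
    if not not_open:
        return "no observer reports a problem"
    if len(not_open) == len(views):
        # Every active computer agrees: the table points at the failing computer.
        return "consistent view: suspect the failing computer"
    if len(not_open) == 1:
        # Only one observer disagrees: the table points at that observer.
        return f"only {not_open[0]} sees a failure: suspect {not_open[0]}"
    # Not covered by the table; treated here as needing a closer look (assumption).
    return "mixed views: check ports and paths on " + ", ".join(not_open)


# Hypothetical observations gathered from three active computers.
print(diagnose_views({"NODEA": "OPEN", "NODEB": "CLOSED", "NODEC": "OPEN"}))
```

A result that names a single observer suggests examining that computer's port and cables first, before looking at the computer reported as failing.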

C.10.5 Verifying CI Cable Connections

Whenever the configuration poller finds that no virtual circuits are open and that no handshake procedures are currently opening virtual circuits, the poller analyzes its environment. It does so by using the send-loopback-datagram facility of the CI port in the following fashion:

The send-loopback-datagram facility tests the connections between the CI port and the star coupler by routing messages across them. The messages are called loopback datagrams. (The port processes other self-directed messages without using the star coupler or external cables.)
The configuration poller makes entries in the error log whenever it detects a change in the state of a circuit. Note, however, that it is possible two changed-to-failed-state messages can be entered in the log without an intervening changed-to-succeeded-state message. Such a series of entries means that the circuit state continues to be faulty.

C.10.6 Diagnosing CI Cabling Problems

The following paragraphs discuss various incorrect CI cabling configurations and the entries made in the error log when these configurations exist. Figure C-1 shows a two-computer configuration with all cables correctly connected. Figure C-2 shows a CI cluster with a pair of crossed cables.

Figure C-1 Correctly Connected Two-Computer CI Cluster

Figure C-2 Crossed CI Cable Pair

If a pair of transmitting cables or a pair of receiving cables is crossed, a message sent on TA is received on RB, and a message sent on TB is received on RA. This is a hardware error condition from which the port cannot recover. An entry is made in the error log indicating that a single pair of crossed cables exists. The entry contains the following lines:

DATA CABLE(S) CHANGE OF STATE
PATH 1. LOOPBACK HAS GONE FROM GOOD TO BAD

If this situation exists, you can correct it by reconnecting the cables properly. The cables could be misconnected in several places. The coaxial cables that connect the port boards to the bulkhead cable connectors can be crossed, or the cables can be misconnected to the bulkhead or the star coupler.

Configuration 1: The information illustrated in Figure C-2 is represented more simply in Example C-1. It shows the cables positioned as in Figure C-2, but it does not show the star coupler or the computers. The labels LOC (local) and REM (remote) indicate the pairs of transmitting (T) and receiving (R) cables on the local and remote computers, respectively.

Example C-1 Crossed Cables: Configuration 1

T   x   =   R
R   =   =   T
LOC       REM

The pair of crossed cables causes loopback datagrams to fail on the local computer but to succeed on the remote computer. Crossed pairs of transmitting cables and crossed pairs of receiving cables cause the same behavior.

Note that only an odd number of crossed cable pairs causes these problems. If an even number of cable pairs is crossed, communications succeed. An error log entry is made in some cases, however, and the contents of the entry depend on which pairs of cables are crossed.

Configuration 2: Example C-2 shows two-computer clusters with the combinations of two crossed cable pairs. These crossed pairs cause the following entry to be made in the error log of the computer that has the cables crossed:

DATA CABLE(S) CHANGE OF STATE
CABLES HAVE GONE FROM UNCROSSED TO CROSSED

Loopback datagrams succeed on both computers, and communications are possible.

Example C-2 Crossed Cables: Configuration 2

T   x   =   R        T   =   x   R
R   x   =   T        R   =   x   T
LOC       REM        LOC       REM

Configuration 3: Example C-3 shows the possible combinations of two pairs of crossed cables that cause loopback datagrams to fail on both computers in the cluster. Communications can still take place between the computers. An entry stating that cables are crossed is made in the error log of each computer.

Example C-3 Crossed Cables: Configuration 3

T   x   =   R        T   =   x   R
R   =   x   T        R   x   =   T
LOC       REM        LOC       REM

Configuration 4: Example C-4 shows the possible combinations of two pairs of crossed cables that cause loopback datagrams to fail on both computers in the cluster but that allow communications. No entry stating that cables are crossed is made in the error log of either computer.

Example C-4 Crossed Cables: Configuration 4

T   x   x   R        T   =   =   R
R   =   =   T        R   x   x   T
LOC       REM        LOC       REM

Configuration 5: Example C-5 shows the possible combinations of three pairs of crossed cables. In each case, loopback datagrams fail on the computer that has only one crossed pair of cables. Loopback datagrams succeed on the computer with both pairs crossed. No communications are possible.

Example C-5 Crossed Cables: Configuration 5

T   x   x   R        T   x   =   R        T   =   x   R        T   x   x   R
R   x   =   T        R   x   x   T        R   x   x   T        R   =   x   T
LOC       REM        LOC       REM        LOC       REM        LOC       REM

If all four cable pairs between two computers are crossed, communications succeed, loopback datagrams succeed, and no crossed-cable message entries are made in the error log. You might detect such a condition by noting error log entries made by a third computer in the cluster, but this occurs only if the third computer has one of the crossed-cable cases described.
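
The behavior described for Configurations 1 through 5 is consistent with a simple parity model, sketched below in Python. The model is inferred from the descriptions in this section and is not the CI port's actual algorithm. The class and function names are hypothetical. Each flag records whether one cable pair (local or remote, transmitting or receiving) is crossed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cabling:
    loc_t: bool = False  # local transmitting pair crossed
    loc_r: bool = False  # local receiving pair crossed
    rem_t: bool = False  # remote transmitting pair crossed
    rem_r: bool = False  # remote receiving pair crossed


def analyze(c: Cabling) -> dict:
    # A message changes path once for every crossed pair it traverses.
    loc_to_rem = c.loc_t ^ c.rem_r  # local transmit, then remote receive
    rem_to_loc = c.rem_t ^ c.loc_r  # remote transmit, then local receive
    return {
        # A loopback datagram uses a computer's own transmit and receive pairs.
        "loopback_ok_local": c.loc_t == c.loc_r,
        "loopback_ok_remote": c.rem_t == c.rem_r,
        # Assumption: traffic flows only if both directions agree on the paths.
        "communications": loc_to_rem == rem_to_loc,
        # Assumption: a crossed-cable entry is logged when both directions swap paths.
        "crossed_entry": loc_to_rem and rem_to_loc,
    }


# Configuration 1: a single crossed pair on the local transmit side.
print(analyze(Cabling(loc_t=True)))
# Configuration 4: local transmit and remote receive pairs crossed.
print(analyze(Cabling(loc_t=True, rem_r=True)))
```

Under this model, an odd number of crossed pairs always makes the two directions disagree, which matches the statement that only an odd number of crossed cable pairs prevents communications.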

C.10.7 Repairing CI Cables

This section describes some ways in which Compaq support representatives can make repairs on a running computer. This information is provided to aid system managers in scheduling repairs.

For cluster software to survive cable-checking or cable-replacement activities, you must be sure that either path A or path B is intact at all times between each port and every other port in the cluster.

For example, you can remove path A and path B in turn from a particular port to the star coupler. To make sure that the configuration poller finds a path that was previously faulty but is now operational, follow these steps:

Step Action

1 Remove path B.

2 After the poller has discovered that path B is faulty, reconnect path B.

3 Wait two poller intervals, ¹ and then take either of the following actions:

Enter the DCL command SHOW CLUSTER to make sure that the poller has reestablished path B.
Enter the DCL command SHOW CLUSTER/CONTINUOUS followed by the SHOW CLUSTER command ADD CIRCUITS, CABLE_ST.

4 Wait for SHOW CLUSTER to tell you that path B has been reestablished.

5 Remove path A.

6 After the poller has discovered that path A is faulty, reconnect path A.

7 Wait two poller intervals ¹ to make sure that the poller has reestablished path A.

¹Approximately 10 seconds at the default system parameter settings

If both paths are lost at the same time, the virtual circuits are lost between the port with the broken cables and all other ports in the cluster. This condition will in turn result in loss of SCS connections over the broken virtual circuits. However, recovery from this situation is automatic after an interruption in service on the affected computer. The length of the interruption varies, but it is approximately two poller intervals at the default system parameter settings.
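
When planning a repair, it can help to check the intended sequence on paper first. The following Python sketch tests whether a planned sequence of cable actions ever leaves a port without both path A and path B, and estimates the wait after each reconnection. The function names are hypothetical, and the 5-second poller interval is an assumption inferred from the footnote (two intervals, approximately 10 seconds at default settings).

```python
from typing import Iterable, Tuple

# Assumed default poller interval in seconds (two intervals are about 10 seconds).
POLL_INTERVAL_S = 5


def keeps_one_path(steps: Iterable[Tuple[str, str]]) -> bool:
    """Return True if path A or path B stays connected after every step.

    Each step is (action, path), where action is "remove" or "reconnect"
    and path is "A" or "B". Both paths start out connected.
    """
    connected = {"A": True, "B": True}
    for action, path in steps:
        connected[path] = action == "reconnect"
        if not (connected["A"] or connected["B"]):
            return False  # both paths down: virtual circuits would be lost
    return True


def settle_time(intervals: int = 2, poll_interval_s: int = POLL_INTERVAL_S) -> int:
    """Seconds to wait for the poller to notice a reconnected path."""
    return intervals * poll_interval_s


# The procedure in this section: service path B first, then path A.
plan = [("remove", "B"), ("reconnect", "B"), ("remove", "A"), ("reconnect", "A")]
print(keeps_one_path(plan), settle_time())
```

The check models only the physical sequence. It does not replace confirming with SHOW CLUSTER that the poller has actually reestablished a path before the other path is removed.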

C.10.8 Verifying LAN Connections

The Local Area OpenVMS Cluster Network Failure Analysis Program described in Section D.4 uses the HELLO datagram messages to continuously verify the network paths (channels) used by PEDRIVER. This verification process, combined with a physical description of the network, can:

Isolate failing network components
Group failing channels together and map them onto the physical network description
Call out the common components related to the channel failures

C.11 Analyzing Error-Log Entries for Port Devices

Monitoring events recorded in the error log can help you anticipate and avoid potential problems. From the total error count (displayed by the DCL command SHOW DEVICES device-name), you can determine whether errors are increasing. If so, you should examine the error log.
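
As a rough illustration of watching the total error count over time, the following Python sketch compares successive readings. The readings are assumed to have been noted by hand from repeated SHOW DEVICES displays. The function names and sample values are hypothetical.

```python
from typing import List, Sequence


def error_deltas(samples: Sequence[int]) -> List[int]:
    """Differences between consecutive error-count readings, oldest first."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def errors_increasing(samples: Sequence[int]) -> bool:
    """True if the error count rose between any two consecutive readings."""
    return any(delta > 0 for delta in error_deltas(samples))


# Hypothetical readings taken at regular intervals for one port device.
readings = [7, 7, 9, 11]
print(error_deltas(readings), errors_increasing(readings))
```

A rising count is only a prompt to examine the error log. The entries themselves show whether the events are informational or require action.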

C.11.1 Examine the Error Log

The DCL command ANALYZE/ERROR_LOG invokes the Error Log utility to report the contents of an error-log file.

Reference: For more information about the Error Log utility, see the OpenVMS System Management Utilities Reference Manual.

Some error-log entries are informational only while others require action.

Table C-5 Informational and Other Error-Log Entries
Error type: Informational error-log entries require no action. For example, if you shut down a computer in the cluster, all other active computers that have open virtual circuits between themselves and the computer that has been shut down make entries in their error logs. Such computers record up to three errors for the event:

Path A received no response.
Path B received no response.
The virtual circuit is being closed.

Action required? No
Purpose: These messages are normal and reflect the change of state in the circuits to the computer that has been shut down.

Error type: Other error-log entries indicate problems that degrade operation or nonfatal hardware problems. The operating system might continue to run satisfactorily under these conditions.
Action required? Yes
Purpose: Detecting these problems early is important to preventing nonfatal problems (such as loss of a single CI path) from becoming serious problems (such as loss of both paths).

C.11.2 Formats

Errors and other events on the CI, DSSI, or LAN cause port drivers to enter information in the system error log in one of two formats:

Device attention
Device-attention entries for the CI record events that, in general, are indicated by the setting of a bit in a hardware register. For the LAN, device-attention entries typically record errors on a LAN adapter device.
Logged message
Logged-message entries record the receipt of a message packet that contains erroneous data or that signals an error condition.

Sections C.11.3 and C.11.6 describe those formats.

C.11.3 CI Device-Attention Entries

Example C-6 shows device-attention entries for the CI. The left column gives the name of a device register or a memory location. The center column gives the value contained in that register or location, and the right column gives an interpretation of that value.

Example C-6 CI Device-Attention Entries

************************* ENTRY    83. **************************** (1)
ERROR SEQUENCE 10.                          LOGGED ON: SID 0150400A
DATE/TIME 15-JAN-1994 11:45:27.61                 SYS_TYPE 01010000 (2)
DEVICE ATTENTION KA780 (3)
SCS NODE: MARS
CI SUB-SYSTEM, MARS$PAA0: - PORT POWER DOWN (4)

    CNFGR           00800038
                                    ADAPTER IS CI
                                    ADAPTER POWER-DOWN
    PMCSR           000000CE
                                    MAINTENANCE TIMER DISABLE
                                    MAINTENANCE INTERRUPT ENABLE
                                    MAINTENANCE INTERRUPT FLAG
                                    PROGRAMMABLE STARTING ADDRESS
                                    UNINITIALIZED STATE
    PSR             80000001
                                    RESPONSE QUEUE AVAILABLE
                                    MAINTENANCE ERROR
    PFAR            00000000
    PESR            00000000
    PPR             03F80001

    UCB$B_ERTCNT          32 (5)
                                    50. RETRIES REMAINING
    UCB$B_ERTMAX          32 (6)
                                    50. RETRIES ALLOWABLE
    UCB$L_CHAR      0C450000
                                    SHAREABLE
                                    AVAILABLE
                                    ERROR LOGGING
                                    CAPABLE OF INPUT
                                    CAPABLE OF OUTPUT
    UCB$W_STS           0010
                                    ONLINE
    UCB$W_ERRCNT        000B (7)
                                    11. ERRORS THIS UNIT

The following table describes the device-attention entries in Example C-6.

Entry Description

(1) The first two lines are the entry heading. These lines contain the number of the entry in this error log file, the sequence number of this error, and the identification number (SID) of this computer. Each entry in the log file contains such a heading.

(2) This line contains the date, the time, and the computer type.

(3) The next two lines contain the entry type, the processor type (KA780), and the computer's SCS node name.

(4) This line shows the name of the subsystem and the device that caused the entry and the reason for the entry. The CI subsystem's device PAA0 on MARS was powered down.
The next 15 lines contain the names of hardware registers in the port, their contents, and interpretations of those contents. See the appropriate CI hardware manual for a description of all the CI port registers.

(5) The UCB$B_ERTCNT field contains the number of reinitializations that the port driver can still attempt. The difference between this value and UCB$B_ERTMAX is the number of reinitializations already attempted.

(6) The UCB$B_ERTMAX field contains the maximum number of times the port can be reinitialized by the port driver.

(7) The UCB$W_ERRCNT field contains the total number of errors that have occurred on this port since it was booted. This total includes both errors that caused reinitialization of the port and errors that did not.
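
The retry and error counters in Example C-6 are shown in hexadecimal in the center column and in decimal in the interpretation column. The following Python sketch works through that arithmetic. The function name is hypothetical, and the inputs are assumed to be the hexadecimal strings copied from an entry.

```python
def decode_retry_fields(ertcnt_hex: str, ertmax_hex: str, errcnt_hex: str) -> dict:
    """Convert the UCB retry and error counters from hexadecimal to decimal."""
    remaining = int(ertcnt_hex, 16)  # UCB$B_ERTCNT: reinitializations still allowed
    allowed = int(ertmax_hex, 16)    # UCB$B_ERTMAX: maximum reinitializations
    return {
        "retries_remaining": remaining,
        "retries_allowable": allowed,
        # Attempts already made are the difference between the two fields.
        "reinitializations_attempted": allowed - remaining,
        "errors_this_unit": int(errcnt_hex, 16),  # UCB$W_ERRCNT
    }


# Values from Example C-6: 32 (hex) is 50 decimal, 000B (hex) is 11 decimal.
print(decode_retry_fields("32", "32", "000B"))
```

For the entry shown, no reinitializations have been attempted yet, because the remaining and allowable counts are equal.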

C.11.4 Error Recovery

The CI port can recover from many errors, but not all. When an error occurs from which the CI cannot recover, the following process occurs:

Step Action

1 The port notifies the port driver.

2 The port driver logs the error and attempts to reinitialize the port.

3 If the port fails after 50 such initialization attempts, the driver takes it off line, unless the system disk is connected to the failing port or unless this computer is supposed to be a cluster member.

4 If the CI port is required for system disk access or cluster participation and all 50 reinitialization attempts have been used, then the computer bugchecks with a CIPORT-type bugcheck.

Once a CI port is off line, you can put the port back on line only by rebooting the computer.
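
The recovery steps above amount to a small decision rule, sketched here in Python. This illustrates the described behavior and is not port driver code. The function name and return strings are hypothetical.

```python
MAX_REINIT_ATTEMPTS = 50  # limit stated in the recovery steps


def recovery_outcome(failed_attempts: int, port_required: bool) -> str:
    """Describe what happens after an unrecoverable CI port error.

    `port_required` is True when the system disk is reached through the port
    or the computer is supposed to be a cluster member.
    """
    if failed_attempts < MAX_REINIT_ATTEMPTS:
        # The driver logs the error and tries to reinitialize the port.
        return "log error and reinitialize port"
    if port_required:
        # All attempts are used and the computer cannot run without the port.
        return "CIPORT-type bugcheck"
    # All attempts are used; the port stays off line until the computer reboots.
    return "port taken off line"


print(recovery_outcome(3, port_required=True))
print(recovery_outcome(50, port_required=False))
```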

C.11.5 LAN Device-Attention Entries

Example C-7 shows device-attention entries for the LAN. The left column gives the name of a device register or a memory location. The center column gives the value contained in that register or location, and the right column gives an interpretation of that value.

Example C-7 LAN Device-Attention Entry

************************* ENTRY    80. **************************** (1)
ERROR SEQUENCE 26.                          LOGGED ON: SID 08000000
DATE/TIME 15-JAN-1994 11:30:53.07                 SYS_TYPE 01010000 (2)
DEVICE ATTENTION KA630 (3)
SCS NODE: PHOBOS
NI-SCS SUB-SYSTEM, PHOBOS$PEA0: (4)
FATAL ERROR DETECTED BY DATALINK (5)

    STATUS1         0000002C (6)
    STATUS2         00000000
    DATALINK UNIT       0001 (7)
    DATALINK NAME   41515803 (8)
                    00000000
                    00000000
                    00000000
                                    DATALINK NAME = XQA1:
    REMOTE NODE     00000000 (9)
                    00000000
                    00000000
                    00000000
    REMOTE ADDR     00000000 (10)
                        0000
    LOCAL ADDR      000400AA (11)
                        4C07
                                    ETHERNET ADDR = AA-00-04-00-07-4C
    ERROR CNT           0001 (12)
                                    1. ERROR OCCURRENCES THIS ENTRY
    UCB$W_ERRCNT        0007
                                    7. ERRORS THIS UNIT

The following table describes the LAN device-attention entries in Example C-7.

Entry Description

(1) The first two lines are the entry heading. These lines contain the number of the entry in this error log file, the sequence number of this error, and the identification number (SID) of this computer. Each entry in the log file contains such a heading.

(2) This line contains the date and time and the computer type.

(3) The next two lines contain the entry type, the processor type (KA630), and the computer's SCS node name.

(4) This line shows the name of the subsystem and component that caused the entry.

(5) This line shows the reason for the entry. The LAN driver has shut down the data link because of a fatal error. The data link will be restarted automatically, if possible.

(6) STATUS1 shows the I/O completion status returned by the LAN driver. STATUS2 is the VCI event code delivered to PEDRIVER by the LAN driver. The event values and meanings are described in the following table:

Event Code Meaning

1200 Port usable

1201 Port unusable

1202 Change address

If a message transmit was involved, the status applies to that transmit.

(7) DATALINK UNIT shows the unit number of the LAN device on which the error occurred.

(8) DATALINK NAME is the name of the LAN device on which the error occurred.

(9) REMOTE NODE is the name of the remote node to which the packet was being sent. If zeros are displayed, either no remote node was available or no packet was associated with the error.

(10) REMOTE ADDR is the LAN address of the remote node to which the packet was being sent. If zeros are displayed, no packet was associated with the error.

(11) LOCAL ADDR is the LAN address of the local node.

(12) ERROR CNT. Because some errors can occur at extremely high rates, some error log entries represent more than one occurrence of an error. This field indicates how many. The errors counted occurred in the 3 seconds preceding the timestamp on the entry.
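
The hexadecimal values in Example C-7 can be related to their interpretations with a short Python sketch. It assumes that each value is stored least significant byte first and that DATALINK NAME begins with a one-byte character count followed by ASCII characters. Both assumptions are inferred from the example and are not taken from a format specification. The function names are hypothetical.

```python
import struct
from typing import Sequence


def lan_address(low_longword: int, high_word: int) -> str:
    """Format a 6-byte LAN address from the longword and word shown in the entry."""
    raw = struct.pack("<IH", low_longword, high_word)  # little-endian, no padding
    return "-".join(f"{byte:02X}" for byte in raw)


def datalink_name(longwords: Sequence[int], unit: int) -> str:
    """Decode the device name, assuming a counted ASCII string, and append the unit."""
    raw = struct.pack("<4I", *longwords)
    count = raw[0]                       # first byte: number of name characters
    name = raw[1:1 + count].decode("ascii")
    return f"{name}{unit}:"


# Values from Example C-7.
print(lan_address(0x000400AA, 0x4C07))                # AA-00-04-00-07-4C
print(datalink_name([0x41515803, 0, 0, 0], 0x0001))   # XQA1:
```

Both results match the interpretation lines in the example (ETHERNET ADDR and DATALINK NAME), which supports the byte-order assumption for this entry.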
