Document revision date: 30 March 2001
To diagnose communication problems, you can invoke the Show Cluster utility using the instructions in Table C-4.
Step | Action | What to Look for |
---|---|---|
1 | Tailor the SHOW CLUSTER report by entering the SHOW CLUSTER command ADD CIRCUIT,CABLE_STATUS. This command adds a class of information about all the virtual circuits as seen from the computer on which you are running SHOW CLUSTER. CABLE_STATUS indicates the status of the path for the circuit from the CI interface on the local system to the CI interface on the remote system. | Primarily, you are checking whether there is a virtual circuit in the OPEN state to the failing computer. Common causes of failure to open a virtual circuit and keep it open are the following: |
2 | Run SHOW CLUSTER from each active computer in the cluster to verify whether each computer's view of the failing computer is consistent with every other computer's view. | If no virtual circuit is open to the failing computer, check the bottom of the SHOW CLUSTER display: |
Whenever the configuration poller finds that no virtual circuits are open and that no handshake procedures are currently opening virtual circuits, the poller analyzes its environment. It does so by using the send-loopback-datagram facility of the CI port in the following fashion:
The following paragraphs discuss various incorrect CI cabling configurations and the entries made in the error log when these configurations exist. Figure C-1 shows a two-computer configuration with all cables correctly connected. Figure C-2 shows a CI cluster with a pair of crossed cables.
Figure C-1 Correctly Connected Two-Computer CI Cluster
Figure C-2 Crossed CI Cable Pair
If a pair of transmitting cables or a pair of receiving cables is crossed, a message sent on TA is received on RB, and a message sent on TB is received on RA. This is a hardware error condition from which the port cannot recover. An entry is made in the error log indicating that a single pair of crossed cables exists. The entry contains the following lines:
    DATA CABLE(S) CHANGE OF STATE
    PATH 1. LOOPBACK HAS GONE FROM GOOD TO BAD
If this situation exists, you can correct it by reconnecting the cables properly. The cables could be misconnected in several places. The coaxial cables that connect the port boards to the bulkhead cable connectors can be crossed, or the cables can be misconnected to the bulkhead or the star coupler.
Configuration 1: The information illustrated in Figure C-2 is represented more simply in Example C-1. It shows the cables positioned as in Figure C-2, but it does not show the star coupler or the computers. The labels LOC (local) and REM (remote) indicate the pairs of transmitting (T) and receiving (R) cables on the local and remote computers, respectively.
Example C-1 Crossed Cables: Configuration 1

    T x = R
    R = = T
    LOC REM
The pair of crossed cables causes loopback datagrams to fail on the local computer but to succeed on the remote computer. Crossed pairs of transmitting cables and crossed pairs of receiving cables cause the same behavior.
Note that only an odd number of crossed cable pairs causes these problems. If an even number of cable pairs is crossed, communications succeed. An error log entry is made in some cases, however, and the contents of the entry depend on which pairs of cables are crossed.
Configuration 2: Example C-2 shows two-computer clusters with the combinations of two crossed cable pairs. These crossed pairs cause the following entry to be made in the error log of the computer that has the cables crossed:
    DATA CABLE(S) CHANGE OF STATE
    CABLES HAVE GONE FROM UNCROSSED TO CROSSED
Loopback datagrams succeed on both computers, and communications are possible.
Example C-2 Crossed Cables: Configuration 2

    T x = R    T = x R
    R x = T    R = x T
    LOC REM    LOC REM
Configuration 3: Example C-3 shows the possible combinations of two pairs of crossed cables that cause loopback datagrams to fail on both computers in the cluster. Communications can still take place between the computers. An entry stating that cables are crossed is made in the error log of each computer.
Example C-3 Crossed Cables: Configuration 3

    T x = R    T = x R
    R = x T    R x = T
    LOC REM    LOC REM
Configuration 4: Example C-4 shows the possible combinations of two pairs of crossed cables that cause loopback datagrams to fail on both computers in the cluster but that allow communications. No entry stating that cables are crossed is made in the error log of either computer.
Example C-4 Crossed Cables: Configuration 4

    T x x R    T = = R
    R = = T    R x x T
    LOC REM    LOC REM
Configuration 5: Example C-5 shows the possible combinations of three pairs of crossed cables. In each case, loopback datagrams fail on the computer that has only one crossed pair of cables. Loopback datagrams succeed on the computer with both pairs crossed. No communications are possible.
Example C-5 Crossed Cables: Configuration 5

    T x x R    T x = R    T = x R    T x x R
    R x = T    R x x T    R x x T    R = x T
    LOC REM    LOC REM    LOC REM    LOC REM
If all four cable pairs between two computers are crossed,
communications succeed, loopback datagrams succeed, and no
crossed-cable message entries are made in the error log. You might
detect such a condition by noting error log entries made by a third
computer in the cluster, but this occurs only if the third computer has
one of the crossed-cable cases described.
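The outcomes described for Configurations 1 through 5 follow a simple parity pattern, which the following Python sketch reproduces. It is an illustrative model only, not the CI port's actual logic: the names `Port`, `loopback_ok`, `communications_ok`, and `describe` are hypothetical, and the two parity rules are assumptions chosen because they match the results stated in this section. The sketch does not try to predict which error log entries are made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """Cabling of one computer; True means that cable pair is crossed."""
    t_crossed: bool = False  # transmitting pair (TA/TB) swapped
    r_crossed: bool = False  # receiving pair (RA/RB) swapped


def loopback_ok(port: Port) -> bool:
    # Assumed rule: a loopback datagram leaves on the port's own T pair and
    # returns on its own R pair, so it returns on the path it was sent on
    # only when both pairs are crossed or neither is.
    return port.t_crossed == port.r_crossed


def communications_ok(loc: Port, rem: Port) -> bool:
    # Assumed rule, taken from the text: only an odd number of crossed
    # cable pairs prevents communication.
    crossed = sum([loc.t_crossed, loc.r_crossed, rem.t_crossed, rem.r_crossed])
    return crossed % 2 == 0


def describe(loc: Port, rem: Port):
    """Return (loopback on LOC, loopback on REM, communications possible)."""
    return loopback_ok(loc), loopback_ok(rem), communications_ok(loc, rem)


# Configuration 1: only the local transmitting pair is crossed.
print(describe(Port(t_crossed=True), Port()))
# Configuration 5 (first case): both local pairs and the remote R pair crossed.
print(describe(Port(t_crossed=True, r_crossed=True), Port(r_crossed=True)))
```

Under these assumptions, Configuration 1 yields a loopback failure on the local computer only and no communication, and the case with all four pairs crossed yields success on every check, as the text states.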
C.10.7 Repairing CI Cables
This section describes some ways in which Compaq support representatives can make repairs on a running computer. This information is provided to aid system managers in scheduling repairs.
For cluster software to survive cable-checking activities or cable-replacement activities, you must be sure that either path A or path B is intact at all times between each port and every other port in the cluster.
For example, you can remove path A and path B in turn from a particular port to the star coupler. To make sure that the configuration poller finds a path that was previously faulty but is now operational, follow these steps:
Step | Action |
---|---|
1 | Remove path B. |
2 | After the poller has discovered that path B is faulty, reconnect path B. |
3 | Wait two poller intervals, and then take either of the following actions: |
4 | Wait for SHOW CLUSTER to tell you that path B has been reestablished. |
5 | Remove path A. |
6 | After the poller has discovered that path A is faulty, reconnect path A. |
7 | Wait two poller intervals to make sure that the poller has reestablished path A. |
If both paths are lost at the same time, the virtual circuits are lost
between the port with the broken cables and all other ports in the
cluster. This condition will in turn result in loss of SCS connections
over the broken virtual circuits. However, recovery from this situation
is automatic after an interruption in service on the affected computer.
The length of the interruption varies, but it is approximately two
poller intervals at the default system parameter settings.
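The reason for waiting between steps can be sketched as a small check on a planned sequence of cable operations. This is a simplified model with hypothetical names (`first_unsafe_step`, the action strings, and the `pending` state are not part of OpenVMS). It assumes that a reconnected path does not count as intact until a wait step has given the poller time to reestablish it.

```python
def first_unsafe_step(plan):
    """Return the 1-based number of the first step that leaves no usable
    path between the port and the star coupler, or None if the plan is safe.

    plan is a list of (action, path) tuples, where action is "remove",
    "reconnect", or "wait" and path is "A", "B", or None for "wait".
    """
    state = {"A": "up", "B": "up"}
    for number, (action, path) in enumerate(plan, start=1):
        if action == "remove":
            state[path] = "down"
        elif action == "reconnect":
            # Cabled again, but not yet rediscovered by the poller.
            state[path] = "pending"
        elif action == "wait":
            # Assumption: waiting lets the poller reestablish pending paths.
            for name, value in state.items():
                if value == "pending":
                    state[name] = "up"
        else:
            raise ValueError(f"unknown action: {action!r}")
        if "up" not in state.values():
            return number
    return None


documented = [("remove", "B"), ("reconnect", "B"), ("wait", None),
              ("remove", "A"), ("reconnect", "A"), ("wait", None)]
hasty = [("remove", "B"), ("reconnect", "B"), ("remove", "A")]

print(first_unsafe_step(documented))  # None: one path is usable at every step
print(first_unsafe_step(hasty))       # 3: path A removed before B is back
```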
C.10.8 Verifying LAN Connections
The Local Area OpenVMS Cluster Network Failure Analysis Program described in Section D.4 uses the HELLO datagram messages to verify continuously the network paths (channels) used by PEDRIVER. This verification process, combined with a physical description of the network, can:
Monitoring events recorded in the error log can help you anticipate and
avoid potential problems. From the total error count (displayed by the
DCL command SHOW DEVICES device-name), you can determine
whether errors are increasing. If so, you should examine the error log.
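As a minimal sketch of this check, suppose you note the error count for a device each time you display it and keep the readings in a list, oldest first. The function names and the sample readings below are hypothetical; nothing here reads the counts from the system.

```python
def error_count_deltas(samples):
    """Return the change between each pair of consecutive readings."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


def errors_increasing(samples):
    """Return True if any reading is higher than the one before it."""
    return any(delta > 0 for delta in error_count_deltas(samples))


readings = [11, 11, 14]  # hypothetical counts noted on three occasions
print(error_count_deltas(readings))  # [0, 3]
print(errors_increasing(readings))   # True: examine the error log
```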
C.11.1 Examine the Error Log
The DCL command ANALYZE/ERROR_LOG invokes the Error Log utility to report the contents of an error-log file.
Reference: For more information about the Error Log utility, see the OpenVMS System Management Utilities Reference Manual.
Some error-log entries are informational only while others require action.
Error Type | Action Required? | Purpose |
---|---|---|
Informational error-log entries require no action. For example, if you shut down a computer in the cluster, all other active computers that have open virtual circuits between themselves and the computer that has been shut down make entries in their error logs. Such computers record up to three errors for the event: | No | These messages are normal and reflect the change of state in the circuits to the computer that has been shut down. |
Other error-log entries indicate problems that degrade operation or nonfatal hardware problems. The operating system might continue to run satisfactorily under these conditions. | Yes | Detecting these problems early is important to preventing nonfatal problems (such as loss of a single CI path) from becoming serious problems (such as loss of both paths). |
Errors and other events on the CI, DSSI, or LAN cause port drivers to enter information in the system error log in one of two formats:
Sections C.11.3 and C.11.6 describe those formats.
C.11.3 CI Device-Attention Entries
Example C-6 shows device-attention entries for the CI. The left column gives the name of a device register or a memory location. The center column gives the value contained in that register or location, and the right column gives an interpretation of that value.
Example C-6 CI Device-Attention Entries

    ************************* ENTRY 83. **************************** (1)
    ERROR SEQUENCE 10.                    LOGGED ON: SID 0150400A
    DATE/TIME 15-JAN-1994 11:45:27.61     SYS_TYPE 01010000 (2)
    DEVICE ATTENTION KA780 (3)
    SCS NODE: MARS
    CI SUB-SYSTEM, MARS$PAA0: - PORT POWER DOWN (4)

    CNFGR         00800038
                            ADAPTER IS CI
                            ADAPTER POWER-DOWN
    PMCSR         000000CE
                            MAINTENANCE TIMER DISABLE
                            MAINTENANCE INTERRUPT ENABLE
                            MAINTENANCE INTERRUPT FLAG
                            PROGRAMMABLE STARTING ADDRESS
                            UNINITIALIZED STATE
    PSR           80000001
                            RESPONSE QUEUE AVAILABLE
                            MAINTENANCE ERROR
    PFAR          00000000
    PESR          00000000
    PPR           03F80001
    UCB$B_ERTCNT        32 (5)
                            50. RETRIES REMAINING
    UCB$B_ERTMAX        32 (6)
                            50. RETRIES ALLOWABLE
    UCB$L_CHAR    0C450000
                            SHAREABLE
                            AVAILABLE
                            ERROR LOGGING
                            CAPABLE OF INPUT
                            CAPABLE OF OUTPUT
    UCB$W_STS         0010
                            ONLINE
    UCB$W_ERRCNT      000B (7)
                            11. ERRORS THIS UNIT
The following table describes the device-attention entries in Example C-6.
Entry | Description |
---|---|
(1) | The first two lines are the entry heading. These lines contain the number of the entry in this error log file, the sequence number of this error, and the identification number (SID) of this computer. Each entry in the log file contains such a heading. |
(2) | This line contains the date, the time, and the computer type. |
(3) | The next two lines contain the entry type, the processor type (KA780), and the computer's SCS node name. |
(4) | This line shows the name of the subsystem and the device that caused the entry and the reason for the entry. The CI subsystem's device PAA0 on MARS was powered down. The next 15 lines contain the names of hardware registers in the port, their contents, and interpretations of those contents. See the appropriate CI hardware manual for a description of all the CI port registers. |
(5) | The UCB$B_ERTCNT field contains the number of reinitializations that the port driver can still attempt. The difference between this value and UCB$B_ERTMAX is the number of reinitializations already attempted. |
(6) | The UCB$B_ERTMAX field contains the maximum number of times the port can be reinitialized by the port driver. |
(7) | The UCB$W_ERRCNT field contains the total number of errors that have occurred on this port since it was booted. This total includes both errors that caused reinitialization of the port and errors that did not. |
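The register values in Example C-6 are hexadecimal, and the interpretation column gives the decimal equivalent. The following Python sketch works through the arithmetic for entries (5), (6), and (7). The function name `reinit_attempts_used` is hypothetical, and the values are the ones shown in the example.

```python
def reinit_attempts_used(ertcnt_hex, ertmax_hex):
    """Return the number of reinitializations already attempted, computed as
    UCB$B_ERTMAX minus UCB$B_ERTCNT (both given as hexadecimal strings)."""
    return int(ertmax_hex, 16) - int(ertcnt_hex, 16)


retries_remaining = int("32", 16)    # UCB$B_ERTCNT from Example C-6
retries_allowable = int("32", 16)    # UCB$B_ERTMAX from Example C-6
errors_this_unit = int("000B", 16)   # UCB$W_ERRCNT from Example C-6

print(retries_remaining, retries_allowable)  # 50 50
print(reinit_attempts_used("32", "32"))      # 0
print(errors_this_unit)                      # 11
```

In this entry no reinitialization has been attempted yet, even though the port has recorded 11 errors, because the error total also includes errors that did not cause reinitialization.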
The CI port can recover from many errors, but not all. When an error occurs from which the CI cannot recover, the following process occurs:
Step | Action |
---|---|
1 | The port notifies the port driver. |
2 | The port driver logs the error and attempts to reinitialize the port. |
3 | If the port fails after 50 such initialization attempts, the driver takes it off line, unless the system disk is connected to the failing port or unless this computer is supposed to be a cluster member. |
4 | If the CI port is required for system disk access or cluster participation and all 50 reinitialization attempts have been used, then the computer bugchecks with a CIPORT-type bugcheck. |
Once a CI port is off line, you can put the port back on line only by
rebooting the computer.
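The decision made in steps 3 and 4 can be summarized in a short Python sketch. The function and its return strings are hypothetical labels for the outcomes described above, not driver code, and the sketch assumes that the limit is the 50 attempts stated in the table.

```python
MAX_REINIT_ATTEMPTS = 50  # limit stated in step 3


def port_failure_outcome(attempts_used, system_disk_on_port, cluster_member):
    """Return what happens after an error from which the CI port cannot
    recover, given the number of reinitialization attempts already used."""
    if attempts_used < MAX_REINIT_ATTEMPTS:
        # Step 2: the driver logs the error and tries again.
        return "reinitialize"
    if system_disk_on_port or cluster_member:
        # Step 4: the port is required, so the computer bugchecks.
        return "CIPORT bugcheck"
    # Step 3: the port is not required, so the driver takes it off line.
    return "off line"


print(port_failure_outcome(10, False, False))  # reinitialize
print(port_failure_outcome(50, False, False))  # off line
print(port_failure_outcome(50, True, False))   # CIPORT bugcheck
```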
C.11.5 LAN Device-Attention Entries
Example C-7 shows device-attention entries for the LAN. The left column gives the name of a device register or a memory location. The center column gives the value contained in that register or location, and the right column gives an interpretation of that value.
Example C-7 LAN Device-Attention Entry

    ************************* ENTRY 80. **************************** (1)
    ERROR SEQUENCE 26.                    LOGGED ON: SID 08000000
    DATE/TIME 15-JAN-1994 11:30:53.07     SYS_TYPE 01010000 (2)
    DEVICE ATTENTION KA630 (3)
    SCS NODE: PHOBOS
    NI-SCS SUB-SYSTEM, PHOBOS$PEA0: (4)
    FATAL ERROR DETECTED BY DATALINK (5)

    STATUS1       0000002C (6)
    STATUS2       00000000
    DATALINK UNIT     0001 (7)
    DATALINK NAME 41515803 (8)
                  00000000
                  00000000
                  00000000
                            DATALINK NAME = XQA1:
    REMOTE NODE   00000000 (9)
                  00000000
                  00000000
                  00000000
    REMOTE ADDR   00000000 (10)
                      0000
    LOCAL ADDR    000400AA (11)
                      4C07
                            ETHERNET ADDR = AA-00-04-00-07-4C
    ERROR CNT         0001 (12)
                            1. ERROR OCCURRENCES THIS ENTRY
    UCB$W_ERRCNT      0007
                            7. ERRORS THIS UNIT
The following table describes the LAN device-attention entries in Example C-7.
Entry | Description |
---|---|
(1) | The first two lines are the entry heading. These lines contain the number of the entry in this error log file, the sequence number of this error, and the identification number (SID) of this computer. Each entry in the log file contains such a heading. |
(2) | This line contains the date and time and the computer type. |
(3) | The next two lines contain the entry type, the processor type (KA630), and the computer's SCS node name. |
(4) | This line shows the name of the subsystem and component that caused the entry. |
(5) | This line shows the reason for the entry. The LAN driver has shut down the data link because of a fatal error. The data link will be restarted automatically, if possible. |
(6) | STATUS1 shows the I/O completion status returned by the LAN driver. STATUS2 is the VCI event code delivered to PEDRIVER by the LAN driver. The event values and meanings are described in the following table: If a message transmit was involved, the status applies to that transmit. |
(7) | DATALINK UNIT shows the unit number of the LAN device on which the error occurred. |
(8) | DATALINK NAME is the name of the LAN device on which the error occurred. |
(9) | REMOTE NODE is the name of the remote node to which the packet was being sent. If zeros are displayed, either no remote node was available or no packet was associated with the error. |
(10) | REMOTE ADDR is the LAN address of the remote node to which the packet was being sent. If zeros are displayed, no packet was associated with the error. |
(11) | LOCAL ADDR is the LAN address of the local node. |
(12) | ERROR CNT. Because some errors can occur at extremely high rates, some error log entries represent more than one occurrence of an error. This field indicates how many. The errors counted occurred in the 3 seconds preceding the timestamp on the entry. |
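The hexadecimal values in Example C-7 can be related to their interpretations with a short Python sketch. It assumes that each value is printed with its most significant byte first, so the bytes must be reversed to obtain memory order, and that DATALINK NAME is a counted ASCII string whose first byte is the length. Both assumptions are inferred from the values in the example, not taken from a format specification, and the function names are hypothetical.

```python
def words_to_bytes(hex_words):
    """Convert printed hex values to bytes in memory order by reversing
    the bytes of each value (assumes least significant byte is stored first)."""
    out = bytearray()
    for word in hex_words:
        out += bytes.fromhex(word)[::-1]
    return bytes(out)


def decode_lan_address(hex_words):
    """Format the address bytes as hyphen-separated hexadecimal pairs."""
    return "-".join(f"{byte:02X}" for byte in words_to_bytes(hex_words))


def decode_counted_name(hex_words):
    """Decode a counted ASCII string: first byte is the length (assumed)."""
    raw = words_to_bytes(hex_words)
    length = raw[0]
    return raw[1:1 + length].decode("ascii")


local_addr = decode_lan_address(["000400AA", "4C07"])
name = decode_counted_name(["41515803", "00000000", "00000000", "00000000"])
unit = int("0001", 16)  # DATALINK UNIT

print(local_addr)         # AA-00-04-00-07-4C
print(f"{name}{unit}:")   # XQA1:
```

Both results match the interpretation lines printed in the example entry, which is the only check this sketch offers on its assumptions.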