Document revision date: 15 July 2002

OpenVMS Cluster Systems



C.10.4 Verifying Virtual Circuits

To diagnose communication problems, you can invoke the Show Cluster utility using the instructions in Table C-4.

Table C-4 How to Verify Virtual Circuit States

Step 1
Action: Tailor the SHOW CLUSTER report by entering the SHOW CLUSTER command ADD CIRCUIT,CABLE_STATUS. This command adds a class of information about all the virtual circuits as seen from the computer on which you are running SHOW CLUSTER. CABLE_STATUS indicates the status of the path for the circuit from the CI interface on the local system to the CI interface on the remote system.
What to look for: Primarily, you are checking whether there is a virtual circuit in the OPEN state to the failing computer. Common causes of failure to open a virtual circuit and keep it open are the following:
  • Port errors on one side or the other
  • Cabling errors
  • A port set off line because of software problems
  • Insufficient nonpaged pool on both sides
  • Failure to set correct values for the SCSNODE, SCSSYSTEMID, PAMAXPORT, PANOPOLL, PASTIMOUT, and PAPOLLINTERVAL system parameters

Step 2
Action: Run SHOW CLUSTER from each active computer in the cluster to verify whether each computer's view of the failing computer is consistent with every other computer's view.
What to look for:
  WHEN all the active computers have a consistent view of the failing computer, THEN the problem may be in the failing computer.
  WHEN only one of several active computers detects that the newcomer is failing, THEN that particular computer may have a problem.
If no virtual circuit is open to the failing computer, check the bottom of the SHOW CLUSTER display:
  • For information about circuits to the port of the failing computer. Virtual circuits in partially open states are shown at the bottom of the display. If the circuit is shown in a state other than OPEN, communications between the local and remote ports are taking place, and the failure is probably at a higher level than in port or cable hardware.
  • To see whether both path A and path B to the failing port are good. The loss of one path should not prevent a computer from participating in a cluster.
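
The WHEN/THEN comparison in step 2 can be sketched as a small decision helper. This is only an illustration in Python, not part of the Show Cluster utility: the function name `diagnose_views` and the dictionary input are hypothetical, and the node names are borrowed from the examples later in this appendix.

```python
def diagnose_views(views):
    """Apply the step 2 WHEN/THEN rules to a set of observations.

    views maps the name of each active computer to True if that computer
    shows an OPEN virtual circuit to the suspect computer, else False.
    (Hypothetical input format; gather it by running SHOW CLUSTER on each node.)
    """
    if len(views) < 2:
        raise ValueError("need the views of at least two active computers")
    reporting_failure = sorted(node for node, is_open in views.items() if not is_open)
    if not reporting_failure:
        return "no active computer reports a failure"
    if len(reporting_failure) == len(views):
        # Every view agrees that the circuit is not open.
        return "the problem may be in the failing computer"
    if len(reporting_failure) == 1:
        # A single dissenting view points at the observer, not the target.
        return f"{reporting_failure[0]} may have a problem"
    return "inconclusive: views are mixed"


print(diagnose_views({"MARS": False, "PHOBOS": True, "DEIMOS": True}))
```

A mixed result (several, but not all, computers reporting a failure) is not covered by the table, so the sketch reports it as inconclusive rather than guessing.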

C.10.5 Verifying CI Cable Connections

Whenever the configuration poller finds that no virtual circuits are open and that no handshake procedures are currently opening virtual circuits, the poller analyzes its environment. It does so by using the send-loopback-datagram facility of the CI port in the following fashion:

  1. The send-loopback-datagram facility tests the connections between the CI port and the star coupler by routing messages across them. The messages are called loopback datagrams. (The port processes other self-directed messages without using the star coupler or external cables.)
  2. The configuration poller makes entries in the error log whenever it detects a change in the state of a circuit. Note, however, that two changed-to-failed-state messages can be entered in the log without an intervening changed-to-succeeded-state message. Such a series of entries means that the circuit state continues to be faulty.

C.10.6 Diagnosing CI Cabling Problems

The following paragraphs discuss various incorrect CI cabling configurations and the entries made in the error log when these configurations exist. Figure C-1 shows a two-computer configuration with all cables correctly connected. Figure C-2 shows a CI cluster with a pair of crossed cables.

Figure C-1 Correctly Connected Two-Computer CI Cluster


Figure C-2 Crossed CI Cable Pair


If a pair of transmitting cables or a pair of receiving cables is crossed, a message sent on TA is received on RB, and a message sent on TB is received on RA. This is a hardware error condition from which the port cannot recover. An entry is made in the error log indicating that a single pair of crossed cables exists. The entry contains the following lines:


DATA CABLE(S) CHANGE OF STATE 
PATH  1.  LOOPBACK HAS GONE FROM GOOD TO BAD 

If this situation exists, you can correct it by reconnecting the cables properly. The cables could be misconnected in several places. The coaxial cables that connect the port boards to the bulkhead cable connectors can be crossed, or the cables can be misconnected to the bulkhead or the star coupler.

Configuration 1: The information illustrated in Figure C-2 is represented more simply in Example C-1. It shows the cables positioned as in Figure C-2, but it does not show the star coupler or the computers. The labels LOC (local) and REM (remote) indicate the pairs of transmitting (T) and receiving (R) cables on the local and remote computers, respectively.

Example C-1 Crossed Cables: Configuration 1

T x   = R 
 
R =   = T 
 
LOC   REM 

The pair of crossed cables causes loopback datagrams to fail on the local computer but to succeed on the remote computer. Crossed pairs of transmitting cables and crossed pairs of receiving cables cause the same behavior.

Note that only an odd number of crossed cable pairs causes these problems. If an even number of cable pairs is crossed, communications succeed. An error log entry is made in some cases, however, and the contents of the entry depend on which pairs of cables are crossed.

Configuration 2: Example C-2 shows two-computer clusters with the combinations of two crossed cable pairs. These crossed pairs cause the following entry to be made in the error log of the computer that has the cables crossed:


DATA CABLE(S) CHANGE OF STATE 
CABLES HAVE GONE FROM UNCROSSED TO CROSSED 

Loopback datagrams succeed on both computers, and communications are possible.

Example C-2 Crossed Cables: Configuration 2

T x   = R        T =   x R 
 
R x   = T        R =   x T 
 
LOC   REM        LOC   REM 

Configuration 3: Example C-3 shows the possible combinations of two pairs of crossed cables that cause loopback datagrams to fail on both computers in the cluster. Communications can still take place between the computers. An entry stating that cables are crossed is made in the error log of each computer.

Example C-3 Crossed Cables: Configuration 3

T x   = R        T =   x R 
 
R =   x T        R x   = T 
 
LOC   REM        LOC   REM 

Configuration 4: Example C-4 shows the possible combinations of two pairs of crossed cables that cause loopback datagrams to fail on both computers in the cluster but that allow communications. No entry stating that cables are crossed is made in the error log of either computer.

Example C-4 Crossed Cables: Configuration 4

T x   x R        T =   = R 
 
R =   = T        R x   x T 
 
LOC   REM        LOC   REM 

Configuration 5: Example C-5 shows the possible combinations of three pairs of crossed cables. In each case, loopback datagrams fail on the computer that has only one crossed pair of cables. Loopback datagrams succeed on the computer with both pairs crossed. No communications are possible.

Example C-5 Crossed Cables: Configuration 5

T x   x R        T x   = R        T =   x R        T x   x R 
 
R x   = T        R x   x T        R x   x T        R =   x T 
 
LOC   REM        LOC   REM        LOC   REM        LOC   REM 

If all four cable pairs between two computers are crossed, communications succeed, loopback datagrams succeed, and no crossed-cable message entries are made in the error log. You might detect such a condition by noting error log entries made by a third computer in the cluster, but this occurs only if the third computer has one of the crossed-cable cases described.
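
The behavior described for these configurations follows a parity pattern, which can be sketched in Python. This is a simplified model inferred from the descriptions above, not a description of the port hardware: it assumes that loopback succeeds on a computer when an even number of its own cable pairs is crossed, and that communications succeed when the total number of crossed pairs is even. The function name `predict` is hypothetical, and the model does not try to predict which error log entries are made.

```python
def predict(loc_t, loc_r, rem_t, rem_r):
    """Predict loopback and communication results for one cabling layout.

    Each argument is True when that cable pair is crossed:
    local transmit, local receive, remote transmit, remote receive.
    Assumption: results depend only on the parity of crossed pairs.
    """
    loc_crossed = int(loc_t) + int(loc_r)
    rem_crossed = int(rem_t) + int(rem_r)
    return {
        "loopback_ok_local": loc_crossed % 2 == 0,
        "loopback_ok_remote": rem_crossed % 2 == 0,
        "communications_ok": (loc_crossed + rem_crossed) % 2 == 0,
    }


# Configuration 1: only the local transmit pair is crossed.
print(predict(True, False, False, False))
# Configuration 3 (first diagram): the transmit pair is crossed on each computer.
print(predict(True, False, True, False))
# All four pairs crossed.
print(predict(True, True, True, True))
```

Under these assumptions the model agrees with each configuration described in this section, including the case in which all four pairs are crossed and no symptom is visible from either computer.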

C.10.7 Repairing CI Cables

This section describes some ways in which Compaq support representatives can make repairs on a running computer. This information is provided to aid system managers in scheduling repairs.

For cluster software to survive cable-checking activities or cable-replacement activities, you must be sure that either path A or path B is intact at all times between each port and every other port in the cluster.

For example, you can remove path A and path B in turn from a particular port to the star coupler. To make sure that the configuration poller finds a path that was previously faulty but is now operational, follow these steps:
Step Action
1 Remove path B.
2 After the poller has discovered that path B is faulty, reconnect path B.
3 Wait two poller intervals [1], and then take either of the following actions:
  • Enter the DCL command SHOW CLUSTER to make sure that the poller has reestablished path B.
  • Enter the DCL command SHOW CLUSTER/CONTINUOUS followed by the SHOW CLUSTER command ADD CIRCUITS, CABLE_ST.
4 Wait for SHOW CLUSTER to tell you that path B has been reestablished.
5 Remove path A.
6 After the poller has discovered that path A is faulty, reconnect path A.
7 Wait two poller intervals [1] to make sure that the poller has reestablished path A.

[1] Approximately 10 seconds at the default system parameter settings

If both paths are lost at the same time, the virtual circuits are lost between the port with the broken cables and all other ports in the cluster. This condition will in turn result in loss of SCS connections over the broken virtual circuits. However, recovery from this situation is automatic after an interruption in service on the affected computer. The length of the interruption varies, but it is approximately two poller intervals at the default system parameter settings.
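
The rule that one path must stay intact during repairs can be checked against a planned sequence of cable operations. The following Python sketch is illustrative only: the function `plan_is_safe` and the action names `remove`, `reconnect`, and `verify` are hypothetical, and `verify` stands for waiting two poller intervals and confirming with SHOW CLUSTER that the path has been reestablished. The sketch assumes that a reconnected path does not count as usable until it has been verified.

```python
def plan_is_safe(steps):
    """Return True if at least one path stays usable after every step.

    steps is a list of (action, path) tuples, where action is "remove",
    "reconnect", or "verify" and path is "A" or "B".
    """
    usable = {"A": True, "B": True}
    awaiting_verify = set()  # paths reconnected but not yet confirmed
    for action, path in steps:
        if path not in usable:
            raise ValueError(f"unknown path: {path}")
        if action == "remove":
            usable[path] = False
            awaiting_verify.discard(path)
        elif action == "reconnect":
            awaiting_verify.add(path)
        elif action == "verify":
            if path in awaiting_verify:
                usable[path] = True
                awaiting_verify.discard(path)
        else:
            raise ValueError(f"unknown action: {action}")
        if not any(usable.values()):
            return False  # both paths down: virtual circuits would be lost
    return True


documented_plan = [
    ("remove", "B"), ("reconnect", "B"), ("verify", "B"),
    ("remove", "A"), ("reconnect", "A"), ("verify", "A"),
]
print(plan_is_safe(documented_plan))
```

A plan that removes path A before path B has been verified fails the check, which corresponds to the loss of both paths described above.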

C.10.8 Verifying LAN Connections

The Local Area OpenVMS Cluster Network Failure Analysis Program described in Section D.4 uses the HELLO datagram messages to verify continuously the network paths (channels) used by PEDRIVER. This verification process, combined with physical description of the network, can:

C.11 Analyzing Error-Log Entries for Port Devices

Monitoring events recorded in the error log can help you anticipate and avoid potential problems. From the total error count (displayed by the DCL command SHOW DEVICES device-name), you can determine whether errors are increasing. If so, you should examine the error log.
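
One way to apply this check is to record the total error count at intervals and compare the readings. The Python sketch below assumes that you have collected the counts yourself (for example, by noting the value that SHOW DEVICES displays); the function name `errors_increasing` is hypothetical.

```python
def errors_increasing(samples):
    """Return True if the error count grew over the sampling window.

    samples is a list of total error counts for one device,
    oldest reading first.
    """
    if len(samples) < 2:
        raise ValueError("need at least two readings to compare")
    return samples[-1] > samples[0]


readings = [7, 7, 11]  # hypothetical readings taken over time
if errors_increasing(readings):
    print("Error count is rising; examine the error log.")
```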

C.11.1 Examine the Error Log

The DCL command ANALYZE/ERROR_LOG invokes the Error Log utility to report the contents of an error-log file.

Reference: For more information about the Error Log utility, see the OpenVMS System Management Utilities Reference Manual.

Some error-log entries are informational only while others require action.

Table C-5 Informational and Other Error-Log Entries

Error Type: Informational error-log entries require no action. For example, if you shut down a computer in the cluster, all other active computers that have open virtual circuits between themselves and the computer that has been shut down make entries in their error logs. Such computers record up to three errors for the event:
  • Path A received no response.
  • Path B received no response.
  • The virtual circuit is being closed.
Action Required? No
Purpose: These messages are normal and reflect the change of state in the circuits to the computer that has been shut down.

Error Type: Other error-log entries indicate problems that degrade operation or nonfatal hardware problems. The operating system might continue to run satisfactorily under these conditions.
Action Required? Yes
Purpose: Detecting these problems early is important to preventing nonfatal problems (such as loss of a single CI path) from becoming serious problems (such as loss of both paths).

C.11.2 Formats

Errors and other events on the CI, DSSI, or LAN cause port drivers to enter information in the system error log in one of two formats:

Sections C.11.3 and C.11.6 describe those formats.

C.11.3 CI Device-Attention Entries

Example C-6 shows device-attention entries for the CI. The left column gives the name of a device register or a memory location. The center column gives the value contained in that register or location, and the right column gives an interpretation of that value.

Example C-6 CI Device-Attention Entries

************************* ENTRY    83. **************************** (1)
ERROR SEQUENCE 10.                     LOGGED ON:      SID 0150400A 
DATE/TIME 15-JAN-1994 11:45:27.61                 SYS_TYPE 01010000 (2)
DEVICE ATTENTION    KA780                                           (3)
                    SCS NODE: MARS 
 
CI SUB-SYSTEM, MARS$PAA0: - PORT POWER DOWN                         (4)
 
      CNFGR           00800038 
                                      ADAPTER IS CI 
                                      ADAPTER POWER-DOWN 
      PMCSR           000000CE 
                                      MAINTENANCE TIMER DISABLE 
                                      MAINTENANCE INTERRUPT ENABLE 
                                      MAINTENANCE INTERRUPT FLAG 
                                      PROGRAMMABLE STARTING ADDRESS 
                                      UNINITIALIZED STATE 
      PSR             80000001 
                                      RESPONSE QUEUE AVAILABLE 
                                      MAINTENANCE ERROR 
      PFAR            00000000 
      PESR            00000000 
      PPR             03F80001 
 
      UCB$B_ERTCNT          32                                      (5)
                                      50. RETRIES REMAINING 
      UCB$B_ERTMAX          32                                      (6)
                                      50. RETRIES ALLOWABLE 
      UCB$L_CHAR      0C450000 
                                      SHAREABLE 
                                      AVAILABLE 
                                      ERROR LOGGING 
                                      CAPABLE OF INPUT 
                                      CAPABLE OF OUTPUT 
      UCB$W_STS           0010 
                                      ONLINE 
      UCB$W_ERRCNT        000B                                      (7)
                                      11. ERRORS THIS UNIT 
 

The following table describes the device-attention entries in Example C-6.
Entry Description
(1) The first two lines are the entry heading. These lines contain the number of the entry in this error log file, the sequence number of this error, and the identification number (SID) of this computer. Each entry in the log file contains such a heading.
(2) This line contains the date, the time, and the computer type.
(3) The next two lines contain the entry type, the processor type (KA780), and the computer's SCS node name.
(4) This line shows the name of the subsystem and the device that caused the entry and the reason for the entry. The CI subsystem's device PAA0 on MARS was powered down.

The next 15 lines contain the names of hardware registers in the port, their contents, and interpretations of those contents. See the appropriate CI hardware manual for a description of all the CI port registers.

(5) The UCB$B_ERTCNT field contains the number of reinitializations that the port driver can still attempt. The difference between this value and UCB$B_ERTMAX is the number of reinitializations already attempted.
(6) The UCB$B_ERTMAX field contains the maximum number of times the port can be reinitialized by the port driver.
(7) The UCB$W_ERRCNT field contains the total number of errors that have occurred on this port since it was booted. This total includes both errors that caused reinitialization of the port and errors that did not.
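
The arithmetic described for entries (5) through (7) can be sketched in Python. The values in the listing are hexadecimal, and the interpretation column gives the decimal equivalent. The function name `retry_summary` is hypothetical; the sketch takes the field values as they appear in the center column of Example C-6.

```python
def retry_summary(ertcnt_hex, ertmax_hex, errcnt_hex):
    """Convert the UCB retry and error fields from hexadecimal text."""
    remaining = int(ertcnt_hex, 16)   # UCB$B_ERTCNT: retries remaining
    allowed = int(ertmax_hex, 16)     # UCB$B_ERTMAX: retries allowable
    return {
        "retries_remaining": remaining,
        "retries_allowed": allowed,
        "reinitializations_attempted": allowed - remaining,
        "errors_this_unit": int(errcnt_hex, 16),  # UCB$W_ERRCNT
    }


# Values taken from Example C-6.
print(retry_summary("32", "32", "000B"))
```

For the entry in Example C-6, both retry fields are hexadecimal 32 (decimal 50), so no reinitializations have yet been attempted, and the error count of hexadecimal 000B is decimal 11.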

C.11.4 Error Recovery

The CI port can recover from many errors, but not all. When an error occurs from which the CI cannot recover, the following process occurs:
Step Action
1 The port notifies the port driver.
2 The port driver logs the error and attempts to reinitialize the port.
3 If the port fails after 50 such reinitialization attempts, the driver takes it off line, unless the system disk is connected to the failing port or unless this computer is supposed to be a cluster member.
4 If the CI port is required for system disk access or cluster participation and all 50 reinitialization attempts have been used, then the computer bugchecks with a CIPORT-type bugcheck.

Once a CI port is off line, you can put the port back on line only by rebooting the computer.
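
The recovery process can be summarized as a small decision function. The Python sketch below is an illustration of the steps in this section, not the port driver's logic: the function name `recovery_outcome` is hypothetical, and it assumes that each unrecoverable error consumes one reinitialization attempt and that the count of attempts is not reset between errors.

```python
MAX_REINIT_ATTEMPTS = 50  # limit stated in the recovery steps


def recovery_outcome(unrecoverable_errors, port_required):
    """Describe what happens after a number of unrecoverable port errors.

    port_required is True when the system disk is reached through this port
    or the computer is supposed to be a cluster member.
    """
    if unrecoverable_errors < 0:
        raise ValueError("error count cannot be negative")
    if unrecoverable_errors <= MAX_REINIT_ATTEMPTS:
        remaining = MAX_REINIT_ATTEMPTS - unrecoverable_errors
        return f"port reinitialized; {remaining} attempts remaining"
    if port_required:
        return "computer bugchecks (CIPORT)"
    return "port taken off line; reboot required to restore it"


print(recovery_outcome(3, port_required=True))
print(recovery_outcome(51, port_required=False))
```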

C.11.5 LAN Device-Attention Entries

Example C-7 shows device-attention entries for the LAN. The left column gives the name of a device register or a memory location. The center column gives the value contained in that register or location, and the right column gives an interpretation of that value.

Example C-7 LAN Device-Attention Entry

************************* ENTRY   80. ****************************  (1)
ERROR SEQUENCE 26.                    LOGGED ON:      SID 08000000 
DATE/TIME 15-JAN-1994 11:30:53.07                SYS_TYPE 01010000  (2)
DEVICE ATTENTION  KA630                                             (3)
                  SCS NODE: PHOBOS 
NI-SCS SUB-SYSTEM, PHOBOS$PEA0:                                     (4)
       FATAL ERROR DETECTED BY DATALINK                             (5)
 
       STATUS1         0000002C                                     (6)
       STATUS2         00000000 
       DATALINK UNIT       0001                                     (7)
       DATALINK NAME   41515803                                     (8)
                       00000000 
                       00000000 
                       00000000 
                                       DATALINK NAME = XQA1: 
       REMOTE NODE     00000000                                     (9)
                       00000000 
                       00000000 
                       00000000 
       REMOTE ADDR     00000000                                     (10)
                           0000 
       LOCAL ADDR      000400AA                                     (11)
                           4C07 
                                       ETHERNET ADDR = AA-00-04-00-07-4C 
       ERROR CNT           0001                                     (12)
                                       1. ERROR OCCURRENCES THIS ENTRY 
       UCB$W_ERRCNT        0007 
                                       7. ERRORS THIS UNIT 

The following table describes the LAN device-attention entries in Example C-7.
Entry Description
(1) The first two lines are the entry heading. These lines contain the number of the entry in this error log file, the sequence number of this error, and the identification number (SID) of this computer. Each entry in the log file contains such a heading.
(2) This line contains the date and time and the computer type.
(3) The next two lines contain the entry type, the processor type (KA630), and the computer's SCS node name.
(4) This line shows the name of the subsystem and component that caused the entry.
(5) This line shows the reason for the entry. The LAN driver has shut down the data link because of a fatal error. The data link will be restarted automatically, if possible.
(6) STATUS1 shows the I/O completion status returned by the LAN driver. STATUS2 is the VCI event code delivered to PEDRIVER by the LAN driver. The event values and meanings are described in the following table:
Event Code Meaning
1200 Port usable
1201 Port unusable
1202 Change address

If a message transmit was involved, the status applies to that transmit.

(7) DATALINK UNIT shows the unit number of the LAN device on which the error occurred.
(8) DATALINK NAME is the name of the LAN device on which the error occurred.
(9) REMOTE NODE is the name of the remote node to which the packet was being sent. If zeros are displayed, either no remote node was available or no packet was associated with the error.
(10) REMOTE ADDR is the LAN address of the remote node to which the packet was being sent. If zeros are displayed, no packet was associated with the error.
(11) LOCAL ADDR is the LAN address of the local node.
(12) ERROR CNT. Because some errors can occur at extremely high rates, some error log entries represent more than one occurrence of an error. This field indicates how many. The errors counted occurred in the 3 seconds preceding the timestamp on the entry.
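
The interpretation lines in Example C-7 can be reproduced from the raw values. The Python sketch below is based only on the values shown in that example: it assumes that each longword or word is printed most significant byte first and stored little-endian, and that DATALINK NAME is a counted ASCII string (a length byte followed by the characters). The function names are hypothetical.

```python
def lan_address(longword_hex, word_hex):
    """Rebuild the 6-byte LAN address from the LOCAL ADDR or REMOTE ADDR values."""
    raw = int(longword_hex, 16).to_bytes(4, "little")
    raw += int(word_hex, 16).to_bytes(2, "little")
    return "-".join(f"{byte:02X}" for byte in raw)


def datalink_name(longwords_hex, unit_hex):
    """Rebuild the device name from DATALINK NAME and DATALINK UNIT.

    Assumption: the first byte is the length of the name that follows.
    """
    raw = b"".join(int(lw, 16).to_bytes(4, "little") for lw in longwords_hex)
    length = raw[0]
    name = raw[1:1 + length].decode("ascii")
    return f"{name}{int(unit_hex, 16)}:"


# Values taken from Example C-7.
print(lan_address("000400AA", "4C07"))
print(datalink_name(["41515803", "00000000", "00000000", "00000000"], "0001"))
```

With the values from Example C-7, the two calls produce the same address and device name that the listing shows on its ETHERNET ADDR and DATALINK NAME interpretation lines.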

