Document revision date: 30 March 2001
[Compaq] [Go to the documentation home page] [How to order documentation] [Help on this site] [How to contact us]

OpenVMS Cluster Systems



F.11.3 Setting Up the Distributed Enable Filter

Use the values shown in Table F-15 to set up a filter, named Distrib_Enable, for the distributed enable packet received event. Use this filter to troubleshoot multiple LAN segments.

Table F-15 Setting Up a Distributed Enable Filter (Distrib_Enable)
Byte Number   Field         Value                    ASCII
1             DESTINATION   01--4C--41--56--63--45   .LAVcE
7             SOURCE        xx--xx--xx--xx--xx--xx
13            TYPE          60--07                   `.
15            TEXT          xx

F.11.4 Setting Up the Distributed Trigger Filter

Use the values shown in Table F-16 to set up a filter, named Distrib_Trigger, for the distributed trigger packet received event. Use this filter to troubleshoot multiple LAN segments.

Table F-16 Setting Up the Distributed Trigger Filter (Distrib_Trigger)
Byte Number   Field         Value                    ASCII
1             DESTINATION   01--4C--41--56--63--54   .LAVcT
7             SOURCE        xx--xx--xx--xx--xx--xx
13            TYPE          60--07                   `.
15            TEXT          xx

F.12 Messages

This section describes how to set up the distributed enable and distributed trigger messages.

F.12.1 Distributed Enable Message

Table F-17 shows how to define the distributed enable message (Distrib_Enable) by creating a new message. You must replace the source address (nn nn nn nn nn nn) with the LAN address of the LAN analyzer.

Table F-17 Setting Up the Distributed Enable Message (Distrib_Enable)
Field         Byte Number   Value                           ASCII
Destination   1             01 4C 41 56 63 45               .LAVcE
Source        7             nn nn nn nn nn nn
Protocol      13            60 07                           `.
Text          15            44 69 73 74 72 69 62 75 74 65   Distribute
              25            64 20 65 6E 61 62 6C 65 20 66   d enable f
              35            6F 72 20 74 72 6F 75 62 6C 65   or trouble
              45            73 68 6F 6F 74 69 6E 67 20 74   shooting t
              55            68 65 20 4C 6F 63 61 6C 20 41   he Local A
              65            72 65 61 20 56 4D 53 63 6C 75   rea VMSclu
              75            73 74 65 72 20 50 72 6F 74 6F   ster Proto
              85            63 6F 6C 3A 20 4E 49 53 43 41   col: NISCA
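
The table rows above can be cross-checked programmatically. The following sketch (an illustration added here, not part of the original procedure) assembles the Distrib_Enable frame from the destination address, a placeholder source address, the protocol type, and the text payload, then decodes the payload back to ASCII:

```python
# Build the Distrib_Enable message from the values in Table F-17.
# The source address below is a placeholder (nn nn nn nn nn nn in the
# table); replace it with the LAN address of your LAN analyzer.
destination = bytes([0x01, 0x4C, 0x41, 0x56, 0x63, 0x45])   # .LAVcE
source = bytes(6)                                           # placeholder
protocol = bytes([0x60, 0x07])                              # protocol type 60-07
text = (b"Distributed enable for troubleshooting the "
        b"Local Area VMScluster Protocol: NISCA")

frame = destination + source + protocol + text

# The text field starts at byte 15 (1-based), that is, offset 14.
print(frame[14:].decode("ascii"))
```

Decoding `frame[14:]` reproduces the ASCII column of the Text rows in Table F-17, which is a quick way to verify a hand-entered message.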

F.12.2 Distributed Trigger Message

Table F-18 shows how to define the distributed trigger message (Distrib_Trigger) by creating a new message. You must replace the source address (nn nn nn nn nn nn) with the LAN address of the LAN analyzer.

Table F-18 Setting Up the Distributed Trigger Message (Distrib_Trigger)
Field         Byte Number   Value                           ASCII
Destination   1             01 4C 41 56 63 54               .LAVcT
Source        7             nn nn nn nn nn nn
Protocol      13            60 07                           `.
Text          15            44 69 73 74 72 69 62 75 74 65   Distribute
              25            64 20 74 72 69 67 67 65 72 20   d trigger
              35            66 6F 72 20 74 72 6F 75 62 6C   for troubl
              45            65 73 68 6F 6F 74 69 6E 67 20   eshooting
              55            74 68 65 20 4C 6F 63 61 6C 20   the Local
              65            41 72 65 61 20 56 4D 53 63 6C   Area VMScl
              75            75 73 74 65 72 20 50 72 6F 74   uster Prot
              85            6F 63 6F 6C 3A 20 4E 49 53 43   ocol: NISC
              95            41                              A

F.13 Programs That Capture Retransmission Errors

You can program the HP 4972 LAN Protocol Analyzer, as shown in the following source code, to capture retransmission errors. The starter program initiates the capture across all of the LAN analyzers. Only one LAN analyzer should run a copy of the starter program. Other LAN analyzers should run either the partner program or the scribe program. The partner program is used when the initial location of the error is unknown and when all analyzers should cooperate in the detection of the error. Use the scribe program to trigger on a specific LAN segment as well as to capture data from other LAN segments.

F.13.1 Starter Program

The starter program initially sends the distributed enable signal to the other LAN analyzers. Next, this program captures all of the LAN traffic, and terminates as a result of either a retransmitted packet detected by this LAN analyzer or after receiving the distributed trigger sent from another LAN analyzer running the partner program.

The starter program shown in the following example is used to initiate data capture on multiple LAN segments using multiple LAN analyzers. The goal is to capture the data during the same time interval on all of the LAN segments so that the reason for the retransmission can be located.


Store: frames matching LAVc_all 
 or Distrib_Enable 
 or Distrib_Trigger 
       ending with LAVc_TR_ReXMT 
        or Distrib_Trigger 
 
Log file: not used 
 
Block 1:   Enable_the_other_analyzers 
     Send message Distrib_Enable 
       and then 
     Go to block 2 
 
Block 2:   Wait_for_the_event 
     When frame matches LAVc_TR_ReXMT then go to block 3 
 
Block 3:   Send the distributed trigger 
     Mark frame 
       and then 
     Send message Distrib_Trigger 

F.13.2 Partner Program

The partner program waits for the distributed enable; then it captures all of the LAN traffic and terminates as a result of either a retransmission or the distributed trigger. Upon termination, this program transmits the distributed trigger to make sure that other LAN analyzers also capture the data at about the same time as when the retransmitted packet was detected on this segment or another segment. After the data capture completes, the data from multiple LAN segments can be reviewed to locate the initial copy of the data that was retransmitted. The partner program is shown in the following example:


Store: frames matching LAVc_all 
        or Distrib_Enable 
        or Distrib_Trigger 
       ending with Distrib_Trigger 
 
Log file: not used 
 
Block 1:   Wait_for_distributed_enable 
     When frame matches Distrib_Enable then go to block 2 
 
Block 2:   Wait_for_the_event 
     When frame matches LAVc_TR_ReXMT then go to block 3 
 
Block 3:   Send the distributed trigger 
     Mark frame 
       and then 
     Send message Distrib_Trigger 

F.13.3 Scribe Program

The scribe program waits for the distributed enable and then captures all of the LAN traffic and terminates as a result of the distributed trigger. The scribe program allows a network manager to capture data at about the same time as when the retransmitted packet was detected on another segment. After the data capture has completed, the data from multiple LAN segments can be reviewed to locate the initial copy of the data that was retransmitted. The scribe program is shown in the following example:


Store: frames matching LAVc_all 
        or Distrib_Enable 
        or Distrib_Trigger 
       ending with Distrib_Trigger 
 
Log file: not used 
 
Block 1:   Wait_for_distributed_enable 
     When frame matches Distrib_Enable then go to block 2 
 
Block 2:   Wait_for_the_event 
     When frame matches LAVc_TR_ReXMT then go to block 3 
 
Block 3:   Mark_the_frames 
     Mark frame 
       and then 
     Go to block 2 


Appendix G
NISCA Transport Protocol Channel Selection and Congestion Control

G.1 NISCA Transmit Channel Selection

This appendix describes PEDRIVER running on OpenVMS Version 7.3 (Alpha and VAX) and PEDRIVER running on earlier versions of OpenVMS Alpha and VAX.

G.1.1 Multiple-Channel Load Distribution on OpenVMS Version 7.3 (Alpha and VAX) or Later

While all available channels to a node can be used to receive datagrams from that node, not all channels are necessarily used to transmit datagrams to that node. From the set of all available channels to a remote node, the NISCA protocol chooses a set of equally desirable channels to be used for datagram transmission. This set of transmit channels is called the equivalent channel set (ECS). Datagram transmissions are distributed in round-robin fashion across all the ECS members, thus maximizing internode cluster communications throughput.
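
As a rough illustration of the round-robin distribution described above (the class and channel names are invented for this sketch and are not PEDRIVER internals):

```python
class EcsTransmitter:
    """Round-robin datagram distribution across ECS member channels.

    A simplified sketch of the behavior described in the text; the
    structure and names here are hypothetical, not PEDRIVER code.
    """
    def __init__(self, ecs_channels):
        self.ecs = list(ecs_channels)
        self._next = 0

    def pick_channel(self):
        # Rotate through ECS members so transmissions are spread
        # evenly across all equally desirable channels.
        channel = self.ecs[self._next % len(self.ecs)]
        self._next += 1
        return channel

tx = EcsTransmitter(["EWA-to-EWA", "EWB-to-EWB", "EWA-to-EWB"])
picks = [tx.pick_channel() for _ in range(6)]
```

After six transmissions, each of the three ECS members has been used exactly twice, which is the throughput-maximizing property the round-robin scheme provides.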

G.1.1.1 Equivalent Channel Set Selection

When multiple node-to-node channels are available, the OpenVMS Cluster software bases the choice of which set of channels to use on the following criteria, which are evaluated in strict precedence order:

  1. Packet loss history
    Channels that have recently been losing LAN packets at a high rate are termed lossy and will be excluded from consideration. Channels that have an acceptable loss history are termed tight and will be further considered for use.
  2. Capacity
    Next, capacity criteria for the current set of tight channels are evaluated. The capacity criteria are:
    1. Priority
      Management priority values can be assigned both to individual channels and to local LAN devices. A channel's priority value is the sum of these management-assigned priority values. Only tight channels with a priority value equal to, or one less than, the highest priority value of any tight channel will be further considered for use.
    2. Packet size
      Tight, equivalent-priority channels whose maximum usable packet size is equivalent to that of the largest maximum usable packet size of any tight equivalent-priority channel will be further considered for use.

    A channel that satisfies all of these capacity criteria is classified as a peer. A channel that is deficient with respect to any capacity criterion is classified as inferior. A channel that exceeds one or more of the current capacity criteria and meets the others is classified as superior.
    Note that detection of a superior channel will immediately result in recalculation of the capacity criteria for membership. This recalculation will result in the superior channel's capacity criteria becoming the ECS's capacity criteria, against which all tight channels will be evaluated.
    Similarly, if the last peer channel becomes unavailable or lossy, the capacity criteria for ECS membership will be recalculated. This will likely result in previously inferior channels becoming classified as peers.
    Channels whose capacity values have not been evaluated against the current ECS membership capacity criteria will sometimes be classified as ungraded. Since they cannot affect the current ECS membership criteria, lossy channels are marked as ungraded as a computational expedient when a complete recalculation of ECS membership is being performed.
  3. Delay
    Channels that meet the preceding ECS membership criteria will be used if their average round-trip delays are closely matched to that of the fastest such channel; that is, they are fast. A channel that does not meet the ECS membership delay criteria is considered slow.
    The delay of each channel currently in the ECS is measured using cluster communications traffic sent using that channel. If a channel has not been used to send a datagram for a few seconds, its delay will be measured using a round-trip handshake. Thus, a lossy or slow channel will be measured at intervals of a few seconds to determine whether its delay, or datagram loss rate, has improved enough so that it meets the ECS membership criteria.

Using the terminology introduced in this section, the ECS members are the current set of tight, peer, and fast channels.
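
The selection rules above can be sketched as a filtering pipeline. In this illustration the field names and the delay tolerance are invented placeholders; the sketch shows the stated criteria in precedence order, not PEDRIVER's actual implementation:

```python
def select_ecs(channels):
    """Sketch of ECS selection: tight -> peer -> fast.

    `channels` is a list of dicts with hypothetical keys:
    lossy (bool), priority (int), max_packet (int), delay (float).
    """
    # 1. Packet loss history: exclude lossy channels; the rest are "tight".
    tight = [c for c in channels if not c["lossy"]]
    if not tight:
        return []
    # 2a. Priority: keep tight channels whose priority is equal to,
    #     or one less than, the highest tight-channel priority.
    top = max(c["priority"] for c in tight)
    tight = [c for c in tight if c["priority"] >= top - 1]
    # 2b. Packet size: keep channels matching the largest usable
    #     packet size among the remaining channels ("peers").
    biggest = max(c["max_packet"] for c in tight)
    peers = [c for c in tight if c["max_packet"] == biggest]
    # 3. Delay: keep channels whose round-trip delay is closely matched
    #    to the fastest peer (the 1.5x tolerance is an invented value).
    fastest = min(c["delay"] for c in peers)
    return [c for c in peers if c["delay"] <= fastest * 1.5]
```

A lossy channel never reaches the capacity or delay checks, mirroring the strict precedence order of the criteria.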

G.1.1.2 Local and Remote LAN Adapter Load Distribution

Once the ECS member channels are selected, they are ordered using an algorithm that attempts to arrange them so as to use all local adapters for packet transmissions before returning to reuse a local adapter. Also, the ordering algorithm attempts to do the same with all remote LAN adapters. Once the order is established, it is used round robin for packet transmissions.
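
One simple way to achieve the ordering goal just described is to interleave channels grouped by local adapter, so that every local adapter is used once before any adapter is reused. This is a sketch of that goal only (PEDRIVER's actual ordering algorithm also considers remote adapters and is not published here):

```python
from collections import defaultdict, deque

def order_channels(channels):
    """Interleave channels so consecutive transmissions cycle
    through all local adapters before reusing any one of them.

    `channels` is a list of (local_adapter, remote_adapter) pairs;
    the adapter names in the test below are hypothetical.
    """
    by_local = defaultdict(deque)
    for local, remote in channels:
        by_local[local].append((local, remote))
    ordered = []
    queues = deque(by_local.values())
    while queues:
        q = queues.popleft()
        ordered.append(q.popleft())
        if q:
            queues.append(q)   # revisit this adapter after the others
    return ordered
```

Applied round robin, an ordering like this spreads the transmit load across all local adapters rather than saturating one adapter first.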

With these algorithms, PEDRIVER will make a best effort at utilizing multiple LAN adapters on a server node that communicates continuously with a client that also has multiple LAN adapters, as well as with a number of clients. In a two-node cluster, PEDRIVER will actively attempt to use all available LAN adapters that have usable LAN paths to the other node's LAN adapters, and that have comparable capacity values. Thus, additional adapters provide both higher availability and alternative paths that can be used to avoid network congestion.

G.1.2 Preferred Channel (OpenVMS Version 7.2 and Earlier)

This section describes the transmit-channel selection algorithm used by OpenVMS VAX and Alpha prior to OpenVMS Version 7.3.

All available channels to a node can be used to receive datagrams from that node. From the set of available channels to a remote node, PEDRIVER chooses a single channel on which to transmit datagrams.

The driver software chooses a transmission channel to each remote node. A selection algorithm for the transmission channel makes a best effort to ensure that messages are sent in the order they are expected to be received. Sending the messages in this way also maintains compatibility with previous versions of the operating system. The currently selected transmission channel is called the preferred channel.

At any point in time, the TR level of the NISCA protocol can modify its choice of a preferred channel, as follows:

PEDRIVER continually uses received HELLO messages to compute the incoming network delay value for each channel, so each channel's incoming delay is recalculated at intervals of approximately 2 to 3 seconds. PEDRIVER assumes that the network uses a broadcast medium (for example, an Ethernet wire or an FDDI ring) and that incoming and outgoing delays are therefore symmetrical.

PEDRIVER switches the preferred channel based on observed network delays or network component failures. Switching to a new transmission channel sometimes causes messages to be received out of the desired order. PEDRIVER uses a receive resequencing cache to reorder these messages instead of discarding them, which eliminates unnecessary retransmissions.
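
The resequencing behavior just described can be sketched as follows. This is an illustrative data structure, not PEDRIVER's implementation; the class and method names are invented:

```python
class ResequencingCache:
    """Sketch of a receive resequencing cache: messages that arrive
    out of order after a channel switch are held until the missing
    sequence numbers arrive, instead of being discarded (discarding
    them would force unnecessary retransmissions).
    """
    def __init__(self):
        self.next_seq = 0   # next sequence number to deliver
        self.held = {}      # out-of-order messages awaiting delivery

    def receive(self, seq, message):
        """Accept a message; return any messages now deliverable in order."""
        delivered = []
        self.held[seq] = message
        while self.next_seq in self.held:
            delivered.append(self.held.pop(self.next_seq))
            self.next_seq += 1
        return delivered

cache = ResequencingCache()
out = []
for seq, msg in [(0, "a"), (2, "c"), (1, "b"), (3, "d")]:
    out += cache.receive(seq, msg)
```

Message "c" arrives early, is held, and is delivered immediately after "b" fills the gap, so the receiver never has to request a retransmission of either.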

With these algorithms, PEDRIVER has a greater chance of utilizing multiple adapters on a server node that communicates continuously with a number of clients. In a two-node cluster, PEDRIVER will actively use at most two LAN adapters: one to transmit and one to receive. Additional adapters provide both higher availability and alternative paths that can be used to avoid network congestion. As more nodes are added to the cluster, PEDRIVER is more likely to use the additional adapters.

G.2 NISCA Congestion Control

Network congestion occurs as the result of complex interactions of workload distribution and network topology, including the speed and buffer capacity of individual hardware components.

Network congestion can have a negative impact on cluster performance in several ways.

Thus, although a particular network component or protocol cannot guarantee the absence of congestion, the NISCA transport protocol implemented in PEDRIVER incorporates several mechanisms to mitigate the effects of congestion on OpenVMS Cluster traffic and to avoid having cluster traffic exacerbate congestion when it occurs. These mechanisms affect the retransmission of packets carrying user data and the multicast HELLO datagrams used to maintain connectivity.

G.2.1 Congestion Caused by Retransmission

Associated with each virtual circuit from a given node is a transmission window size, which indicates the number of packets that can be outstanding to the remote node (that is, the number of packets that can be sent to the node at the other end of the virtual circuit before an acknowledgment [ACK] is received).

If the window size is 8 for a particular virtual circuit, then the sender can transmit up to 8 packets in a row but, before sending the ninth, must wait until receiving an ACK indicating that at least the first of the 8 has arrived.
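
The window rule in the example above can be sketched as a small counter. This is an illustration of the concept only; the names are invented and the real driver tracks considerably more state:

```python
class TransmitWindow:
    """Sketch of the transmission-window rule: at most `size`
    unacknowledged packets may be outstanding at once.
    """
    def __init__(self, size=8):
        self.size = size
        self.outstanding = 0

    def can_send(self):
        return self.outstanding < self.size

    def send(self):
        if not self.can_send():
            raise RuntimeError("window full: wait for an ACK")
        self.outstanding += 1

    def ack(self):
        # An ACK frees one slot in the window.
        self.outstanding -= 1

w = TransmitWindow(size=8)
for _ in range(8):
    w.send()                  # eight packets in a row are allowed
blocked = not w.can_send()    # the ninth must wait...
w.ack()                       # ...until an ACK arrives
w.send()                      # now the ninth packet can go
```

With a window of 8, the ninth `send()` is blocked until the first ACK arrives, exactly as in the prose example.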

If an ACK is not received, a timeout occurs; the packet is assumed lost and must be retransmitted. If another timeout occurs for a retransmitted packet, the timeout interval is significantly increased and the packet is retransmitted again. After a large number of consecutive retransmissions of the same packet has occurred, the virtual circuit is closed.
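
The escalation pattern can be sketched as a simple backoff schedule. The base interval, growth factor, and retry limit below are invented placeholders; the source states only that the interval is "significantly increased" and that the circuit closes after many consecutive retransmissions:

```python
def retransmit_timeouts(base=1.0, factor=2.0, max_attempts=5):
    """Sketch of timeout escalation for one repeatedly lost packet.

    Returns the timeout interval used before each retransmission;
    after the final attempt the virtual circuit would be closed.
    All three parameters are hypothetical, not PEDRIVER values.
    """
    timeout = base
    intervals = []
    for _ in range(max_attempts):
        intervals.append(timeout)
        timeout *= factor      # back off significantly on each retry
    return intervals
```

Doubling the interval on each consecutive timeout reduces the retransmission load a congested network must carry, which is the stated goal of avoiding having cluster traffic exacerbate congestion.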

