OpenVMS Cluster Systems

Document revision date: 19 July 1999

F.11.3 Setting Up the Distributed Enable Filter

Use the values shown in Table F-15 to set up a filter, named Distrib_Enable, for the distributed enable packet received event. Use this filter to troubleshoot multiple LAN segments.

**Table F-15 Setting Up a Distributed Enable Filter (Distrib_Enable)**
Byte Number	Field	Value	ASCII
1	DESTINATION	01-4C-41-56-63-45	.LAVcE
7	SOURCE	xx-xx-xx-xx-xx-xx
13	TYPE	60-07	`.
15	TEXT	xx

F.11.4 Setting Up the Distributed Trigger Filter

Use the values shown in Table F-16 to set up a filter, named Distrib_Trigger, for the distributed trigger packet received event. Use this filter to troubleshoot multiple LAN segments.

**Table F-16 Setting Up the Distributed Trigger Filter (Distrib_Trigger)**
Byte Number	Field	Value	ASCII
1	DESTINATION	01-4C-41-56-63-54	.LAVcT
7	SOURCE	xx-xx-xx-xx-xx-xx
13	TYPE	60-07	`.
15	TEXT	xx
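
The two filters differ only in the last byte of the destination address. As a rough illustration of how such a pattern matches a raw frame, the following Python sketch treats each `xx` byte as a wildcard. The names `pattern`, `matches`, and `classify` are hypothetical and do not correspond to the analyzer's own interface. Byte numbers in the tables are 1-based, so byte 1 is offset 0 in the code.

```python
# Illustrative model of the Table F-15 and F-16 patterns.
# This is not the analyzer's filter engine; all names here are assumptions.
WILDCARD = None  # stands for an "xx" byte, which matches any value


def pattern(dest_hex: str) -> list:
    """Build a 15-byte pattern: destination, wildcard source, type 60-07, one wildcard text byte."""
    dest = list(bytes.fromhex(dest_hex.replace("-", "")))
    return dest + [WILDCARD] * 6 + [0x60, 0x07] + [WILDCARD]


FILTERS = {
    "Distrib_Enable": pattern("01-4C-41-56-63-45"),
    "Distrib_Trigger": pattern("01-4C-41-56-63-54"),
}


def matches(frame: bytes, name: str) -> bool:
    """Return True if every non-wildcard pattern byte equals the frame byte at that offset."""
    pat = FILTERS[name]
    if len(frame) < len(pat):
        return False
    return all(p is WILDCARD or p == b for p, b in zip(pat, frame))


def classify(frame: bytes) -> list:
    """Return the names of all filters that the frame matches."""
    return [name for name in FILTERS if matches(frame, name)]
```

Because the source address and the first text byte are wildcards, a frame from any analyzer matches as long as the destination and type fields agree.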

F.12 Messages

This section describes how to set up the distributed enable and distributed trigger messages.

F.12.1 Distributed Enable Message

Table F-17 shows how to define the distributed enable message (Distrib_Enable) by creating a new message. You must replace the source address (nn nn nn nn nn nn) with the LAN address of the LAN analyzer.

**Table F-17 Setting Up the Distributed Enable Message (Distrib_Enable)**
Field	Byte Number	Value	ASCII
Destination	1	01 4C 41 56 63 45	.LAVcE
Source	7	nn nn nn nn nn nn
Protocol	13	60 07	`.
Text	15	44 69 73 74 72 69 62 75 74 65	Distribute
	25	64 20 65 6E 61 62 6C 65 20 66	d enable f
	35	6F 72 20 74 72 6F 75 62 6C 65	or trouble
	45	73 68 6F 6F 74 69 6E 67 20 74	shooting t
	55	68 65 20 4C 6F 63 61 6C 20 41	he Local A
	65	72 65 61 20 56 4D 53 63 6C 75	rea VMSclu
	75	73 74 65 72 20 50 72 6F 74 6F	ster Proto
	85	63 6F 6C 3A 20 4E 49 53 43 41	col: NISCA

F.12.2 Distributed Trigger Message

Table F-18 shows how to define the distributed trigger message (Distrib_Trigger) by creating a new message. You must replace the source address (nn nn nn nn nn nn) with the LAN address of the LAN analyzer.

**Table F-18 Setting Up the Distributed Trigger Message (Distrib_Trigger)**
Field	Byte Number	Value	ASCII
Destination	1	01 4C 41 56 63 54	.LAVcT
Source	7	nn nn nn nn nn nn
Protocol	13	60 07	`.
Text	15	44 69 73 74 72 69 62 75 74 65	Distribute
	25	64 20 74 72 69 67 67 65 72 20	d trigger
	35	66 6F 72 20 74 72 6F 75 62 6C	for troubl
	45	65 73 68 6F 6F 74 69 6E 67 20	eshooting
	55	74 68 65 20 4C 6F 63 61 6C 20	the Local
	65	41 72 65 61 20 56 4D 53 63 6C	Area VMScl
	75	75 73 74 65 72 20 50 72 6F 74	uster Prot
	85	6F 63 6F 6C 3A 20 4E 49 53 43	ocol: NISC
	95	41	A
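
The byte values in Tables F-17 and F-18 can be cross-checked by assembling the messages from their fields. The following Python sketch is only an aid for verifying the hexadecimal values before entering them on the analyzer. The function names and the source address used in the usage note are hypothetical, and the sketch ignores any padding or frame check sequence that the LAN hardware adds.

```python
# Illustrative only: assembles the message bytes listed in Tables F-17 and F-18.
DEST = {
    "Distrib_Enable": bytes.fromhex("014C41566345"),   # .LAVcE
    "Distrib_Trigger": bytes.fromhex("014C41566354"),  # .LAVcT
}
PROTOCOL = bytes.fromhex("6007")
TEXT = {
    "Distrib_Enable": (b"Distributed enable for troubleshooting "
                       b"the Local Area VMScluster Protocol: NISCA"),
    "Distrib_Trigger": (b"Distributed trigger for troubleshooting "
                        b"the Local Area VMScluster Protocol: NISCA"),
}


def build_message(name: str, source: str) -> bytes:
    """Concatenate destination, source, protocol, and text for the named message."""
    src = bytes.fromhex(source.replace("-", "").replace(" ", ""))
    if len(src) != 6:
        raise ValueError("source must be a 6-byte LAN address")
    return DEST[name] + src + PROTOCOL + TEXT[name]


def table_rows(message: bytes, width: int = 10):
    """Yield (byte_number, hex, ascii) rows for the text field, 1-based as in the tables."""
    text = message[14:]  # text starts at byte number 15
    for i in range(0, len(text), width):
        chunk = text[i:i + width]
        yield 15 + i, chunk.hex(" ").upper(), chunk.decode("ascii")
```

For example, `list(table_rows(build_message("Distrib_Trigger", "02-00-00-00-00-01")))` produces one row per line of the Text field in Table F-18, where `02-00-00-00-00-01` is a placeholder for the LAN address of your analyzer.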

F.13 Programs That Capture Retransmission Errors

You can program the HP 4972 LAN Protocol Analyzer, as shown in the following source code, to capture retransmission errors. The starter program initiates the capture across all of the LAN analyzers. Only one LAN analyzer should run a copy of the starter program. Other LAN analyzers should run either the partner program or the scribe program. The partner program is used when the initial location of the error is unknown and when all analyzers should cooperate in the detection of the error. Use the scribe program to trigger on a specific LAN segment as well as to capture data from other LAN segments.

F.13.1 Starter Program

The starter program first sends the distributed enable signal to the other LAN analyzers. Next, this program captures all of the LAN traffic. It terminates either when this LAN analyzer detects a retransmitted packet or when it receives the distributed trigger sent from another LAN analyzer running the partner program.

The starter program shown in the following example is used to initiate data capture on multiple LAN segments using multiple LAN analyzers. The goal is to capture the data during the same time interval on all of the LAN segments so that the reason for the retransmission can be located.

    Store: frames matching LAVc_all
             or Distrib_Enable
             or Distrib_Trigger
           ending with LAVc_TR_ReXMT
             or Distrib_Trigger

    Log file: not used

    Block 1:  Enable_the_other_analyzers
        Send message Distrib_Enable
          and then
        Go to block 2

    Block 2:  Wait_for_the_event
        When frame matches LAVc_TR_ReXMT then go to block 3

    Block 3:  Send the distributed trigger
        Mark frame
          and then
        Send message Distrib_Trigger

F.13.2 Partner Program

The partner program waits for the distributed enable; then it captures all of the LAN traffic and terminates as a result of either a retransmission or the distributed trigger. Upon termination, this program transmits the distributed trigger so that the other LAN analyzers also capture data from about the time that the retransmitted packet was detected on this segment or another segment. After the data capture completes, you can review the data from multiple LAN segments to locate the initial copy of the data that was retransmitted. The partner program is shown in the following example:

    Store: frames matching LAVc_all
             or Distrib_Enable
             or Distrib_Trigger
           ending with Distrib_Trigger

    Log file: not used

    Block 1:  Wait_for_distributed_enable
        When frame matches Distrib_Enable then go to block 2

    Block 2:  Wait_for_the_event
        When frame matches LAVc_TR_ReXMT then go to block 3

    Block 3:  Send the distributed trigger
        Mark frame
          and then
        Send message Distrib_Trigger
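
The block structure of the partner program behaves like a small state machine. The following Python sketch models that control flow for a sequence of frame events, where each event is the name of the filter that the frame matches. It is a simplified model and not analyzer code: the function name `run_partner`, the action strings, and the assumption that a received Distrib_Trigger ends the capture in any block are all illustrative.

```python
def run_partner(events):
    """Model the partner program's blocks and return the list of actions taken.

    events: iterable of filter names, one per observed frame (an assumed encoding).
    """
    block = 1
    actions = []
    for event in events:
        # The Store clause ends with Distrib_Trigger: assume capture stops here.
        if event == "Distrib_Trigger":
            actions.append("stop: trigger received")
            return actions
        if block == 1 and event == "Distrib_Enable":
            # Block 1: Wait_for_distributed_enable
            actions.append("enabled")
            block = 2
        elif block == 2 and event == "LAVc_TR_ReXMT":
            # Block 2 saw a retransmission; block 3 marks it and notifies the others.
            actions.append("mark frame")
            actions.append("send Distrib_Trigger")
            actions.append("stop: trigger sent")
            return actions
    return actions
```

In this model, a retransmission seen before the distributed enable arrives is ignored, which mirrors the way block 1 waits only for Distrib_Enable.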

F.13.3 Scribe Program

The scribe program waits for the distributed enable, then captures all of the LAN traffic, and terminates as a result of the distributed trigger. The scribe program allows a network manager to capture data from about the time that the retransmitted packet was detected on another segment. After the data capture has completed, you can review the data from multiple LAN segments to locate the initial copy of the data that was retransmitted. The scribe program is shown in the following example:

Appendix G
PEDRIVER Congestion Control and Channel Selection

Network congestion occurs as the result of complex interactions of workload distribution and network topology, including the speed and buffer capacity of individual hardware components.

G.1 PEDRIVER Congestion Control

Network congestion can have a negative impact on cluster performance in several ways:

Moderate levels of congestion can lead to increased queue lengths in network components (such as adapters and bridges) that in turn can lead to increased latency and slower response.
Higher levels of congestion can result in the discarding of packets because of queue overflow.
Packet loss can lead to packet retransmissions and, potentially, even more congestion. In extreme cases, packet loss can result in the loss of OpenVMS Cluster connections.

Thus, although a particular network component or protocol cannot guarantee the absence of congestion, PEDRIVER incorporates several improved mechanisms to mitigate the effects of congestion on OpenVMS Cluster traffic and to avoid having cluster traffic exacerbate congestion when it occurs. These mechanisms affect the retransmission of packets carrying user data and the multicast HELLO datagrams used to maintain connectivity.

G.1.1 Congestion Caused by Retransmission

Associated with each virtual circuit from a given node is a transmission window size, which indicates the number of packets that can be outstanding to the remote node (that is, the number of packets that can be sent to the node at the other end of the virtual circuit before receiving an acknowledgment [ACK]).

If the window size is 8 for a particular virtual circuit, then the sender can transmit up to 8 packets in a row but, before sending the ninth, must wait until receiving an ACK indicating that at least the first of the 8 has arrived. If an ACK is not received, a timeout occurs, and the packet is assumed lost and must be retransmitted.
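
The window rule in the preceding paragraph can be sketched in a few lines of Python. This is a minimal model and not PEDRIVER code: the class name `WindowedSender` and its methods are hypothetical, and the sketch assumes that an ACK for a packet also acknowledges every earlier packet.

```python
class WindowedSender:
    """Toy sender that allows at most window_size unacknowledged packets."""

    def __init__(self, window_size=8):
        self.window_size = window_size
        self.next_seq = 0        # sequence number of the next packet to send
        self.oldest_unacked = 0  # lowest sequence number not yet acknowledged

    def can_send(self):
        """True while fewer than window_size packets are outstanding."""
        return self.next_seq - self.oldest_unacked < self.window_size

    def send(self):
        """Send one packet and return its sequence number."""
        if not self.can_send():
            raise RuntimeError("window full: wait for an ACK")
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def ack(self, seq):
        """Acknowledge every packet up to and including seq (assumed cumulative)."""
        if seq >= self.oldest_unacked:
            self.oldest_unacked = min(seq + 1, self.next_seq)
```

With a window size of 8, eight calls to `send()` succeed, `can_send()` then returns `False`, and a call to `ack(0)` opens the window for the ninth packet.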

For PEDRIVER running on VMS Version 5.5 or earlier:

The window size is relatively static (usually 8, 16, or, for FDDI, 31), and the retransmission policy assumes that all outstanding packets are lost and thus retransmits them. Retransmission of an entire window of packets under congestion conditions tends to exacerbate the condition significantly.
The timeout interval for determining that a packet is lost is fixed (3 seconds). This means that the loss of a single packet can interrupt communication between cluster nodes for as long as 3 seconds.

For PEDRIVER running on OpenVMS VAX Version 6.0 or later, or OpenVMS AXP Version 1.5 or later:

The retransmission mechanism is an adaptation of algorithms originally proposed for the Internet by Van Jacobson and improves on the old mechanism by making both the window size and the retransmission timeout interval adapt to network conditions.

When a timeout occurs because of a lost packet, the window size is decreased immediately to reduce the load on the network. The window size is allowed to grow only after congestion subsides. More specifically, when a packet loss occurs, the window size is decreased to 1 and remains there, allowing the transmitter to send only one packet at a time until all the original outstanding packets have arrived.
After this occurs, the window is allowed to grow quickly until it reaches half its previous size. Once reaching the halfway point, the window size is allowed to increase relatively slowly to take advantage of available network capacity until it reaches a maximum value determined by the configuration variables (for example, number of adapter buffers).
The retransmission timeout interval is set based on measurements of actual round-trip times for packets that are transmitted over the virtual circuit. This allows PEDRIVER to be more responsive to packet loss in most networks but avoids premature timeouts for networks in which the actual round-trip delay approaches several seconds.
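
The adaptive behavior described above can be approximated with the following Python sketch. It is not the PEDRIVER implementation. The growth rates (doubling per loss-free round trip up to the halfway point, then adding 1) and the smoothing constants (1/8, 1/4, and a factor of 4) are assumptions borrowed from the conventional TCP form of the Van Jacobson algorithms, and all class and method names are hypothetical.

```python
class AdaptiveWindow:
    """Toy model: collapse to 1 on loss, grow fast to half, then slowly to the maximum."""

    def __init__(self, max_window=8):
        self.max_window = max_window   # limit set by configuration (for example, buffers)
        self.window = max_window
        self.threshold = max_window    # fast growth is allowed up to this size
        self.recovering = False

    def on_timeout(self):
        """A packet was lost: remember half the old size and drop the window to 1."""
        self.threshold = max(1, self.window // 2)
        self.window = 1
        self.recovering = True

    def on_recovered(self):
        """All originally outstanding packets have arrived; growth may resume."""
        self.recovering = False

    def on_clean_round_trip(self):
        """Grow the window after a round trip without loss (assumed growth rates)."""
        if self.recovering:
            return
        if self.window < self.threshold:
            self.window = min(self.window * 2, self.threshold)
        elif self.window < self.max_window:
            self.window += 1


class RetransmitTimer:
    """Timeout derived from measured round-trip times (TCP-style constants, assumed)."""

    def __init__(self, first_sample):
        self.srtt = first_sample        # smoothed round-trip time, seconds
        self.rttvar = first_sample / 2  # smoothed deviation, seconds

    def update(self, sample):
        """Fold in one round-trip measurement and return the new timeout."""
        err = sample - self.srtt
        self.rttvar += (abs(err) - self.rttvar) / 4
        self.srtt += err / 8
        return self.timeout()

    def timeout(self):
        return self.srtt + 4 * self.rttvar
```

Starting from a window of 16, a timeout in this model drops the window to 1. After recovery, it grows 2, 4, 8 and then 9, 10, and so on up to 16. The timer shortens the timeout on fast, stable paths and lengthens it when measured round-trip times rise or vary.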

G.1.2 HELLO Multicast Datagrams

PEDRIVER periodically multicasts a HELLO datagram over each network adapter attached to the node. The HELLO datagram serves two purposes:

It informs other nodes of the existence of the sender so that they can form channels and virtual circuits.
It helps to keep communications open once they are established.

HELLO datagram congestion and loss of HELLO datagrams can prevent connections from forming or cause connections to be lost. Table G-1 describes conditions causing HELLO datagram congestion and how PEDRIVER helps avoid the problems. The result is a substantial decrease in the probability of HELLO datagram synchronization and thus a decrease in HELLO datagram congestion.

**Table G-1 Conditions that Create HELLO Datagram Congestion**
Conditions that cause congestion	How PEDRIVER avoids congestion
If all nodes receiving a HELLO datagram from a new node responded immediately, the receiving network adapter on the new node could be overrun with HELLO datagrams and be forced to drop some, resulting in connections not being formed. This is especially likely in large clusters.	To avoid this problem, nodes that receive HELLO datagrams delay for a random time interval before responding. On VMS Version 5.5-2 or earlier, the delay is up to 1 second. On OpenVMS VAX Version 6.0 or later, or OpenVMS AXP Version 1.5 or later, the delay is a maximum of 2 seconds to support large OpenVMS Cluster systems.
If a large number of nodes in a network became synchronized and transmitted their HELLO datagrams at or near the same time, receiving nodes could drop some datagrams and time out channels.	On nodes running VMS Version 5.5-2 or earlier, PEDRIVER multicasts HELLO datagrams over each adapter every 3 seconds, making HELLO datagram congestion more likely. On nodes running OpenVMS VAX Version 6.0 or later, or OpenVMS AXP Version 1.5 or later, PEDRIVER prevents this form of HELLO datagram congestion by distributing its HELLO datagram multicasts randomly over time. A HELLO datagram is still multicast over each adapter approximately every 3 seconds but not over all adapters at once. Instead, if a node has multiple network adapters, PEDRIVER attempts to distribute its HELLO datagram multicasts so that it sends a HELLO datagram over some of its adapters during each second of the 3-second interval. In addition, rather than multicasting precisely every 3 seconds, PEDRIVER varies the time between HELLO datagram multicasts from approximately 1.6 to 3 seconds, changing the average from 3 seconds to approximately 2.3 seconds.
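
The timing figures in Table G-1 can be illustrated with a short Python sketch. It assumes that the intervals and delays are drawn uniformly, which the table does not state. A uniform draw between 1.6 and 3 seconds does, however, give the quoted average of (1.6 + 3) / 2 = 2.3 seconds. The constant and function names are hypothetical.

```python
import random

HELLO_MIN = 1.6           # seconds, shortest interval between HELLO multicasts
HELLO_MAX = 3.0           # seconds, longest interval between HELLO multicasts
MAX_RESPONSE_DELAY = 2.0  # seconds, OpenVMS VAX V6.0 / AXP V1.5 or later


def next_hello_interval(rng):
    """Pick the time until the next HELLO multicast (uniform draw is an assumption)."""
    return rng.uniform(HELLO_MIN, HELLO_MAX)


def response_delay(rng):
    """Pick the random delay before answering a HELLO from a new node."""
    return rng.uniform(0.0, MAX_RESPONSE_DELAY)


def mean_interval(samples, seed=0):
    """Estimate the average HELLO interval over a number of simulated multicasts."""
    rng = random.Random(seed)
    return sum(next_hello_interval(rng) for _ in range(samples)) / samples
```

Because each node draws its intervals independently, two nodes that happen to multicast at the same moment are unlikely to stay synchronized on later multicasts.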

G.2 Transmit Channel Selection

Of the channels available to a given remote node, PEDRIVER uses a single channel to transmit datagrams and all channels to receive datagrams. The driver software chooses a transmission channel for each remote node. A selection algorithm for the transmission channel ensures that messages are sent in the order they are expected to be received. Sending the messages in this way also maintains compatibility with previous versions of the operating system. The selected transmission channel is called the preferred channel.

G.2.1 Preferred Channel

At any point in time, the TR level of the NISCA protocol can modify its choice of a preferred channel based on the following:

Minimum measured delay
The NISCA protocol routinely measures HELLO message delays and uses these measurements to pick the most lightly loaded channel on which to send messages.
Maximum packet size
PEDRIVER favors channels with large packet sizes. For example, an FDDI-to-FDDI channel is favored over an FDDI-to-Ethernet channel or an Ethernet-to-Ethernet channel. If your configuration uses FDDI-to-Ethernet bridges, the PPC level of the NISCA protocol segments messages into the smaller packet sizes allowed by Ethernet before transmitting them.
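
As a rough sketch of how the two criteria might combine, the following Python fragment ranks the usable channels by packet size first and by measured delay second. The source does not say how PEDRIVER weighs one criterion against the other, so that ordering is an assumption, as are the `Channel` fields and the packet sizes used in the usage note.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    name: str
    max_packet_size: int  # bytes (illustrative values only)
    delay: float          # most recent measured delay, seconds
    usable: bool = True   # False after a network component failure


def preferred_channel(channels):
    """Return the usable channel with the largest packet size, then the lowest delay."""
    candidates = [c for c in channels if c.usable]
    if not candidates:
        return None
    # Assumed ordering: packet size dominates, delay breaks ties.
    return min(candidates, key=lambda c: (-c.max_packet_size, c.delay))
```

Given one FDDI-to-FDDI channel of about 4400 bytes and two 1500-byte channels, the function picks the FDDI-to-FDDI channel. If that channel is marked unusable, it picks the 1500-byte channel with the lower delay.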

PEDRIVER continually computes a network delay value for each channel. PEDRIVER switches the preferred channel based on observed network delays or network component failures. Switching to a new transmission channel sometimes causes messages to be received out of the desired order. PEDRIVER uses a receive cache to reorder these messages instead of discarding them, which eliminates unnecessary retransmissions.
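
The role of the receive cache can be illustrated with a small reordering buffer. This Python sketch is not PEDRIVER's data structure: the class name `ReceiveCache`, the per-message sequence numbers, and the default capacity of 8 are hypothetical.

```python
class ReceiveCache:
    """Hold messages that arrive early and release them in sequence order."""

    def __init__(self, capacity=8):
        self.capacity = capacity  # how far ahead of the expected message to cache
        self.expected = 0         # sequence number of the next message to deliver
        self.pending = {}         # early arrivals, keyed by sequence number

    def receive(self, seq, payload):
        """Accept one message and return the payloads now deliverable in order."""
        if seq < self.expected or seq in self.pending:
            return []  # duplicate of something already delivered or cached
        if seq - self.expected >= self.capacity:
            return []  # too far ahead to cache; left for retransmission
        self.pending[seq] = payload
        delivered = []
        while self.expected in self.pending:
            delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        return delivered
```

If message 1 arrives before message 0 after a channel switch, the cache holds message 1 and then delivers both in order when message 0 arrives, so neither needs to be retransmitted.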

With these algorithms, PEDRIVER has a greater chance of utilizing multiple adapters on a server node that communicates continuously with a number of clients. In a two-node cluster, PEDRIVER will actively use at most two LAN adapters: one to transmit and one to receive. Additional adapters provide both higher availability and alternative paths that can be used to avoid network congestion. As more nodes are added to the cluster, PEDRIVER is more likely to use the additional adapters.

G.2.2 Restrictions

Some restrictions apply to remote nodes running a version of the operating system prior to VMS Version 5.4-3. Messages received out of order on such remote nodes are discarded because those nodes lack the receive cache. For these remote nodes, PEDRIVER cannot switch channels based on the observed network delays. In this case, PEDRIVER chooses a single transmission channel and uses it until the channel fails. Only at that time will PEDRIVER switch to another channel.
