Document revision date: 19 July 1999
Consider the following guidelines when using MONITOR:
See the OpenVMS System Manager's Manual: Tuning, Monitoring, and Complex Systems and the OpenVMS System Management Utilities Reference Manual: M--Z for information about using
MONITOR.
4.4.1 Types of Output
MONITOR generates the following types of output:
MONITOR provides two input modes of operation for collecting data---live and playback.
Use live mode to collect data on a running system and to generate one or more of the following types of MONITOR output---ASCII screen images, binary recording files, or formatted ASCII summary files.
Use live mode to display data about a remote system connected to your system with DECnet for OpenVMS.
Use playback mode to read a binary recording file and produce one or
more of the following types of MONITOR output---ASCII screen images,
binary recording files, or formatted ASCII summary files.
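As an illustration of the two modes, the following commands are a minimal sketch rather than a prescribed procedure. The file names MYDATA.DAT and PAGE.SUM and the 60-second interval are assumptions chosen for the example.

```dcl
$ ! Live mode: display the SYSTEM class on the screen
$ MONITOR SYSTEM
$ ! Live mode: record all classes to a binary file without screen output
$ MONITOR /NODISPLAY /RECORD=MYDATA.DAT /INTERVAL=60 ALL_CLASSES
$ ! Playback mode: read the recording and write an ASCII summary of PAGE
$ MONITOR /INPUT=MYDATA.DAT /NODISPLAY /SUMMARY=PAGE.SUM PAGE
```

A live-mode request runs until you stop it or until a time given with the /ENDING qualifier is reached.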
4.4.3 Creating a Performance Information Database
As a foundation for the strategy discussed in this chapter, you must develop a database of performance information for your system by running MONITOR continuously as a background process.
The SYS$EXAMPLES directory provides three command procedures you can use to establish the database. The following table describes the procedures:
Procedure | Description |
---|---|
SUBMON.COM | Starts MONITOR.COM as a detached process. |
MONITOR.COM | Creates a summary file from the binary recording file of the previous boot, then begins recording for this boot. The recording interval is 10 minutes. |
MONSUM.COM (VAX only) | Generates two OpenVMS Cluster multifile summary reports: one for the previous 24 hours and one for the previous day's prime-time period (9 a.m. to 6 p.m.). These reports are mailed to the system manager, and then the procedure resubmits itself to run each day at midnight. |
When MONITOR data is recorded continuously, a summary report can cover
any contiguous time segment.
4.4.4 Saving Your Summary Reports
The two multifile summary reports are not saved as files. To keep them, you must do either of the following:
The report you require for the evaluation procedure is one that covers a period that best represents the typical operation of your system. You might want, for example, to evaluate your system only during hours of peak activity.
To generate a summary of the appropriate time segment, edit the
MONSUM.COM command procedure and change the beginning and ending times
on one of the two MONITOR commands that produce the summary reports.
4.4.6 Report Formats
The summary reports produced by MONSUM.COM are in the multifile summary format---there is one column of averages for each node in a VMScluster, as well as some overall row statistics. For noncluster systems, the row statistics can be ignored.
If you prefer to use a report in the standard summary format (which
includes current, minimum, and maximum statistics), execute a MONITOR
playback summary command referencing the input data file of interest as
the only file in the /INPUT list. Note that a new data file is created
for each system whenever it reboots. Remember to use the /BEGINNING and
/ENDING qualifiers to select the desired time period.
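A playback request of this kind might look like the following sketch. The file names MYDATA.DAT and PRIME.SUM, the date and times, and the choice of the PAGE and IO classes are assumptions; substitute the recording file and period of interest.

```dcl
$ ! Standard-format summary from a single recording file,
$ ! limited to one prime-time period
$ MONITOR /INPUT=MYDATA.DAT -
          /NODISPLAY /SUMMARY=PRIME.SUM -
          /BEGINNING=19-JUL-1999:09:00 /ENDING=19-JUL-1999:18:00 -
          PAGE,IO
```

Because only one file appears in the /INPUT list, the summary is produced in the standard format rather than the multifile format.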
4.4.7 Using MONITOR in Live Mode
You are encouraged to observe current system activity regularly by running MONITOR in live mode. In live mode, always begin an analysis with the MONITOR CLUSTER and MONITOR SYSTEM classes to obtain an overview of system performance.
Then, monitor other classes to examine components of particular interest.
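A live-mode session following this advice could begin as sketched below. The choice of DISK as the follow-up class is only an assumption for illustration.

```dcl
$ ! Start with an overview of the cluster and of the local system
$ MONITOR CLUSTER
$ MONITOR SYSTEM
$ ! Then examine a class of particular interest, for example disk activity
$ MONITOR DISK
```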
All references to MONITOR items in this chapter are assumed to be for the average statistic, unless otherwise noted.
In multifile reports, a page or more is devoted to each MONITOR class. Each column represents one node, and is headed by the node name and beginning and ending times of the segment requested. In most cases, time segments for all nodes will be roughly the same. Differences of a few minutes are typical, because data collection on the various nodes is not synchronized.
In some cases, one or more time segments will be shorter than others; in these cases, some of the requested data was not recorded (probably because the nodes were unavailable). Note that if data is unavailable for some period within the bounds of a request, that fact is not explicitly specified.
However, such a gap can occur only when the column of data uses more than one input file; and if multiple files contributed to the column, the number is shown in parentheses to the right of the node name. In cases where a time segment is missing, this number must be greater than 1. If no number appears, there is only one input data file for that column, and the column includes no missing time segments.
To summarize: if all beginning and ending times are not roughly the
same or if a parenthesized number appears, some data may be
unavailable, and you may want to base your evaluation on a different
time segment that includes more complete data. Whenever the multifile
report is based on incomplete data, the Row Average statistic can be
weighted unfairly in favor of one or more nodes.
4.4.9 Interpreting MONITOR Statistics
While interpreting MONITOR statistics, keep in mind that the collection interval has no effect on the accuracy of MONITOR rates. It does, however, affect levels, because they represent sampled data. In other words, the smaller the collection interval, the more accurate MONITOR level statistics will be. (For more information on MONITOR rates and levels, refer to the OpenVMS System Manager's Manual: Tuning, Monitoring, and Complex Systems.)
Although the interval value supplied with MONITOR.COM is adequate for most purposes, it does represent a trade-off between statistical accuracy and the consumption of disk space. Thus, before you base major decisions on MONITOR level statistics, be sure to verify them by running MONITOR for a time with a much smaller collection interval while carefully observing disk space usage.
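One way to run such a check is sketched below. The 5-second interval, the file name VERIFY.DAT, and the choice of the STATES class (whose items are levels) are assumptions for the example.

```dcl
$ ! Record the STATES class every 5 seconds to a separate file;
$ ! stop the request once enough data has been collected
$ MONITOR /NODISPLAY /RECORD=VERIFY.DAT /INTERVAL=5 STATES
$ ! Check how much disk space the recording consumed
$ DIRECTORY /SIZE=ALL VERIFY.DAT
```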
This chapter describes how to track down system resources that can
limit performance. When you suspect that your system performance is
suffering from a limited resource, you can begin to investigate which
resource is most likely responsible. In a correctly behaving system
that becomes fully loaded, one of the three resources---memory, I/O, or
CPU---becomes the limiting resource. Which resource assumes that role
depends on the kind of load your system is supporting.
5.1 Diagnostic Strategy
Appendix A contains a number of decision trees for diagnosing limiting resources. Note that the diagrams include command recommendations to help you obtain required information. The recommended commands appear in parentheses below the description of the information required.
The procedures use the process of elimination to determine the source of performance problems. There are fairly simple tests you can use to rule out certain classes of problems.
Use the following guidelines when conducting your preliminary investigation:
Your preliminary investigation can proceed by checking for the
possibility of memory limitations, then I/O limitations, and finally a
CPU limitation.
5.2.1 Memory Limitations
Memory limitations are manifestations of such diverse problems as too little physical memory for the work attempted, inappropriate use of the memory management features, improper assignments of memory resources to users, and so forth.
To determine if you may have memory limitations, use the DCL commands MONITOR IO or MONITOR PAGE as shown in the following table:
If you... | Then you... |
---|---|
observe little or no inswapping, ample free memory, and little paging | can rule out memory limitations. |
observe significant inswapping, little free memory, or significant paging | should investigate memory limitations further. (See Chapter 7.) |
You can also detect memory limitations by using SHOW SYSTEM to
check for processes in the RW_FPG and RW_MPG states. If either state is
displayed consistently, there is a serious shortage of memory. Very
little improvement can be made by tuning the system. Compaq recommends
buying more memory.
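The commands named in this section can be entered as follows. This is a sketch of a quick check, not a complete diagnostic procedure; the comments list the display items most relevant to memory.

```dcl
$ ! Inswap Rate, Free List Size, and paging rates appear in the IO display
$ MONITOR IO
$ ! Page fault and page read/write rates, plus free and modified list sizes
$ MONITOR PAGE
$ ! List processes and their current scheduling states
$ SHOW SYSTEM
```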
5.2.2 I/O Limitations
I/O limitations occur when the number or speed of devices is insufficient. You will also find an I/O limitation when application design errors either place inappropriate demand on particular devices or do not employ sufficiently large blocking factors or numbers of buffers.
To determine if you may have an I/O limitation, enter the DCL command MONITOR IO or MONITOR SYSTEM and observe the rates for direct I/O and buffered I/O.
If... | Then you... |
---|---|
your system is not performing any direct I/O | do not have a disk I/O limitation. |
you observe that there is no buffered I/O | do not have a terminal I/O limitation. |
either or both operations are occurring | cannot rule out the possibility of an I/O limitation. (See Chapter 8.) |
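As a brief sketch, either of the following commands displays the two rates of interest:

```dcl
$ ! Direct I/O Rate and Buffered I/O Rate appear in both displays
$ MONITOR IO
$ MONITOR SYSTEM
```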
5.2.3 CPU Limitations
The CPU can become the binding resource when the work load places extensive demand on it. Perhaps all the work becomes heavily computational, or there is some condition that gives unfair advantages to certain users.
To determine if there is a CPU limitation, use the DCL command MONITOR STATES.
You might also use the DCL command MONITOR MODES to observe the amount of user mode time. The MONITOR MODES display also reveals the amount of idle time, which is sometimes called the null time.
If... | Then... |
---|---|
many of your processes are in the computable state | you can conclude there is a CPU limitation. |
many of your processes are in the computable outswapped state | be sure to address the issue of a memory limitation first. (See Section 9.2.4.) |
the user mode time is high | a CPU limitation is likely. |
there is almost no idle time | it is fair to conclude that the CPU is being heavily used. |
A final indicator of a CPU limitation that the MONITOR MODES display
provides is the amount of kernel mode time. A high percentage of time
in kernel mode can indicate excessive consumption of the CPU resource
by the operating system. This problem is more likely the result of a
memory limitation but could indicate a CPU limitation as well. If you
decide to investigate the CPU limitation further, proceed through the
steps in Chapter 9.
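The two displays discussed in this section can be obtained as sketched here:

```dcl
$ ! Number of processes in each scheduling state, including
$ ! computable and computable outswapped
$ MONITOR STATES
$ ! Time spent in each processor mode, including user, kernel, and idle
$ MONITOR MODES
```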
5.3 After the Preliminary Investigation
When you have completed your preliminary investigation, you are ready to:
Once you take the appropriate remedial action, monitor the effectiveness of the changes and, if you do not obtain sufficient improvement, try again. In some cases, you will need to repeat the same steps, but either increase or decrease the magnitude of the changes you made. In other cases, you will proceed further in the investigation and uncover some other underlying cause of the problem and take corrective steps.
The diagrams and text do not attempt to depict this looping. Rather, repetition is always implied, pending the outcome of the changes. Therefore, tuning is frequently an iterative process. The approach to tuning presented by this chapter and Chapter 10 assumes that multiple causes of performance problems are uncovered by repeating the steps shown until you achieve satisfactory performance.
Effective tuning requires that you can observe the undesirable performance behavior while you test.
You will find it especially helpful to keep a listing of the current values of all your system parameters nearby as you conduct the following investigations. Running SYSGEN and specifying a file name is one method for obtaining this listing. (See the OpenVMS System Manager's Manual: Tuning, Monitoring, and Complex Systems.)
    $ RUN SYS$SYSTEM:SYSGEN
    SYSGEN> SET/OUTPUT=filename
    SYSGEN> SHOW/ALL
    SYSGEN> SHOW/SPECIAL
    SYSGEN> EXIT
    $ PRINT/DELETE filename
Overall responsiveness of a system depends largely on the
responsiveness of its CPU, memory, and disk I/O resources. If each
resource responds satisfactorily, then so will the entire system.
6.1 Understanding System Responsiveness
Each resource must operate efficiently by itself and it must also interact with other resources.
An important aspect of your evaluation is to distinguish between resources that might be performing poorly because they are overcommitted and those that might be doing so because one or both of the following conditions has occurred:
A binding resource or bottleneck is an overcommitted resource that causes the others to be blocked or burdened with overhead operations. Proper identification of such a resource is critical to correction of a performance problem. Upgrading a nonbinding resource will do nothing to improve a bottlenecked system.
Detecting bottlenecks is particularly important for analyzing interactions of the CPU with each of the other resources.
For example, CPU blockage occurs when CPU capacity, though it appears
sufficient to meet demand, cannot be used because the CPU must wait for
disk I/O to complete or memory to be allocated.
6.1.2 Balancing Resource Capacities
Because of the potential for bottlenecks, it is especially important to maintain balance among the capacities of your system's resources.
For example, when upgrading to a faster CPU, consider the effect the
additional CPU power will have on the other primary resources. Because
the faster CPU can initiate more I/O requests per unit of time, you
must ensure that the disk I/O subsystem has sufficient capacity to
handle the increased traffic.
6.2 Evaluating Responsiveness of System Resources
For each resource, key MONITOR statistics help you answer such questions as:
The two prime measures of resource responsiveness are:
For each resource, you can use MONITOR summaries to examine or estimate
one or both of these quantities.
6.3 Improving Responsiveness of System Resources
You can investigate four main ways to improve responsiveness:
Example
Excess memory capacity is often used to reduce the demand on an
overworked disk I/O subsystem by increasing the size of each I/O
transfer, thereby reducing the total number of I/O operations. The CPU
benefits as well, because it needs to do less work executing system
services and device driver software. The primary means of offloading
I/O to memory is the extensive use of caches (page caches, XQP caches,
virtual I/O caching, RMS blocking) to reduce the number of I/O
operations.
If the responsiveness of a poorly performing resource cannot be improved by these methods, you should consider augmenting its capacity with additional or upgraded hardware.