Updated: 11 December 1998
OpenVMS User's Manual
11.8 Optimizing a Sort or Merge Operation
There are several ways in which you can improve the efficiency of a
Sort or Merge operation, based on your sorting environment. Use the
/STATISTICS qualifier with the SORT or MERGE command to get information
about the variables in your sorting environment.
After you examine the statistics display, consider any of the
optimization options presented in the following sections.
When you enter the SORT or MERGE command with the /STATISTICS
qualifier, you see output similar to the following:
$ SORT/STATISTICS PAGEANT.LIS DOCUMENT.LIS
OpenVMS Sort/Merge Statistics
Records read:                 3 (1)   Input record length:             26
Records sorted:               3       Internal length:                 28
Records output:               3       Output record length:            26
Working set extent:       16384 (2)   Sort tree size:                  42
Virtual memory:             392       Number of initial runs:           0
Direct I/O:                  10       Maximum merge order:              0
Buffered I/O:                11       Number of merge passes:           0
Page faults:                158 (3)   Work file allocation:             0 (4)
Elapsed time:       00:00:00.54       Elapsed CPU:            00:00:00.03 (5)
As you examine the fields, note the following:
- (1) Records read
Lists the number of records that were read during a Sort operation. See
Section 11.8.2 for information on selectively omitting records from a
Sort operation.
- (2) Working set extent
Shows how many blocks are reserved to perform the sort operation. See
Section 11.8.4 for information on making your working set larger.
- (3) Page faults
Shows how many times the operating system has transferred parts of your
process from physical memory to your paging device. See Section 11.8.4
for more information on preventing paging.
- (4) Work file allocation
Shows how much disk space is reserved for your work file. See
Section 11.8.3 for more information on work files.
- (5) Elapsed CPU
Shows how much CPU time the operating system took to process the sort
operation. See Section 11.8.1 for information on saving time by
choosing different methods of sorting.
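If you capture the statistics display in a file, a short script can pick
out the fields for you. The following Python sketch is illustrative only:
the name parse_sort_statistics and the regular expression are assumptions
based on the display layout shown above, not part of the Sort/Merge
utility.

```python
import re

# Matches "Label: value" pairs. A value is either an elapsed time
# (hh:mm:ss.cc) or an integer count. This pattern is an assumption
# derived from the sample display, not a documented format.
_FIELD = re.compile(r"([A-Z][A-Za-z/ ]*):\s+(\d\d:\d\d:\d\d\.\d\d|\d+)")

def parse_sort_statistics(text):
    """Return a dict mapping each statistic label to its value.

    Integer counts are converted to int; elapsed times stay as strings.
    """
    stats = {}
    for label, value in _FIELD.findall(text):
        stats[label] = value if ":" in value else int(value)
    return stats

# The sample display from this section, with the callout numbers removed.
SAMPLE = """\
OpenVMS Sort/Merge Statistics
Records read: 3 Input record length: 26
Records sorted: 3 Internal length: 28
Records output: 3 Output record length: 26
Working set extent: 16384 Sort tree size: 42
Virtual memory: 392 Number of initial runs: 0
Direct I/O: 10 Maximum merge order: 0
Buffered I/O: 11 Number of merge passes: 0
Page faults: 158 Work file allocation: 0
Elapsed time: 00:00:00.54 Elapsed CPU: 00:00:00.03
"""

stats = parse_sort_statistics(SAMPLE)
```

In this sample, a work file allocation of 0 together with 0 merge passes
suggests that the sort completed without using work files.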
11.8.1 Sorting Process
Sort defines four processes for sorting data internally: record, tag,
address, and index. (The high-performance Sort/Merge utility supports
only the record process. Implementation of tag, address, and index
processes is deferred to a future OpenVMS Alpha release.) RECORD is the
default process. The type of process you choose affects the performance
of the Sort operation as well as storage requirements. See
Section 11.2.6 for information about the different sort processes.
Before you select a sorting process, consider the following:
- How you will use the output file
- Because record and tag sorting generate files that contain entire
sorted records, these reordered files are ready to be used.
- Both address- and index-sorted output files can be processed by a
program written in a programming language such as Pascal, Fortran,
MACRO, or C.
- Address sorting creates an output file of pointers to the records
in the input file. This list consists of binary RFAs plus a file number
when sorting multiple input files. A program accesses the records by
using the pointers.
- Index sorting creates an output file containing both RFAs and key
fields plus a file number when sorting multiple files. The format of
these key fields is the same as in the input files. If the program
needs the key field contents for a decision during future processing,
select index sorting rather than address sorting.
If you need to reorder records from one file in several ways for
different purposes, store several output files from address or index
sorting. Use the output files to access the records in the main file in
the sorted order that you want.
- The temporary storage space available for sorting
Tag sorting uses less temporary storage space than record sorting.
Because record sorting keeps the record intact during the sort, it uses
much more work space when the files are large. Address and index
sorting use little temporary storage space.
- The type of input and output device used
Record sorting is the only process that can accept input from cards,
magnetic tape, and disks. Output from tag and record sorting can go to
any output device. Output from address and index sorting must go to a
device that accepts binary data.
- The differences in speed
If you plan to retrieve the sorted records at some point in the
operation, record sorting is usually the fastest process. Otherwise,
address and index sorting are the fastest processes.
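The difference between the four processes is easiest to see in what each
one writes to the output file. The following Python sketch is a
conceptual model only, not the Sort implementation: the sample records
are hypothetical, a list index stands in for the RFA, and the function
names are invented for illustration.

```python
# Hypothetical sample data: a 4-character key at the start of each record.
RECORDS = ["0042 widget", "0007 gadget", "0019 sprocket"]

def key_of(record):
    """Extract the sort key (the first four characters)."""
    return record[:4]

def record_sort(records):
    """Record process: entire records are moved and written in key order."""
    return sorted(records, key=key_of)

def address_sort(records):
    """Address process: the output holds only pointers to input records.

    A list index stands in for the RFA in this model.
    """
    return sorted(range(len(records)), key=lambda i: key_of(records[i]))

def tag_sort(records):
    """Tag process: order keys and pointers, then fetch the entire
    records from the input so the output contains whole records."""
    return [records[i] for i in address_sort(records)]

def index_sort(records):
    """Index process: the output holds a pointer plus the key field."""
    return [(i, key_of(records[i])) for i in address_sort(records)]
```

A program that reads address-sorted output must go back to the input file
for every record, whereas index-sorted output already carries the key
field, which is why index sorting suits programs that make decisions
based on the key.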
11.8.2 Omitting Records and Fields
In a specification file, you can improve Sort efficiency by using the
/CONDITION, /INCLUDE, and /OMIT qualifiers to process only those
records needed in the output file. (The high-performance Sort/Merge
utility does not support specification files. Implementation of this
feature is deferred to a future OpenVMS Alpha release.) You can also
use specification file qualifiers to reformat records, omitting
unnecessary fields from the output file. These qualifiers are not
available as command line qualifiers.
11.8.3 Assigning Work Files
During a Sort operation, records from the input file are read into
memory. If the allocated memory cannot hold all the records, Sort
transfers the sorted data to one or more temporary work files. Merge
does not use work files.
You can increase sort efficiency by changing the number of work files
and by assigning them to specific devices:
- The Sort command line qualifier /WORK_FILES=n overrides
the default number of work files that Sort allocates.
- Normally, Sort places work files on the device SYS$SCRATCH and
accesses them in an arbitrary order. You can assign work files to
specific devices in two ways:
- In a specification file, the /WORK_FILES=(device,...)
qualifier places the work files on the specified devices. See
Section 11.9.3 for more information about using the /WORK_FILES
qualifier in a specification file.
- If you are not using a specification file, you can use the DCL
command ASSIGN to assign the work files to specific devices.
Sort uses the SORTWORKn logical names to identify user-specified
device names for the work files, where n is a value from 0
through 9. (For the high-performance Sort/Merge utility, n is
a value from 0 through 254.) Define the SORTWORKn logical names as in
the following example:
$ ASSIGN WORK$2: SORTWORK1
$ ASSIGN WORK$3: SORTWORK2
This example defines SORTWORK1 as the device WORK$2: and SORTWORK2
as the device WORK$3:. (For more information on logical names, see
Chapter 13.)
Consider the following when you assign work files to devices:
- Assign work files to the fastest devices available, for example,
random-access mass storage devices such as disks.
- Choose devices with the least activity and the most space available.
- Assign each work file to a different physical device to maximize
overlapping input and output.
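The relationship between available memory, sorted runs, and work files
can be illustrated with a generic external merge sort. The following
Python sketch is a textbook model, not the algorithm that Sort actually
uses: external_sort and max_in_memory are hypothetical names, each run is
written to its own temporary file (Sort itself uses a limited number of
work files), and records are assumed to be text without embedded
newlines.

```python
import heapq
import tempfile
from itertools import islice

def external_sort(records, max_in_memory):
    """Sort text records, spilling sorted runs to temporary work files.

    Returns (sorted_records, number_of_work_files). No work file is
    created when all records fit in memory at once.
    """
    if max_in_memory < 1:
        raise ValueError("max_in_memory must be at least 1")
    source = iter(records)
    chunk = sorted(islice(source, max_in_memory))
    following = sorted(islice(source, max_in_memory))
    if not following:
        return chunk, 0          # everything fit in memory

    work_files = []
    try:
        while chunk:
            # Write one sorted run to its own work file.
            work = tempfile.TemporaryFile(mode="w+")
            work.writelines(record + "\n" for record in chunk)
            work.seek(0)
            work_files.append(work)
            chunk, following = following, sorted(islice(source, max_in_memory))
        # Merge all runs into one sorted output.
        runs = ((line.rstrip("\n") for line in work) for work in work_files)
        return list(heapq.merge(*runs)), len(work_files)
    finally:
        for work in work_files:
            work.close()
```

In this model, placing each work file on a different physical device
would let the merge step read from several devices at once, which is the
reason for the advice above to spread work files across devices.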
11.8.4 Modifying the Working Set Extent
If Sort requires work files (for example, if you are sorting a large
file), a larger working set can increase sort efficiency. However, if
your system is used heavily, it might be unable to allocate all the
pages in the working set extent to your process. This can result in
paging, which occurs when the operating system transfers parts of a
process between physical memory and memory on a paging device; only the
active part of the process remains in the physical memory. To avoid
excessive paging, you can decrease the working set extent for your
process. (Use the SET WORKING_SET command to decrease the working set
extent.)
11.9 Summary of Sort/Merge Qualifiers
The following list describes command qualifiers used with the SORT and
MERGE commands. To use a command qualifier, include the qualifier
immediately after the SORT or MERGE command.
/[NO]CHECK_SEQUENCE
- This qualifier applies to the MERGE command only. It verifies the
sequence of the records in MERGE input files. Merge checks the sequence
of records by default.
The /CHECK_SEQUENCE qualifier checks whether the records of one or more
files (up to 10; the high-performance Sort/Merge utility supports up to
12) have been sorted. (The records will still be directed to an output
file, which you must specify.) If you are checking whether records are
sorted on a key field other than the entire record, you must specify
key information, along with the requesting sequence.
Use the /NOCHECK_SEQUENCE qualifier to prevent Merge from checking the
sequence of records.
Example
$ MERGE/KEY=(SIZE:4,POSITION:3)/NOCHECK_SEQUENCE -
_$ PRICE1.DAT,PRICE2.DAT PRICE.LIS
In this example, the /NOCHECK_SEQUENCE qualifier specifies that the
sequence of the input files, PRICE1.DAT and PRICE2.DAT, is not to be
checked.
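Sequence checking matters because a merge produces sorted output only
when every input is already in order. The following Python sketch models
that behavior under stated assumptions: merge_sorted is a hypothetical
name, the check is done up front rather than during the merge, and the
error raised is a plain ValueError rather than any Merge message.

```python
import heapq

def merge_sorted(inputs, key=None, check_sequence=True):
    """Merge already-sorted inputs into one sorted list.

    With check_sequence=True, raise ValueError if any input is out of
    order. With check_sequence=False, merge without verifying.
    """
    get_key = key if key is not None else (lambda record: record)
    if check_sequence:
        for number, records in enumerate(inputs, start=1):
            keys = [get_key(record) for record in records]
            for previous, current in zip(keys, keys[1:]):
                if previous > current:
                    raise ValueError(f"input {number} is out of sequence")
    return list(heapq.merge(*inputs, key=key))
```

Skipping the check saves the comparisons, but the caller is then
responsible for the order of the inputs: merging the out-of-order input
[3, 1] with [2] without checking yields [2, 3, 1].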
/COLLATING_SEQUENCE=sequence
- Selects one of three predefined collating orders for character key
fields, or specifies the name of a National Character Set (NCS)
collating sequence to be used in comparing character keys. (The
high-performance Sort/Merge utility does not support the NCS collating
sequences. Support for NCS collating sequences is deferred to a future
OpenVMS Alpha release.) Sort can arrange characters in ASCII (default),
EBCDIC, or Multinational sequences.
Example
$ SORT/COLLATING_SEQUENCE=MULTINATIONAL -
_$ NAMES.DAT,NOM.DAT LIST.LIS
This SORT command arranges the input files NAMES.DAT and NOM.DAT
according to the Multinational collating sequence to create the output
file LIST.LIS.
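The choice of collating sequence changes the relative order of digits,
uppercase letters, and lowercase letters. The following Python sketch
compares ASCII order with an EBCDIC order. It is an approximation:
Python's cp037 codec is one EBCDIC code page used here as a stand-in, and
it is an assumption that its ordering matches the EBCDIC sequence that
Sort applies. The Multinational sequence is not modeled.

```python
def sort_ascii(names):
    """Order strings by their ASCII byte values."""
    return sorted(names, key=lambda name: name.encode("ascii"))

def sort_ebcdic(names):
    """Order strings by their byte values in EBCDIC code page 037."""
    return sorted(names, key=lambda name: name.encode("cp037"))

SAMPLE_NAMES = ["apple", "Zebra", "42"]
```

In ASCII, digits sort before uppercase letters, which sort before
lowercase letters. In code page 037 the order is reversed: lowercase
letters, then uppercase letters, then digits.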
/[NO]DUPLICATES
- By default, Sort retains all records with duplicate keys. The
/NODUPLICATES qualifier eliminates all but one of multiple records with
duplicate keys. The retained records may not appear in the same order
as they appeared in the input file. If you want to specify which
duplicate record to keep, invoke Sort at the program level and specify
an equal-key routine.
The /STABLE and the /NODUPLICATES qualifiers are mutually exclusive.
Example
$ SORT/KEY=(POSITION:3,SIZE:5,DECIMAL)/NODUPLICATES -
_$ ACCT1,ACCT2 ACCT.LIS
This SORT command arranges the two input files according to the key
supplied and eliminates all but one of multiple records with equal keys.
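The effect of /NODUPLICATES can be modeled as a sort followed by a pass
that keeps one record per key value. The following Python sketch is an
illustration only: sort_no_duplicates is a hypothetical name, and this
model happens to keep the first record in input order, whereas Sort does
not guarantee which duplicate is retained.

```python
def sort_no_duplicates(records, key):
    """Sort records by key and keep one record for each distinct key."""
    kept = []
    for record in sorted(records, key=key):
        # Equal keys are adjacent after sorting, so compare with the
        # most recently kept record only.
        if not kept or key(kept[-1]) != key(record):
            kept.append(record)
    return kept

# Hypothetical records with a 3-character key at the start.
ACCOUNTS = ["001 first", "002 second", "001 third"]
unique = sort_no_duplicates(ACCOUNTS, key=lambda record: record[:3])
```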
/KEY=(POSITION:n,SIZE:n[,field,...])
- Describes key fields, including the position, size, sorting order
(ASCENDING or DESCENDING), priority (NUMBER:n), and data type (such as
character, binary, or H_floating). By default, Sort reorders a file by
sorting entire records with character data in ascending order.
See Section 11.2.1 for detailed information about the /KEY qualifier.
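A key field is a slice of the record that starts at POSITION (counting
the first character as position 1) and runs for SIZE characters. The
following Python sketch models a single character key; key_field and
sort_by_key are hypothetical names, and other data types such as DECIMAL
or binary are not modeled.

```python
def key_field(record, position, size):
    """Return SIZE characters starting at the 1-based POSITION."""
    start = position - 1
    return record[start:start + size]

def sort_by_key(records, position, size, descending=False):
    """Sort records on one character key field."""
    return sorted(
        records,
        key=lambda record: key_field(record, position, size),
        reverse=descending,
    )

# Hypothetical records: the key occupies positions 3 through 6.
PRICES = ["xx0002a", "xx0010b", "xx0001c"]
```

For example, a key described as (POSITION:3,SIZE:4) selects the
characters CDEF from the record ABCDEFGH.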
/PROCESS=type
- (Applies to the SORT command only.) Defines the internal sorting
process. The /PROCESS qualifier allows you to choose one of four
processes: record, tag, address, or index. (The high-performance
Sort/Merge utility supports only the record process. Implementation of
tag, address, and index processes is deferred to a future OpenVMS Alpha
release.)
See Section 11.2.6 for detailed information about the /PROCESS
qualifier.
Example
$ SORT/KEY=(POS:40,SIZ:2,DESC)/PROCESS=TAG YRENDAVG.DAT -
_$ DESCYRAVG.LIS
This Sort operation uses a tag sorting process to create the output
file DESCYRAVG.LIS.
/SPECIFICATION=filespec
- Identifies a Sort or Merge specification file to be used in a Sort
or Merge operation. The default specification file type is .SRT. (The
high-performance Sort/Merge utility does not support this qualifier.
Implementation of this feature is deferred to a future OpenVMS Alpha
release.)
See Section 11.7 and Section 11.9.3 for information about using
specification files.
/[NO]STABLE
- By default, records with equal keys are not guaranteed to be placed
in the output file in the order they appear in the input file. The
/STABLE qualifier maintains the records in that order.
The /STABLE and /NODUPLICATES qualifiers are mutually exclusive.
Example
$ SORT/KEY=(POS:1,SIZ:5,DECIMAL)/STABLE PRICESA.DAT, -
_$ PRICESB.DAT,PRICESC.DAT SUMMARY.LIS
In this Sort operation, records with equal keys from PRICESA.DAT
will be listed first, followed by those from PRICESB.DAT, followed by
those from PRICESC.DAT.
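A stable sort keeps records with equal keys in their input order, which
for multiple input files means the order in which the files are named.
The following Python sketch relies on the fact that Python's sorted() is
stable; stable_sort_files and the sample records are hypothetical, and
the sketch models the effect of /STABLE rather than how Sort implements
it.

```python
def stable_sort_files(files, key):
    """Combine inputs in the order given, then sort stably by key."""
    combined = [record for records in files for record in records]
    return sorted(combined, key=key)   # sorted() preserves input order of ties

# Hypothetical contents of three input files with a 5-character key.
PRICESA = ["00010 from A"]
PRICESB = ["00005 from B", "00010 from B"]
PRICESC = ["00010 from C"]

summary = stable_sort_files([PRICESA, PRICESB, PRICESC],
                            key=lambda record: record[:5])
```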
/[NO]STATISTICS
- Displays a statistical summary to SYS$OUTPUT that can be used for
optimization. To save these statistics in a file, use the following
command:
$ DEFINE/USER SYS$ERROR output-file
The statistical summary contains the following information. Each
statistic is followed by its description:
- Records read
The number of records read by Sort or Merge.
- Records sorted
The number of records that have been processed using Sort. This number
could be less than the number of records read if a specification file
is used to select only certain records for the Sort or Merge operation.
- Records output
The number of records written to the output file. This number could be
less than the number of records sorted if /NODUPLICATES was selected or
if I/O errors occurred when the output records were being written.
- Working set extent
The number of pages in the process working set extent. This value is
used as an upper limit on the size of the sort data structure.
Adjusting this value is one way to improve the efficiency of a Sort
operation.
- Virtual memory
The number of pages of virtual memory added to the Sort image to hold
the data.
- Direct I/O + buffered I/O
This total is the number of I/O movements needed to read and write
data. The lower this total value is, the more efficient the ordering
operation.
- Page faults
Indicates how well the data fits into memory: the higher the number of
page faults, the less efficient the ordering operation.
- Elapsed time
The total wall clock time used by the Sort or Merge operation in hours,
minutes, seconds, and hundredths of seconds.
- Input record length
This value is obtained from the Record Management Services (OpenVMS
RMS) unless the user supplies it.
- Internal length
The size in bytes of an internal format node. This includes any keys,
data, a word to store the length, record file addresses (RFAs), and
converted keys.
- Output record length
The length of the output record. The length is computed from the input
record length, the sort process, and the record reformatting requested.
- Sort tree size
The number of records that fit in the Sort internal data structure.
- Number of initial runs
One indication of how well the data fits into memory.
- Maximum merge order
The maximum number of sorted strings that are merged at one time.
- Number of merge passes
The number of times the Sort utility merges strings until one sorted
output string is produced. The number of initial runs and the number of
merge passes indicate how well the data fits into memory. The higher
these numbers, the further the working set size is from containing the
data and the longer the sorting takes.
- Work file allocation
The number of blocks used for the work files. When more than one merge
pass is needed, this size is approximately twice the size of the input
file allocation.
- Elapsed CPU
The CPU time used by the ordering operation; it does not include time
spent waiting for I/O operations to complete or time spent waiting
while another process executes.
Example
$ SORT/STATISTICS PRICE1.DAT,PRICE2.DAT PRICE.LIS
This SORT/STATISTICS command results in the following statistical
display:
OpenVMS Sort/Merge Statistics
Records read:               793    Input record length:             80
Records sorted:             793    Internal length:                 80
Records output:             793    Output record length:            80
Working set extent:         100    Sort tree size:                 412
Virtual memory:             433    Number of initial runs:           2
Direct I/O:                  22    Maximum merge order:              2
Buffered I/O:                 9    Number of merge passes:           1
Page faults:               3418    Work file allocation:           114
Elapsed time:       00:00:05.98    Elapsed CPU:            00:00:03.63
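The link between the number of initial runs, the maximum merge order, and
the number of merge passes in this display can be approximated with the
usual external-sort arithmetic. The following Python sketch is a
simplified textbook model, not a description of how Sort schedules its
merges: merge_passes is a hypothetical name, and the treatment of zero or
one run as needing no merge pass is an assumption.

```python
import math

def merge_passes(initial_runs, merge_order):
    """Estimate how many merge passes reduce the runs to one string.

    Each pass merges up to merge_order runs into one, so the number of
    runs shrinks by that factor (rounded up) on every pass.
    """
    if initial_runs <= 1:
        return 0                      # assumption: nothing to merge
    if merge_order < 2:
        raise ValueError("merge_order must be at least 2 to merge runs")
    passes = 0
    runs = initial_runs
    while runs > 1:
        runs = math.ceil(runs / merge_order)
        passes += 1
    return passes
```

With the values in the display above (2 initial runs, maximum merge order
2), this model gives 1 merge pass, which agrees with the display.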
/WORK_FILES[=n]
- (Applies to the SORT command only.) Specifies the number of Sort
work files, from 1 through 10 inclusive (the high-performance Sort/Merge
utility supports up to 255). Increasing the number of work files makes
each work file smaller. If the available disks are too small or too
full for work files, increasing the number of files can improve the
efficiency of the Sort operation.
Sort does not create work files until it needs them. If Sort needs work
files, it creates two by default (SORTWORK0, SORTWORK1), which are
placed in the SYS$SCRATCH directory.
Example
$ ASSIGN DRA5: SORTWORK0
$ ASSIGN DB0: SORTWORK1
$ ASSIGN DB1: SORTWORK2
$ SORT/KEY=(POS:1,SIZ:80)/WORK_FILES=3 -
_$ STATS1,STATS2,STATS3,STATS4 SUMMARY.LIS
Because the input files in this Sort operation are large files,
specifying three work files improves the efficiency of the sort
operation.
Note that you can also assign the work files to a specific directory on
a device by including the directory name. For example, to assign
SORTWORK0 to the [WORKSPACE] directory on DRA5, enter the following
command:
$ ASSIGN DRA5:[WORKSPACE] SORTWORK0
11.9.1 Input File Qualifier
The following input qualifier should be included immediately after the
input file specification in the SORT or MERGE command line:
/FORMAT=(RECORD_SIZE:n,FILE_SIZE:n)
- Defines input file characteristics; allows you to specify or
override record or file size. It must be specified immediately after
the input file specification in the Sort or Merge command line.
Sort uses input file size information to determine the amount of
memory needed, as well as the size of the work files for the Sort
operation. If the file size is unknown (for example, you are sorting
files that do not reside on disk or standard ANSI magnetic tape), Sort
assumes a fairly large file size.
Specify the following qualifier values:
RECORD_SIZE:n
Specifies the input file's longest record length (LRL) in bytes. The
maximum longest record length that can be specified depends on the file
organization:
- Sequential: 32,767
- Relative: 16,383
- Indexed-sequential: 16,362
These values include control bytes for variable records with
fixed-length control (VFC) format.
FILE_SIZE:n
Specifies input file size in blocks. The maximum file size accepted is
4,294,967,295 blocks.
You can also use /FORMAT as an output file qualifier. See
Section 11.9.2 for more information.
Example
$ SORT/KEY=(POS:40,SIZ:2,DESC) -
_$ CRA0:YRENDAVG.DAT/FORMAT=(RECORD_SIZE:41,FILE_SIZE:3) -
_$ DESCYRAVG.LIS
Because the input file YRENDAVG.DAT does not reside on a disk
device or ANSI magnetic tape, file organization must be described by
the /FORMAT qualifier.
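The RECORD_SIZE and FILE_SIZE limits listed in this section can be
checked before a command is submitted. The following Python sketch
encodes only the maximum values given above; check_input_format and the
organization keywords are hypothetical names, and no minimum values are
checked because this section does not state any.

```python
# Maximum longest record length (bytes) by file organization, and the
# maximum file size (blocks), as listed in this section.
MAX_RECORD_SIZE = {
    "sequential": 32767,
    "relative": 16383,
    "indexed-sequential": 16362,
}
MAX_FILE_SIZE = 4294967295

def check_input_format(organization, record_size, file_size):
    """Return a list of problems; an empty list means within limits."""
    problems = []
    limit = MAX_RECORD_SIZE.get(organization)
    if limit is None:
        problems.append(f"unknown organization: {organization}")
    elif record_size > limit:
        problems.append(f"RECORD_SIZE {record_size} exceeds {limit}")
    if file_size > MAX_FILE_SIZE:
        problems.append(f"FILE_SIZE {file_size} exceeds {MAX_FILE_SIZE}")
    return problems
```

The values in the example above (RECORD_SIZE:41, FILE_SIZE:3) are within
the limits for every file organization.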
11.9.2 Output File Qualifiers
The following output qualifiers can be used with the SORT and MERGE
commands. To use an output file qualifier, include the qualifier
immediately after the output file specification in the SORT or MERGE
command line.
/ALLOCATION=n
- Specifies the number of blocks, from 1 through 4,294,967,295, to be
preallocated to the output file for optimization. Use this qualifier
when you know that the output file allocation will differ substantially
from the total input file allocation (for example, when reformatting
data or omitting records).
The /ALLOCATION qualifier is required if the /CONTIGUOUS qualifier is
used.
Example
$ SORT/KEY=(POS:1,SIZ:80) STATS.DAT -
_$ SUMMARY.LIS/ALLOCATION=1000/CONTIGUOUS
This SORT command allocates 1000 contiguous blocks for the output
file SUMMARY.LIS.
/BUCKET_SIZE=n
- Specifies OpenVMS RMS bucket size (the number of 512-byte blocks
per bucket) to be used by relative and indexed sequential output disk
files for optimization. A value of 1 through 32 is allowed.
If the output file organization is the same as for the input files, the
default value is the same as the bucket size of the first input file.
If output file organization is different, the default value is 1.
Example
$ SORT/KEY=(POS:1,SIZ:80) STATS1.DAT,STATS2.DAT -
_$ SUMMARY.LIS/BUCKET_SIZE=16/RELATIVE
This SORT command results in the output file SUMMARY.LIS that has a
bucket size of 16 with relative organization.
/CONTIGUOUS
- Requests that the output file be stored in contiguous disk blocks
to decrease access time. This qualifier must be used with the
/ALLOCATION qualifier. By default, Sort/Merge does not allocate
contiguous disk blocks for the output file.
Example
$ SORT/KEY=(POS:1,SIZ:80) STATS.DAT -
_$ SUMMARY.LIS/ALLOCATION=1000/CONTIGUOUS
This SORT command allocates 1,000 contiguous blocks for the output
file SUMMARY.LIS.
/FORMAT=(type:n[,...])
- Specifies the output file record format (FIXED:n, VARIABLE:n, or
CONTROLLED:n) if it differs from the input file format. You can also
specify the size (SIZE:n) or the block size (BLOCK_SIZE:n) of the file
records.
If the Sort operation is a record or tag sort, the default output
record format is the same as the first input file record format. If the
Sort operation is an address or index sort, the default output record
format is fixed record format. If the input files have different record
formats, Sort provides an output record size that is large enough to
contain the largest record in the input files.
You can specify the following qualifier values:
BLOCK_SIZE:n
Specifies the output file's block size, in bytes, if you have directed
the file to magnetic tape. If the input file is a tape file, the block
size of the output file defaults to that of the input file. Otherwise,
the output file block size defaults to the size used when the tape was
mounted.
Acceptable values for n range from 20 to 65,532. To ensure correct data
interchange with other Digital systems, however, specify a block size
of not more than 512 bytes. For compatibility with systems that are not
made by Digital, the block size should not exceed 2,048 bytes.
CONTROLLED:n
Specifies variable with fixed-length control (VFC) records in the
output file.
FIXED:n
Specifies fixed-length records in the output file.
SIZE:n
Specifies the size, in bytes, of the fixed portion of VFC (CONTROLLED)
records, up to a maximum of 255 bytes. If you do not specify SIZE, the
default is the size of the fixed portion of the first input file. If
you specify this size as 0, OpenVMS RMS defaults the value to 2 bytes.
VARIABLE:n
Specifies variable-length records in the output file.
For any qualifier value, you can optionally specify n as
the maximum record size (in bytes) of the output records. The maximum
record size allowed depends on the file organization:
- Sequential files: 32,767
- Relative files: 16,383
- Indexed-sequential files: 16,362
These maximum record size values include control bytes for variable
records with fixed-length control (VFC) format.
Example