Compaq Fortran
User Manual for
OpenVMS Alpha Systems

5.7.2.2 Integer Multiplication and Division Expansion

Expansion of multiplication and division refers to bit shifts that allow faster multiplication and division while producing the same result. For example, the integer expression (I*17) can be calculated as I with a 4-bit shift plus the original value of I. This can be expressed using the Compaq Fortran ISHFT intrinsic function:

J1 = I*17
J2 = ISHFT(I,4) + I ! equivalent expression for I*17

The optimizer uses machine code that, like the ISHFT intrinsic function, shifts bits to expand multiplication and division by literals.
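The equivalence can be checked with a short program. The following is a minimal sketch, not compiler output: the program name, the variable ALL_MATCH, and the test range 0 to 1000 are chosen here for illustration only. It relies on 17 being 16 + 1, so that a 4-bit left shift (multiplication by 16) plus the original value gives I*17.

```fortran
! Sketch: compare direct multiplication by 17 with the
! shift-and-add form for a range of non-negative integers.
PROGRAM SHIFT_MULTIPLY
  IMPLICIT NONE
  INTEGER :: I, J1, J2
  LOGICAL :: ALL_MATCH        ! illustrative name, not from the manual

  ALL_MATCH = .TRUE.
  DO I = 0, 1000
    J1 = I*17                 ! direct multiplication
    J2 = ISHFT(I,4) + I       ! I*16 + I, written as a shift and an add
    IF (J1 /= J2) ALL_MATCH = .FALSE.
  END DO

  IF (ALL_MATCH) THEN
    PRINT *, 'I*17 and ISHFT(I,4) + I agree for I = 0 to 1000'
  ELSE
    PRINT *, 'Mismatch found'
  END IF
END PROGRAM SHIFT_MULTIPLY
```

The range is limited to small non-negative values so that neither form overflows a default INTEGER.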

5.7.2.3 Compile-Time Operations

Compaq Fortran does as many operations as possible at compile time rather than having them done at run time.

Constant Operations

Compaq Fortran can perform many operations on constants (including PARAMETER constants):

Constants preceded by a unary minus sign are negated.
Expressions involving +, -, *, or / operators are evaluated; for example:
PARAMETER (NN=27)
I = 2*NN+J ! Becomes: I = 54 + J
Evaluation of some constant functions and operators is performed at compile time. This includes certain functions of constants, concatenation of string constants, and logical and relational operations involving constants.
Lower-ranked constants are converted to the data type of the higher-ranked operand:
REAL X, Y
X = 10 * Y ! Becomes: X = 10.0 * Y
Array address calculations involving constant subscripts are simplified at compile time whenever possible:
INTEGER I(10,10)
I(1,2) = I(4,5) ! Compiled as a direct load and store

Algebraic Reassociation Optimizations

Compaq Fortran delays operations to see whether they have no effect or can be transformed to have no effect. If they have no effect, these operations are removed. A typical example involves unary minus and .NOT. operations:

X = -Y * -Z ! Becomes: Y * Z

5.7.2.4 Value Propagation

Compaq Fortran tracks the values assigned to variables and constants, including those from DATA statements, and traces them to every place they are used. Compaq Fortran uses the value itself when it is more efficient to do so.

When compiling subprograms, Compaq Fortran analyzes the program to ensure that propagation is safe if the subroutine is called more than once.

Value propagation frequently leads to more value propagation. Compaq Fortran can eliminate run-time operations, comparisons and branches, and whole statements.

In the following example, constants are propagated, eliminating multiple operations from run time:

Original Code             Optimized Code

PI = 3.14
.                         .
.                         .
.                         .
PIOVER2 = PI/2            PIOVER2 = 1.57
.                         .
.                         .
.                         .
I = 100                   I = 100
.                         .
.                         .
.                         .
IF (I.GT.1) GOTO 10
10 A(I) = 3.0*Q           10 A(100) = 3.0*Q

5.7.2.5 Dead Store Elimination

If a variable is assigned but never used, Compaq Fortran eliminates the entire assignment statement:

X = Y*Z
.
.
. ! If X is not used in between, X=Y*Z is eliminated.
X = A(I,J)* PI

Programs used for performance analysis often contain such unnecessary operations. When you measure the performance of such programs compiled with Compaq Fortran, they may show unrealistically good results. Realistic results are possible only with program units that use their results in output statements.
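One way to keep a timed computation from being treated as a dead store is to print its result. The following is a minimal sketch that assumes the Fortran 95 CPU_TIME intrinsic is available; the program name, the loop body, and the iteration count N are illustrative choices, not taken from this manual.

```fortran
! Sketch: a timing loop whose result is used in an output statement.
PROGRAM TIME_LOOP
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000000   ! illustrative iteration count
  INTEGER :: I
  REAL :: T0, T1
  DOUBLE PRECISION :: TOTAL

  TOTAL = 0.0D0
  CALL CPU_TIME(T0)                   ! assumes a Fortran 95 compiler
  DO I = 1, N
    TOTAL = TOTAL + SQRT(DBLE(I))
  END DO
  CALL CPU_TIME(T1)

  ! Printing TOTAL means the value computed in the loop is used,
  ! so the assignments to TOTAL are not dead stores.
  PRINT *, 'Result:     ', TOTAL
  PRINT *, 'CPU seconds:', T1 - T0
END PROGRAM TIME_LOOP
```

If the PRINT of TOTAL were removed, an optimizer would be free to discard the loop, and the measured time would no longer reflect the work the loop appears to do.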

5.7.2.6 Register Usage

A large program usually has more data that would benefit from being held in registers than there are registers to hold the data. In such cases, Compaq Fortran typically tries to use the registers according to the following descending priority list:

For temporary operation results, including array indexes
For variables
For addresses of arrays (base address)
All other usages

Compaq Fortran uses heuristic algorithms and a modest amount of computation to attempt to determine an effective usage for the registers.

Holding Variables in Registers

Because operations using registers are much faster than using memory, Compaq Fortran generates code that uses the Alpha 64-bit integer and floating-point registers instead of memory locations. Knowing when Compaq Fortran uses registers may be helpful when doing certain forms of debugging.

Compaq Fortran uses registers to hold the values of variables whenever the Fortran language does not require them to be held in memory, such as holding the values of temporary results of subexpressions, even if /NOOPTIMIZE (same as /OPTIMIZE=LEVEL=0 or no optimization) was specified.

Compaq Fortran may hold the same variable in different registers at different points in the program:

V = 3.0*Q
.
.
.
X = SIN(Y)*V
.
.
.
V = PI*X
.
.
.
Y = COS(Y)*V

Compaq Fortran may choose one register to hold the first use of V and another register to hold the second. Both registers can be used for other purposes at points in between. There may be times when the value of the variable does not exist anywhere in the registers. If the value of V is never needed in memory, it is never stored.

Compaq Fortran uses registers to hold the values of I, J, and K (so long as there are no other optimization effects, such as loops involving the variables):

A(I) = B(J) + C(K)

More typically, an expression uses the same index variable:

A(K) = B(K) + C(K)

In this case, K is loaded into only one register and is used to index all three arrays at the same time.

5.7.2.7 Mixed Real/Complex Operations

In mixed REAL/COMPLEX operations, Compaq Fortran avoids the conversion and performs a simplified operation on:

Add (+), subtract (-), and multiply (*) operations if either operand is REAL
Divide (/) operations if the right operand is REAL

For example, if variable R is REAL and A and B are COMPLEX, no conversion occurs with the following:

COMPLEX A, B
.
.
.
B = A + R
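The effect of the simplified operation can be illustrated at the source level. The following is a sketch only: it shows that adding a REAL value to a COMPLEX value needs one real addition on the real part, while the imaginary part is unchanged. The names B_CONV and B_SIMPLE and the sample values are illustrative, and the code does not show how the compiler actually generates instructions.

```fortran
! Sketch: three ways of writing the same mixed REAL/COMPLEX addition.
PROGRAM MIXED_ADD
  IMPLICIT NONE
  COMPLEX :: A, B, B_CONV, B_SIMPLE   ! B_CONV, B_SIMPLE are illustrative
  REAL :: R

  A = (1.0, 2.0)
  R = 3.0

  B = A + R                              ! mixed operation as written
  B_CONV = A + CMPLX(R, 0.0)             ! R converted to COMPLEX first
  B_SIMPLE = CMPLX(REAL(A) + R, AIMAG(A))  ! one real add, no conversion

  PRINT *, 'As written: ', B
  PRINT *, 'Converted:  ', B_CONV
  PRINT *, 'Simplified: ', B_SIMPLE
END PROGRAM MIXED_ADD
```

All three forms give the same value; the simplified form avoids both the conversion and the addition of a zero imaginary part.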

5.7.3 Global Optimizations

To enable global optimizations, use /OPTIMIZE=LEVEL=2 or a higher optimization level (LEVEL=3, LEVEL=4, or LEVEL=5). Using /OPTIMIZE=LEVEL=2 or higher also enables local optimizations (LEVEL=1).

Global optimizations include:

Data-flow analysis
Split lifetime analysis
Strength reduction (replaces a CPU-intensive calculation with one that uses fewer CPU cycles)
Code motion (also called code hoisting)
Instruction scheduling

Data-flow and split lifetime analysis (global data analysis) traces the values of variables and whole arrays as they are created and used in different parts of a program unit. During this analysis, Compaq Fortran assumes that any pair of array references to a given array might access the same memory location, unless a constant subscript is used in both cases.

To eliminate unnecessary recomputations of invariant expressions in loops, Compaq Fortran hoists them out of the loops so they execute only once.
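The effect of code hoisting can be pictured by moving an invariant expression out of a loop by hand. The following is a minimal sketch, assuming a simple loop in which X*Y does not change between iterations; the program name, the temporary T, and the array size are illustrative and are not compiler output.

```fortran
! Sketch: a loop-invariant expression computed inside the loop,
! and the same loop with the expression hoisted by hand.
PROGRAM HOIST
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 8
  REAL :: A(N), B(N), X, Y, T
  INTEGER :: I

  X = 2.0
  Y = 3.0

  ! As written: X*Y is recomputed on every iteration.
  DO I = 1, N
    A(I) = REAL(I) + X*Y
  END DO

  ! Hoisted by hand: X*Y is computed once, before the loop.
  T = X*Y
  DO I = 1, N
    B(I) = REAL(I) + T
  END DO

  IF (ALL(A == B)) THEN
    PRINT *, 'Both forms produce the same values'
  ELSE
    PRINT *, 'Results differ'
  END IF
END PROGRAM HOIST
```

With global optimizations enabled, there is normally no need to make this change in source code; the sketch only shows what the transformation does.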

Global data analysis includes selecting which data items are analyzed. Some data items are analyzed as a group and some are analyzed individually. Compaq Fortran limits or may disqualify data items that participate in the following constructs, generally because it cannot fully trace their values.

Data items in the following constructs can make global optimizations less effective:

VOLATILE declarations
VOLATILE declarations are needed to use certain run-time features of the operating system. Declare a variable as VOLATILE if the variable can be accessed using rules in addition to those provided by the Fortran 90/95 language. Examples include:
- COMMON data items or entire common blocks that can change value by means other than direct assignment or during a routine call. For example, if a variable in COMMON can change value by means of an OpenVMS AST, you must declare the variable or the COMMON block to which it belongs as volatile.
- Variables read or written by an AST routine or a condition handler, including those in a common block or module.
- An address not saved by the %LOC built-in function.
As requested by the VOLATILE statement, Compaq Fortran disqualifies any volatile variables from global data analysis.
Subroutine calls or external function references
Compaq Fortran cannot trace data flow in a called routine that is not part of the program unit being compiled, unless the same FORTRAN command compiled multiple program units (see Section 5.1.2). Arguments passed to a called routine that are used again in a calling program are assumed to be modified, unless the proper INTENT is specified in an interface block (the compiler must assume they are referenced by the called routine).
Common blocks
Compaq Fortran limits optimizations on data items in common blocks. If common block data items are referenced inside called routines, their values might be altered. In the following example, variable I might be altered by FOO, so Compaq Fortran cannot predict its value in subsequent references.
COMMON /X/ I
DO J=1,N
  I = J
  CALL FOO
  A(I) = I
ENDDO
Variables in Fortran 90/95 modules
Compaq Fortran limits optimizations on variables in Fortran 90/95 modules. Like common blocks, if the variables in Fortran modules are referenced inside called routines, their values might be altered.
Variables referenced by a %LOC built-in function or variables with the TARGET attribute
Compaq Fortran limits optimizations on variables indirectly referenced by a %LOC function or variables with the TARGET attribute, because the called routine may dereference the pointer to such a variable.
Equivalence groups
An equivalence group is formed explicitly with the EQUIVALENCE statement or implicitly by the COMMON statement. A program section is a particular common block or local data area for a particular routine. Compaq Fortran combines equivalence groups within the same program section and in the same program unit.
The equivalence groups in separate program sections are analyzed separately, but the data items within each group are not, so some optimizations are limited to the data within each group.

5.7.4 Additional Global Optimizations

To enable additional global optimizations, use /OPTIMIZE=LEVEL=3 or a higher optimization level (LEVEL=4 or LEVEL=5). Using /OPTIMIZE=LEVEL=3 or higher also enables local optimizations (LEVEL=1) and global optimizations (LEVEL=2).

Additional global optimizations improve speed at the cost of longer compile times and possibly extra code size.

5.7.4.1 Loop Unrolling

At optimization level /OPTIMIZE=LEVEL=3 or above, Compaq Fortran attempts to unroll certain innermost loops, minimizing the number of branches and grouping more instructions together to allow efficient overlapped instruction execution (instruction pipelining). The best candidates for loop unrolling are innermost loops with limited control flow.

As more loops are unrolled, the average size of basic blocks increases. Loop unrolling generates multiple copies of the code for the loop body (loop code iterations) in a manner that allows efficient instruction pipelining.

The loop body is replicated a certain number of times, substituting index expressions. An initialization loop might be created to align the first reference with the main series of loops. A remainder loop might be created for leftover work.

The number of times a loop is unrolled can be determined either by the optimizer or by using the /OPTIMIZE=UNROLL=n qualifier, which can specify the limit for loop unrolling. Unless the user specifies a value, the optimizer unrolls most loops four times, or two times for certain loops (those with a large estimated code size or with branches out of the loop).
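The replicated loop body and the remainder loop described above can be sketched at the source level. The following hand-unrolled loop is an illustration only, assuming an unroll factor of four; the names NMAIN and C and the array size are chosen here, and the code the optimizer actually generates may differ (for example, it may also create an initialization loop).

```fortran
! Sketch: a loop unrolled four times by hand, with a remainder loop.
PROGRAM UNROLL_SKETCH
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 10
  REAL :: A(N), B(N), C(N)
  INTEGER :: I, NMAIN               ! NMAIN is an illustrative name

  DO I = 1, N
    B(I) = REAL(I)
  END DO

  ! Original loop: one element and one branch per iteration.
  DO I = 1, N
    A(I) = B(I) * 2.0
  END DO

  ! Unrolled loop: four copies of the body per iteration, with the
  ! index expressions I, I+1, I+2, and I+3 substituted.
  NMAIN = N - MOD(N, 4)
  DO I = 1, NMAIN, 4
    C(I)   = B(I)   * 2.0
    C(I+1) = B(I+1) * 2.0
    C(I+2) = B(I+2) * 2.0
    C(I+3) = B(I+3) * 2.0
  END DO

  ! Remainder loop: the leftover elements when N is not a multiple of 4.
  DO I = NMAIN + 1, N
    C(I) = B(I) * 2.0
  END DO

  IF (ALL(A == C)) THEN
    PRINT *, 'Unrolled and original loops agree'
  ELSE
    PRINT *, 'Results differ'
  END IF
END PROGRAM UNROLL_SKETCH
```

The unrolled form executes fewer branches and presents four independent statements per iteration, which gives the instruction scheduler more to overlap.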

Array operations are often represented as a nested series of loops when expanded into instructions. The innermost loop for the array operation is the best candidate for loop unrolling (like DO loops). For example, the following array operation (once optimized) is represented by nested loops, where the innermost loop is a candidate for loop unrolling:

A(1:100,2:30) = B(1:100,1:29) * 2.0

5.7.4.2 Code Replication to Eliminate Branches

In addition to loop unrolling and other optimizations, the number of branches is reduced by replicating code so that the branches are eliminated. Code replication decreases the number of basic blocks and increases instruction-scheduling opportunities.

Code replication normally occurs when a branch is at the end of a flow of control, such as a routine with multiple, short exit sequences. The code at the exit sequence gets replicated at the various places where a branch to it might occur.

For example, consider the following unoptimized routine and its optimized equivalent that uses code replication (R0 is register 0):

Unoptimized Instructions     Optimized (Replicated) Instructions

.                            .
.                            .
.                            .
branch to exit1              move 1 into R0
                             return
.                            .
.                            .
.                            .
branch to exit1              move 1 into R0
                             return
.                            .
.                            .
.                            .
exit1: move 1 into R0        move 1 into R0
return                       return

Similarly, code replication can also occur within a loop that contains a small amount of shared code at the bottom of a loop and a case-type dispatch within the loop. The loop-end test-and-branch code might be replicated at the end of each case to create efficient instruction pipelining within the code for each case.

5.7.5 Automatic Inlining

To enable optimizations that perform automatic inlining, use /OPTIMIZE=LEVEL=4 or a higher optimization level (LEVEL=5). Using /OPTIMIZE=LEVEL=4 also enables local optimizations (LEVEL=1), global optimizations (LEVEL=2), and additional global optimizations (LEVEL=3).

The default is /OPTIMIZE=LEVEL=4 (same as /OPTIMIZE).

5.7.5.1 Interprocedure Analysis

Compiling multiple source files at optimization level /OPTIMIZE=LEVEL=4 or higher lets the compiler examine more code for possible optimizations, including multiple program units. This results in:

Inlining more procedures
More complete global data analysis
Reducing the number of external references to be resolved during linking

As more procedures are inlined, the size of the executable program and compile times may increase, but execution time should decrease.

5.7.5.2 Inlining Procedures

Inlining refers to replacing a subprogram reference (such as a CALL statement or function invocation) with the replicated code of the subprogram. As more procedures are inlined, global optimizations often become more effective.
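The replacement can be pictured with a small function and the same computation written out at the call site. The following is a sketch only; the function SQUARE_PLUS_ONE and the variable names are hypothetical, and the hand-written second form merely illustrates what the call becomes after inlining.

```fortran
! Sketch: a function reference and its hand-inlined equivalent.
PROGRAM INLINE_SKETCH
  IMPLICIT NONE
  REAL :: X, Y1, Y2

  X = 3.0
  Y1 = SQUARE_PLUS_ONE(X)    ! subprogram reference as written
  Y2 = X*X + 1.0             ! the same reference after inlining by hand

  IF (Y1 == Y2) THEN
    PRINT *, 'Call and inlined form agree:', Y1
  ELSE
    PRINT *, 'Results differ'
  END IF

CONTAINS

  ! Hypothetical small procedure, a typical inlining candidate.
  REAL FUNCTION SQUARE_PLUS_ONE(V)
    REAL, INTENT(IN) :: V
    SQUARE_PLUS_ONE = V*V + 1.0
  END FUNCTION SQUARE_PLUS_ONE

END PROGRAM INLINE_SKETCH
```

Once the body is visible at the call site, the call overhead disappears and the expression becomes available to other optimizations such as value propagation.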

The optimizer inlines small procedures, limiting inlining candidates based on such criteria as:

Estimated size of code
Number of call sites
Use of constant arguments

You can specify:

The /OPTIMIZE=LEVEL=n qualifier to control the optimization level. For example, specifying /OPTIMIZE=LEVEL=4 or higher enables interprocedure optimizations. Different /OPTIMIZE=LEVEL=n values set different levels of inlining. For example, /OPTIMIZE=LEVEL=4 sets /OPTIMIZE=INLINE=SPEED.
One of the /OPTIMIZE=INLINE=xxxxx keywords to directly control the inlining of procedures (see Section 5.8.5). For example, /OPTIMIZE=INLINE=SPEED inlines more procedures than /OPTIMIZE=INLINE=SIZE.

For More Information:

On Compaq Fortran statements, see the Compaq Fortran Language Reference Manual.
On controlling inlining using /OPTIMIZE=INLINE=keyword, see Section 5.8.5.

5.7.6 Loop Transformation and Software Pipelining

The loop transformation optimizations, and software pipelining with its associated software dependence analysis, are enabled by using the /OPTIMIZE=LEVEL=5 qualifier. In certain cases, these optimizations improve run-time performance.

The loop transformation optimizations apply to array references within loops and can apply to multiple nested loops. These optimizations can improve the performance of the memory system.

Software pipelining applies instruction scheduling to certain innermost loops, allowing instructions within a loop to "wrap around" and execute in a different iteration of the loop. This can reduce the impact of long-latency operations, resulting in faster loop execution.

Software pipelining also enables the prefetching of data to reduce the impact of cache misses.

For More Information:

On loop transformations, see Section 5.8.1.
On software pipelining, see Section 5.8.2.

5.8 Other Qualifiers Related to Optimization

In addition to the /OPTIMIZE=LEVEL qualifiers (discussed in Section 5.7), several other FORTRAN command qualifiers and /OPTIMIZE keywords can prevent or facilitate improved optimizations.

5.8.1 Loop Transformation

The loop transformation optimizations are enabled by using the /OPTIMIZE=LOOPS qualifier or the /OPTIMIZE=LEVEL=5 qualifier. Loop transformation attempts to improve performance by rewriting loops to make better use of the memory system. By rewriting loops, the loop transformation optimizations can increase the number of instructions executed, which can degrade the run-time performance of some programs.

To request loop transformation optimizations without software pipelining, do one of the following:

Specify /OPTIMIZE=LEVEL=5 with /OPTIMIZE=NOPIPELINE (preferred method)
Specify /OPTIMIZE=LOOPS with /OPTIMIZE=LEVEL=4, LEVEL=3, or LEVEL=2. This optimization is not performed at optimization levels below LEVEL=2.

The loop transformation optimizations apply to array references within loops. These optimizations can improve the performance of the memory system and usually apply to multiple nested loops. The loops chosen for loop transformation optimizations are always counted loops. Counted loops use a variable to count iterations, thereby determining the number before entering the loop. For example, most DO loops are counted loops.

Conditions that typically prevent the loop transformation optimizations from occurring include subprogram references that are not inlined (such as an external function call), complicated exit conditions, and uncounted loops.

The types of optimizations associated with /OPTIMIZE=LOOPS include the following:

Loop blocking: Can minimize memory system use with multidimensional array elements by completing as many operations as possible on array elements currently in the cache. Also known as loop tiling.
Loop distribution: Moves instructions from one loop into separate, new loops. This can reduce the amount of memory used during one loop so that the remaining memory may fit in the cache. It can also create improved opportunities for loop blocking.
Loop fusion: Combines instructions from two or more adjacent loops that use some of the same memory locations into a single loop. This can avoid the need to load those memory locations into the cache multiple times and improves opportunities for instruction scheduling.
Loop interchange: Changes the nesting order of some or all loops. This can minimize the stride of array element access during loop execution and reduce the number of memory accesses needed. Also known as loop permutation.
Scalar replacement: Replaces the use of an array element with a scalar variable under certain conditions.
Outer loop unrolling: Unrolls the outer loop inside the inner loop under certain conditions to minimize the number of instructions and memory accesses needed. This also improves opportunities for instruction scheduling and scalar replacement.
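Of the transformations listed above, loop interchange is the easiest to picture at the source level. The following is a minimal sketch written by hand, not compiler output; the program name and array sizes are illustrative. Because Fortran stores arrays in column-major order, making the first subscript vary in the innermost loop gives stride-1 access.

```fortran
! Sketch: the same nested loop before and after interchanging the
! loop order by hand.
PROGRAM INTERCHANGE
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 4, M = 3
  REAL :: A(N,M), B(N,M)
  INTEGER :: I, J

  ! As written: the inner loop varies the second subscript, so
  ! consecutive iterations access elements N locations apart.
  DO I = 1, N
    DO J = 1, M
      A(I,J) = REAL(I + 10*J)
    END DO
  END DO

  ! Interchanged: the inner loop varies the first subscript, so
  ! consecutive iterations access adjacent memory locations.
  DO J = 1, M
    DO I = 1, N
      B(I,J) = REAL(I + 10*J)
    END DO
  END DO

  IF (ALL(A == B)) THEN
    PRINT *, 'Both loop orders produce the same array'
  ELSE
    PRINT *, 'Results differ'
  END IF
END PROGRAM INTERCHANGE
```

The interchange is valid here because each element is assigned independently; loops with dependences between iterations may not be interchangeable.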

For More Information:

On the interaction of command-line options and timing programs compiled with the loop transformation optimizations, see Section 5.7.

5.8.2 Software Pipelining

Software pipelining and additional software dependence analysis are enabled by using the /OPTIMIZE=PIPELINE qualifier or by the /OPTIMIZE=LEVEL=5 qualifier. Software pipelining in certain cases improves run-time performance.

The software pipelining optimization applies instruction scheduling to certain innermost loops, allowing instructions within a loop to "wrap around" and execute in a different iteration of the loop. This can reduce the impact of long-latency operations, resulting in faster loop execution.

Loop unrolling (enabled at /OPTIMIZE=LEVEL=3 or above) cannot schedule across iterations of a loop. Because software pipelining can schedule across loop iterations, it can perform more efficient scheduling to eliminate instruction stalls within loops.

For instance, if software dependence analysis of data flow reveals that certain calculations can be done before or after that iteration of the loop, software pipelining reschedules those instructions ahead of or behind that loop iteration, at places where their execution can prevent instruction stalls or otherwise improve performance.
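The idea of moving work into a different iteration can be pictured at the source level, although software pipelining itself operates on instructions rather than on Fortran statements. In the following hand-written sketch, the load of the next element is issued during the current iteration, with one statement before the loop and one after it. The names T and C, the array size, and the loop itself are illustrative assumptions, not output from the compiler.

```fortran
! Sketch: a loop and a hand-pipelined version in which the load for
! iteration I+1 overlaps the multiply and store for iteration I.
PROGRAM PIPELINE_SKETCH
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 6
  REAL :: A(N), B(N), C(N), T
  INTEGER :: I

  DO I = 1, N
    B(I) = REAL(I)
  END DO

  ! Original loop: load, multiply, and store all belong to iteration I.
  DO I = 1, N
    A(I) = B(I) * 2.0
  END DO

  ! Hand-pipelined loop.
  T = B(1)                 ! before the loop: load for the first iteration
  DO I = 1, N - 1
    C(I) = T * 2.0         ! multiply and store using the earlier load
    T = B(I+1)             ! load for the next iteration
  END DO
  C(N) = T * 2.0           ! after the loop: finish the last iteration

  IF (ALL(A == C)) THEN
    PRINT *, 'Pipelined and original loops agree'
  ELSE
    PRINT *, 'Results differ'
  END IF
END PROGRAM PIPELINE_SKETCH
```

Separating the load from its use in this way gives a long-latency operation time to complete before its result is needed, which is the effect the optimization aims for.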

Software pipelining also enables the prefetching of data to reduce the impact of cache misses.

Software pipelining can be more effective when you combine /OPTIMIZE=PIPELINE (or /OPTIMIZE=LEVEL=5) with the appropriate /OPTIMIZE=TUNE=keyword for the target Alpha processor generation (see Section 5.8.6).

To specify software pipelining without loop transformation optimizations, do one of the following:

Specify /OPTIMIZE=LEVEL=5 with /OPTIMIZE=NOLOOPS (preferred method)
Specify /OPTIMIZE=PIPELINE with /OPTIMIZE=LEVEL=4, /OPTIMIZE=LEVEL=3, or /OPTIMIZE=LEVEL=2. This optimization is not performed at optimization levels below LEVEL=2.

For this version of Compaq Fortran, loops chosen for software pipelining:

Are always innermost loops (those executed the most).
Do not contain branches or procedure calls.
Do not use COMPLEX floating-point data.

By modifying the unrolled loop and inserting instructions as needed before and/or after the unrolled loop, software pipelining generally improves run-time performance, except where the loops contain a large number of instructions with many existing overlapped operations. In this case, software pipelining may not have enough registers available to effectively improve execution performance, and using /OPTIMIZE=LEVEL=5 (or /OPTIMIZE=PIPELINE) may not improve run-time performance compared to using /OPTIMIZE=LEVEL=4.

For programs that contain loops that exhaust available registers, longer execution times may result with /OPTIMIZE=LEVEL=5 or /OPTIMIZE=PIPELINE. In cases where performance does not improve, consider compiling with the /OPTIMIZE=UNROLL=1 qualifier along with /OPTIMIZE=LEVEL=5 or /OPTIMIZE=PIPELINE, to possibly improve the effects of software pipelining.

For More Information:

On the interaction of command-line options and timing programs compiled with software pipelining, see Section 5.7.

5.8.3 Setting Multiple Qualifiers with the /FAST Qualifier

Specifying the /FAST qualifier sets the following qualifiers:

/ALIGNMENT=COMMONS=NATURAL (see Section 5.3)
/ASSUME=NOACCURACY_SENSITIVE (see Section 5.8.8)
/MATH_LIBRARY=FAST (see Section 2.3.27)

5.8.4 Controlling Loop Unrolling

You can specify the number of times a loop is unrolled by using the /OPTIMIZE=UNROLL=n qualifier (see Section 2.3.32).

Using /OPTIMIZE=UNROLL=n can also influence the run-time results of software pipelining optimizations performed when you specify /OPTIMIZE=LEVEL=5.

Although unrolling loops usually improves run-time performance, the size of the executable program may increase.

For More Information:

On loop unrolling, see Section 5.7.4.1.

5.8.5 Controlling the Inlining of Procedures

To specify the types of procedures to be inlined, use the /OPTIMIZE=INLINE=keyword keywords. Also, compile multiple source files together and specify an adequate optimization level, such as /OPTIMIZE=LEVEL=4.

If you omit /OPTIMIZE=INLINE=keyword, the optimization level /OPTIMIZE=LEVEL=n qualifier used determines the types of procedures that are inlined.

The /OPTIMIZE=INLINE=keyword keywords are as follows:

NONE (same as /OPTIMIZE=NOINLINE) inlines statement functions but not other procedures. This type of inlining occurs if you specify /OPTIMIZE=LEVEL=0, LEVEL=1, LEVEL=2, or LEVEL=3 and omit INLINE=keyword.
MANUAL (same as NONE) inlines statement functions but not other procedures. This type of inlining occurs if you specify /OPTIMIZE=LEVEL=2 or LEVEL=3 and omit INLINE=keyword.
In addition to inlining statement functions, SIZE inlines any procedures that the Compaq Fortran optimizer expects will improve run-time performance with no likely significant increase in program size.
In addition to inlining statement functions, SPEED inlines any procedures that the Compaq Fortran optimizer expects will improve run-time performance with a likely significant increase in program size. This type of inlining occurs if you specify /OPTIMIZE=LEVEL=4 or LEVEL=5 and omit /OPTIMIZE=INLINE=keyword.
ALL inlines every call that can possibly be inlined while generating correct code, including the following:
- Statement functions (always inlined).
- Any procedures that Compaq Fortran expects will improve run-time performance with a likely significant increase in program size.
- Any other procedures that can possibly be inlined and generate correct code. Certain recursive routines are not inlined to prevent infinite expansion.

For information on the inlining of other procedures (inlined at optimization level /OPTIMIZE=LEVEL=4 or higher), see Section 5.7.5.2.

Maximizing the types of procedures that are inlined usually improves run-time performance, but compile-time memory usage and the size of the executable program may increase.

To determine whether using /OPTIMIZE=INLINE=ALL benefits your particular program, time program execution for the same program compiled with and without /OPTIMIZE=INLINE=ALL.
