DEC Fortran Performance Guide for OpenVMS VAX Systems

  CONTENTS

  Title Page

  Copyright Page

  Preface

  1      When a Program Is Too Slow: Techniques for Improving Performance

  1.1     Measuring Performance

  1.2     Choosing a Solution
    1.2.1      Parallel Processing
    1.2.2      Vector Processing
    1.2.3      Optimization
    1.2.4      Efficient I/O Processing

  2      Scalar Optimization

  2.1     Global Data-Flow Analysis

  2.2     Speed Optimizations
    2.2.1      Inlining Statement Functions
    2.2.2      Removal Optimizations
      2.2.2.1      Compile-Time Operations
      2.2.2.2      Flow Boolean Operations
      2.2.2.3      Compound Logical Expressions in Conditionals
      2.2.2.4      Common Subexpression Removal
      2.2.2.5      Code Hoisting
      2.2.2.6      Value Propagation
      2.2.2.7      Dead Store Elimination
    2.2.3      Replacement Optimizations
      2.2.3.1      Implied-DO Loop Collapsing
      2.2.3.2      Store Delaying Optimizations
      2.2.3.3      Register Usage
      2.2.3.4      Strength Reduction
      2.2.3.5      Block Moves and Initializations
      2.2.3.6      Constants as Literals
      2.2.3.7      JSB for Floating Math Functions
      2.2.3.8      Code Alignment
      2.2.3.9      SIN and COS Functions
      2.2.3.10     Mixed Real/Complex Operations
      2.2.3.11     Peepholes

  2.3     Memory Hierarchy Optimizations

  2.4     Space Optimizations

  2.5     Scalar Optimization Example (RELAX Program)

  3      Parallel, Vector, and Blocking Optimizations

  3.1     Dependence Analysis
    3.1.1      Finding Dependences
    3.1.2      Determining the Direction of Dependences
    3.1.3      Evaluating Other Dependence Characteristics
      3.1.3.1      Loop-Carried Dependences
      3.1.3.2      Loop-Independent Dependences
      3.1.3.3      Distance Between Dependent Pairs
      3.1.3.4      Cycles of Dependences
      3.1.3.5      Uniform and Nonuniform Dependences
    3.1.4      Resolving Dependences

  3.2     Performance Analysis: Loop Structuring and Selection
    3.2.1      Interchanging Loops
      3.2.1.1      Vectorizing an Outer Loop
      3.2.1.2      Serial Loop Innermosting
      3.2.1.3      Moving a Parallel Loop Outward
    3.2.2      Distributing Loops

  3.3     Scheduling Vector Instructions

  3.4     Inlining Intrinsic Functions and BLAS Routines

  3.5     Echelon Parallelism

  3.6     Guided Self-Scheduling

  3.7     Vectorizing and Decomposing the Same Loop

  3.8     Vectorizing Difficult Constructs
    3.8.1      Conditionals (IF Constructs)
    3.8.2      Reduction Operations
    3.8.3      Indirect Addressing
    3.8.4      User-Supplied Induction Variables
    3.8.5      Scalar Expansion
    3.8.6      First-Order Linear Recurrences

  4      Decomposing Loops for Parallel Processing

  4.1     Resource Allocation for Parallel Processing
    4.1.1      Resource Allocation by Program
    4.1.2      SYSGEN Parameters
    4.1.3      User Account Quotas
    4.1.4      Working Set Size

  4.2     Detached and Repetitive Execution and Memory Management
    4.2.1      Executing Parallel Applications in Quick Succession
    4.2.2      Executing Parallel Applications as a Detached Process
    4.2.3      Using OpenVMS Memory Management and the /PARALLEL Qualifier

  4.3     Autodecomposition

  4.4     Directed Decomposition
    4.4.1      Resolving Dependences Manually
      4.4.1.1      Resolving Dependences Involving Temporary Variables
      4.4.1.2      Resolving Loop-Carried Dependences
      4.4.1.3      Restructuring a Loop into an Inner and Outer Nest
      4.4.1.4      Dependences Requiring Locks
    4.4.2      Coding Restrictions
    4.4.3      Manual Optimization
      4.4.3.1      Interchanging Loops
      4.4.3.2      Balancing the Workload
    4.4.4      Adjusting the Run-Time Environment

  4.5     Directed Decomposition Examples

  4.6     Calls to Programs Written in Other Languages

  5      Using Vectorization and Vectorization-Decomposition

  5.1     Using Vectorization

  5.2     Combining Decomposition with Vectorization
    5.2.1      Memory Usage and Startup Overhead
    5.2.2      CPU Time Versus Wall-Clock Time

  5.3     VVIEF on VAX Multiprocessors

  6      Removing Performance Inhibitors

  6.1     Scalar Optimization Inhibitors
    6.1.1      EQUIVALENCE Statements
    6.1.2      Assigned GOTOs
    6.1.3      VOLATILE Statement
    6.1.4      Very Large Programs
    6.1.5      Code Preventing Inline Expansion

  6.2     Inefficient Memory Access
    6.2.1      Cache Performance in Decomposed Loops
    6.2.2      Paging System Performance
    6.2.3      Reducing Page Faults in Large Array Calculations

  6.3     Parallel and Vector Optimization Inhibitors
    6.3.1      Unknown Dependences
      6.3.1.1      Nonuniform Dependences
      6.3.1.2      Unknown Dependences Defined by Input Values
      6.3.1.3      Calls in Loops
    6.3.2      Recurrences
      6.3.2.1      Nonvectorizable Recurrence
      6.3.2.2      Potential Recurrence
      6.3.2.3      Recurrence in a Two-Dimensional Array
      6.3.2.4      Nonuniform Dependence, Crossing
      6.3.2.5      Nonuniform Dependence, Indirect
      6.3.2.6      First Sum
      6.3.2.7      Vectorizable Second-Level Recurrence
    6.3.3      Specific Vectorization Inhibitors
      6.3.3.1      Inhibitors that Disqualify an Entire Loop
      6.3.3.2      Nonvectorizable Constructs
      6.3.3.3      Misaligned Data
      6.3.3.4      Vector Alignment Fault Handler
    6.3.4      Using LSE to Review Diagnostics
    6.3.5      Using the ASSERT Directive and Statement
    6.3.6      Using the INIT_DEP_FWD Directive
      6.3.6.1      Interaction with Other Optimizations
      6.3.6.2      Interaction with Other Dependence Information
      6.3.6.3      Interaction with Other Vectorization Inhibitors
      6.3.6.4      Interaction with Diagnostics
      6.3.6.5      Correctness of Scalar and Vector Results
      6.3.6.6      Debugging
      6.3.6.7      Comparison with Other Implementations

  6.4     Vector Processing Performance Inhibitors
    6.4.1      Very Long Unblocked or Very Short Loops
    6.4.2      Conditionals that Rarely Execute
    6.4.3      SIN, COS, and TAN Functions
    6.4.4      Using the NOVECTOR Directive

  7      Improving I/O Performance

  7.1     Fortran Programming Techniques
    7.1.1      Specifying BLOCKSIZE and BUFFERCOUNT
    7.1.2      Specifying INITIALSIZE and EXTENDSIZE
    7.1.3      Using Unformatted Files
    7.1.4      Writing Whole Arrays or Strings
    7.1.5      Writing Data in the "Natural" Storage Order
    7.1.6      Using SHARED Records Only When Required
    7.1.7      Using Memory for Intermediate Results
    7.1.8      Allowing Implied-DO Loop Collapsing
    7.1.9      Specifying Record Length (RECL)
    7.1.10     Using Variable Format Expressions

  7.2     Other Programming Techniques
    7.2.1      USEROPEN
    7.2.2      Direct Calls to the Queue I/O Request System Service (SYS$QIO)
    7.2.3      Global Sections
    7.2.4      OpenVMS Lock Manager

  8      Accuracy, Exception Reporting, and Debugging

  8.1     Accuracy
    8.1.1      Computer Arithmetic
      8.1.1.1      Conversion
      8.1.1.2      Round-Off Errors During Conversion
      8.1.1.3      Round-Off Errors During Calculation
    8.1.2      Using Dummy Aliases
    8.1.3      Choosing Accuracy Sensitivity
      8.1.3.1      Hoisting Divide Operations
      8.1.3.2      Sum and Product Reductions
      8.1.3.3      First-Order Linear Recurrences
    8.1.4      Using Compound Logical Expressions in Conditionals

  8.2     Exception Reporting

  8.3     Testing and Debugging
    8.3.1      Scalar-Optimized Programs
    8.3.2      Decomposed Programs
    8.3.3      Vectorized Programs
    8.3.4      Correctness of Scalar and Vector Results
    8.3.5      Debugging

  Glossary
    asynchronous read-ahead operation . . . condition handler
    critical region . . . Echelon Parallelism
    fault . . . locality of reference
    lock . . . miss
    natural boundary . . . parallel optimization
    parallel processing . . . replacement optimization
    resolve . . . shared
    space optimization . . . vector
    vectorization . . . wraparound scalar

  EXAMPLES

  2-1        RELAX Source Program

  2-2        RELAX Machine Code (Optimized)

  3-1        Dependence Analysis

  3-2        Determining the Direction of Dependences

  3-3        Decomposed/Vectorized Loop Structure

  3-4        Loop-Interchanged Structure

  3-5        Parallel Loop Moved Outward

  3-6        Parallel and Vector Loop Interchange

  3-7        Distributed Loop

  3-8        Scheduled Vector Instructions

  3-9        Loop Using Echelon Parallelism

  3-10       A Decomposed and Vectorized Loop

  3-11       Listing File Showing Masked Vector Operation

  3-12       Reduction Operations that Vectorize

  3-13       Gather/Scatter Operations

  3-14       Scalar Expansion

  4-1        Loops Considered for Autodecomposition

  4-2        Aligned Loop

  4-3        Transformed Loop Using Code Replication

  4-4        Distributed Loop

  4-5        Decomposed Loop Using Locks

  4-6        Balancing the Workload Using NWORKERS

  4-7        Manually Decomposed Program Using Matrix Arithmetic

  4-8        Livermore Loop, Kernels 3 and 5

  6-1        Decomposition and Vectorization Inhibitor Diagnostics

  6-2        Report of Unknown Nonuniform Dependences

  6-3        Call Inhibiting Vectorization

  6-4        Revised Structure Allowing More Vectorization

  6-5        Recoded Loop with Wraparound Scalar

  6-6        Listing File Documenting Misaligned Data

  6-7        Report of Unknown Dependence

  6-8        Unusable Assertions

  8-1        Using Dummy Aliases

  8-2        A Manually Decomposed Program

  8-3        Debugging Session

  FIGURES

  1-1        A Decision Tree for Computational Solutions

  1-2        Processes in a Parallel Program

  1-3        Vector Architecture

  2-1        Performance of Blocked and Unblocked Loops

  3-1        Direction of Dependences in Example 3-2

  3-2        Direction of a Loop-Carried Dependence

  3-3        Direction of a Loop-Independent and Operand Dependence

  3-4        Direction of a Cycle of Dependences

  3-5        Direction of Dependences in Multiple Arrays

  3-6        Direction of a Data Recurrence

  3-7        Direction of a Uniform Dependence

  3-8        Direction of a Nonuniform Dependence

  3-9        Loop with Multiple Dependences

  3-10       Sectioning a Vectorized Loop

  3-11       Overlapping and Chaining in Vector Operations

  3-12       Synchronization of Processes

  3-13       Masked Vector Operation

  4-1        Lock Options in a Decomposed Loop

  5-1        Vector Performance on Symmetrical Multiprocessors

  6-1        Hierarchical Memory System

  6-2        Misaligned Data

  TABLES

  1          Conventions Used in This Document

  1-1        I/O Versus CPU Time in Serial and Parallel Processing

  2-1        Code Hoisting

  3-1        /SHOW=LOOPS Labels

  3-2        First-Order Linear Recurrences Using Vectorization

  3-3        Recurrences that Do and Do Not Benefit from Vectorization

  4-1        Global Section Parameter Values

  4-2        Parallel Compiler Directives

  4-3        Logical Names that Adjust the Run-Time Environment

  5-1        Qualifiers for Parallel-Vector Processing

  7-1        Relative CPU Time

  7-2        High-Performance Values for RECL