CONTENTS

Title Page
Copyright Page
Preface

1 When an Existing Application is Too Slow
  1.1 Programming Languages and HPO Compatibility
  1.2 The Importance of an Efficient Algorithm
  1.3 Programming Techniques that Improve Performance
    1.3.1 Measuring Performance
    1.3.2 Choosing a Solution
      1.3.2.1 Parallel Processing
      1.3.2.2 Vector Processing
      1.3.2.3 Optimization
      1.3.2.4 Efficient I/O Processing

2 Scalar Optimization
  2.1 Global Data-Flow Analysis
  2.2 Speed Optimizations
    2.2.1 Inline Expansion of Statement Functions
    2.2.2 Removal Optimizations
      2.2.2.1 Compile-Time Operations
      2.2.2.2 Flow Boolean Operations
      2.2.2.3 Compound Logical Expressions in IF Statements
      2.2.2.4 Common Subexpression Elimination
      2.2.2.5 Code Hoisting
      2.2.2.6 Value Propagation
      2.2.2.7 Dead Store Elimination
    2.2.3 Replacement Optimizations
      2.2.3.1 Implied-DO Loop Collapsing
      2.2.3.2 Store Delaying Optimizations
      2.2.3.3 Register Usage
      2.2.3.4 Autoincrement and Autodecrement Mode Addressing
      2.2.3.5 Strength Reduction
      2.2.3.6 Block Moves and Block Initializations
      2.2.3.7 Constants as Literals
      2.2.3.8 JSB for Floating Math Functions
      2.2.3.9 Code Alignment
      2.2.3.10 SIN and COS Functions
      2.2.3.11 Mixed Real/Complex Operations
      2.2.3.12 Peepholes
  2.3 Space Optimizations
  2.4 Scalar Optimization Example

3 Parallel and Vector Optimization
  3.1 Dependence Analysis
    3.1.1 Locating Data Dependences
    3.1.2 Determining the Direction of Dependences
    3.1.3 Evaluating Other Characteristics of Dependences
      3.1.3.1 Loop-Carried Dependences
      3.1.3.2 Loop-Independent Dependences
      3.1.3.3 Distance Between Dependent Pairs
      3.1.3.4 Cycles of Dependences
      3.1.3.5 Uniform and Nonuniform Dependences
    3.1.4 Resolving Dependences
  3.2 Performance Analysis
  3.3 Loop Structuring and Selection
    3.3.1 Interchanging Loops
      3.3.1.1 Vectorizing an Outer Loop
      3.3.1.2 Serial Loop Innermosting
      3.3.1.3 Moving a Parallel Loop Outward
    3.3.2 Distributing the Loop
  3.4 Scheduling Vector Instructions
  3.5 Inlining Intrinsic Functions and BLAS Routines
  3.6 Echelon Parallelism
  3.7 Guided Self-Scheduling
  3.8 Vectorizing and Decomposing the Same Loop
  3.9 Vectorizing Difficult Constructs
    3.9.1 Conditionals (IF Constructs)
    3.9.2 Reduction Operations
    3.9.3 Indirect Addressing
    3.9.4 User-Supplied Induction Variables
    3.9.5 Temporary Scalar Results
    3.9.6 First-Order Linear Recurrences

4 Decomposing Loops for Parallel Processing
  4.1 Resource Allocation for Parallel Processing
    4.1.1 Resource Allocation by Program
    4.1.2 SYSGEN Parameters
    4.1.3 User Account Quotas
    4.1.4 Working Set Size
  4.2 Autodecomposition
  4.3 Directed Decomposition
    4.3.1 Resolving Dependences Manually
      4.3.1.1 Resolving Dependences Involving Temporary Variables
      4.3.1.2 Resolving Loop-Carried Dependences
        4.3.1.2.1 Loop Alignment
        4.3.1.2.2 Code Replication
        4.3.1.2.3 Loop Distribution
      4.3.1.3 Restructuring a Loop into an Inner and Outer Nest
      4.3.1.4 Dependences Requiring Locks
    4.3.2 Coding Restrictions
    4.3.3 Manual Optimization
      4.3.3.1 Interchanging Loops
      4.3.3.2 Balancing the Workload
    4.3.4 Adjusting the Run-time Environment
  4.4 Directed Decomposition Examples
  4.5 Calls to Programs Written in Other Languages

5 Using Vectorization and Vectorization-Decomposition
  5.1 Using Vectorization
  5.2 Combining Decomposition with Vectorization

6 Improving I/O Performance
  6.1 FORTRAN Programming Techniques
    6.1.1 Specifying BLOCKSIZE and BUFFERCOUNT
    6.1.2 Specifying INITIALSIZE and EXTENDSIZE
    6.1.3 Using Unformatted Files
    6.1.4 Writing Whole Arrays or Strings
    6.1.5 Writing Data in the "Natural" Storage Order
    6.1.6 Using SHARED Records Only When Required
    6.1.7 Using Memory for Intermediate Results
    6.1.8 Allowing Implied-DO Loop Collapsing
    6.1.9 Specifying RECL
    6.1.10 Using Variable Format Expressions
  6.2 Other Programming Techniques

7 Removing Performance Inhibitors
  7.1 Inhibitors to Scalar Optimization
    7.1.1 Code that Prevents Inline Expansion
    7.1.2 Assigned GOTOs
    7.1.3 EQUIVALENCE Statement
    7.1.4 Very Large Programs
    7.1.5 VOLATILE Statement
  7.2 Inhibitors to Parallel and Vector Performance
    7.2.1 Unknown Dependences
      7.2.1.1 Applying Assertions
      7.2.1.2 Using LSE to Review Diagnostics
      7.2.1.3 Nonuniform Unknown Dependences
      7.2.1.4 Unknown Dependences Defined by Input Values
    7.2.2 Recurrences
    7.2.3 Inefficient Memory Access
      7.2.3.1 Cache Performance
      7.2.3.2 Paging System Performance
    7.2.4 Calls in Loops
    7.2.5 VVIEF on VAX Multiprocessors
    7.2.6 Specific Vectorization Inhibitors
      7.2.6.1 Inhibitors that Disqualify an Entire Loop
      7.2.6.2 Nonvectorizable Constructs
      7.2.6.3 Misaligned Data
    7.2.7 Vector Processing Performance Inhibitors
      7.2.7.1 Very Long or Short Loops
      7.2.7.2 Conditionals that Rarely Execute
      7.2.7.3 SIN, COS, and TAN Functions

8 Accuracy, Exception Reporting, and Debugging
  8.1 Accuracy
    8.1.1 Computer Arithmetic
      8.1.1.1 Conversion
      8.1.1.2 Round-off Errors
    8.1.2 Using /[NO]DUMMY_ALIASES
    8.1.3 Using [NO]ACCURACY_SENSITIVE
    8.1.4 Using Compound Logical Expressions in IF Statements
  8.2 Exception Reporting
  8.3 Testing and Debugging Programs
    8.3.1 Debugging and Testing Scalar-Optimized Programs
    8.3.2 Decomposed Programs
    8.3.3 Vectorized Programs

A Qualifiers for Decomposition and Vectorization

GLOSSARY

EXAMPLES
  1-1 Simple Linear Search
  1-2 Binary Search
  2-1 RELAX Source Program
  2-2 RELAX Machine Code (Optimized)
  3-1 Evaluating Data Dependences
  3-2 Establishing the Direction of Dependences
  3-3 Decomposed/Vectorized Loop Structure
  3-4 Loop-Interchanged Structure
  3-5 Parallel Loop Moved Outward
  3-6 Parallel and Vector Loop Interchange
  3-7 Distributed Loop
  3-8 Scheduled Vector Instructions
  3-9 Loop Using Echelon Parallelism
  3-10 A Decomposed and Vectorized Loop
  3-11 Listing File Showing Masked Vector Operation
  3-12 Reduction Operations that Vectorize
  3-13 Gather/Scatter Operations
  3-14 Scalar Expansion
  4-1 Loops Considered for Autodecomposition
  4-2 Aligned Loop
  4-3 Transformed Loop Using Code Replication
  4-4 Distributed Loop
  4-5 Decomposed Loop Using Locks
  4-6 Balancing the Workload Using NWORKERS
  4-7 Manually Decomposed Program Using Matrix Arithmetic
  4-8 Livermore Loop, Kernels 3 and 5
  7-1 Decomposition and Vectorization Inhibitor Diagnostics
  7-2 Unusable Assertions
  7-3 Report of Unknown Dependence
  7-4 Report of Unknown Nonuniform Dependences
  7-5 Call Inhibiting Vectorization
  7-6 Revised Structure Allowing More Vectorization
  7-7 Recoded Loop with Wraparound Scalar
  7-8 Listing File Documenting Misaligned Data
  8-1 Using /ASSUME=[NO]DUMMY_ALIASES
  8-2 A Manually Decomposed Program
  8-3 Debugging Session

FIGURES
  1-1 A Decision Tree for Computational Solutions
  1-2 Processes in a Parallel Program
  1-3 Vector Processing Architecture Model
  3-1 Direction of Dependences in Example 3-2
  3-2 Direction of a Loop-Carried Dependence
  3-3 Direction of a Loop-Independent and Operand Dependence
  3-4 Direction of a Cycle of Dependences
  3-5 Direction of Dependences in Multiple Arrays
  3-6 Direction of a Data Recurrence
  3-7 Direction of a Uniform Dependence
  3-8 Direction of a Nonuniform Dependence
  3-9 Loop with Multiple Data Dependences
  3-10 Sectioning a Vectorized Loop
  3-11 Overlapping and Chaining in Vector Operations
  3-12 Synchronization of Processes
  3-13 Masked Vector Operation
  4-1 Lock Options in a Decomposed Loop
  7-1 VAX Hierarchical Memory
  7-2 Vector Performance on Symmetrical Multiprocessors
  7-3 Misaligned Data
  7-4 Effect of Loop Length on Vector Performance

TABLES
  1-1 I/O Versus CPU Time in Serial and Parallel Processing
  3-1 /SHOW=LOOPS Labels
  3-2 BLAS Routines that Can be Inlined
  3-3 First-Order Linear Recurrences that Use Vectorization
  3-4 Recurrences that Do and Do Not Benefit from Vectorization
  4-1 Global Section Parameter Values
  4-2 Parallel Compiler Directives
  4-3 Logicals That Adjust the Run-Time Environment
  5-1 Qualifiers for Parallel-Vector Processing
  6-1 Relative CPU Time
  6-2 High-Performance Values for RECL
  A-1 VAX FORTRAN HPO Parallel and Vector Qualifiers