CONTENTS

Title Page
Copyright Page
Preface
1 Introduction to Dependable Computing
  1.1 Levels of System Dependability Requirements
  1.2 How Do You Build Dependable Systems?
  1.3 Dependability Terms
  1.4 Basic Concepts of Dependability
    1.4.1 Primary Dependability Strategies
    1.4.2 Redundant Functional Units
      1.4.2.1 Manual Redundancy
      1.4.2.2 Automatic Redundancy
      1.4.2.3 Capacity-Related Redundancy
      1.4.2.4 Redundancy in Computing Systems
  1.5 How a Dependable System Supports the Business
  1.6 Balance Is Critical to a Dependable System
2 Analyzing Dependable System Requirements
  2.1 Dependability Is a Journey, Not a Destination
  2.2 Determining Your Dependability Requirements
    2.2.1 Collecting Requirements
    2.2.2 Priority Requirements
    2.2.3 Analyzing the Sample Company's Dependability Requirements
    2.2.4 Generating the Sample Company's First Steps to Dependability
    2.2.5 Mapping Options to Requirements for the Sample Company
3 Dependability Options of the System Building Blocks
  3.1 Analyzing Environmental Options
    3.1.1 Utilities
    3.1.2 Structures
    3.1.3 Networks
  3.2 Analyzing Hardware Options
    3.2.1 Starting with New Equipment
    3.2.2 Making Do with Currently Installed Equipment
    3.2.3 Investing a Little to Gain a Lot
  3.3 Analyzing Communications Options
    3.3.1 Restricting Access to the Computer Room
    3.3.2 Providing Corporate Data to Local Personal Computers
    3.3.3 Weaving a Tapestry of Computing Resources
    3.3.4 Providing a System Network
  3.4 Analyzing Software Options
    3.4.1 Writing Custom Applications
    3.4.2 Acquiring Software Packages
    3.4.3 Selecting a Systems Integrator
  3.5 Analyzing Operational Procedures Options
    3.5.1 Avoiding User Errors
    3.5.2 Training, Testing, and Drills
    3.5.3 Extended Hours of Operator Coverage
    3.5.4 Beepers for System Managers and Programmers
    3.5.5 Lights Out Computer Facilities
    3.5.6 Policies Regarding System Privileges
  3.6 Analyzing Personnel Options
    3.6.1 Teamwork Makes the System Work
    3.6.2 Robots, We Are Not!
4 Balancing Dependability with Other Business Considerations
  4.1 Identifying Constraints to Achieving a Dependable System
    4.1.1 Performance Tradeoffs
      4.1.1.1 Circuit Level Redundancy
      4.1.1.2 Subsystem Level Redundancy
      4.1.1.3 System Kernel Level Redundancy
      4.1.1.4 Independently Recoverable System Kernels
      4.1.1.5 Network Level Redundancy
      4.1.1.6 Operating System Performance
      4.1.1.7 Application Software Performance
      4.1.1.8 Personnel Productivity and System Performance
    4.1.2 Implementation Tradeoffs
      4.1.2.1 Feasibility Considerations
      4.1.2.2 Timing Considerations
      4.1.2.3 Learning Curve Considerations
    4.1.3 Staffing Tradeoffs and Considerations
    4.1.4 Vendor Tradeoffs
      4.1.4.1 Reliable Products
      4.1.4.2 Adequate Coverage
      4.1.4.3 Committed Response
      4.1.4.4 Location Flexibility
      4.1.4.5 Flexible Roles
      4.1.4.6 One-Stop Service
    4.1.5 Cost Tradeoffs
5 Maintaining a Dependable Environment
  5.1 Electrical Power
  5.2 Air Conditioning
  5.3 Water Supplies
  5.4 Site Security
  5.5 Desktop Environments
  5.6 Dealing with Personnel
    5.6.1 Comprehensive Training
    5.6.2 Suitable Tools
    5.6.3 Order and Neatness Contributing to Safety
    5.6.4 People-Proof Covers
    5.6.5 Operational Zones
    5.6.6 Lights Out Computing
  5.7 Coping with Disasters
    5.7.1 Time Domain Considerations
    5.7.2 Hot Standby Sites
    5.7.3 Business Considerations
  5.8 Overall System Considerations
6 Dependable Hardware Configurations
  6.1 Eliminating Single Points of Failure
  6.2 Conventional VAX Systems
    6.2.1 Dependability Enhancements to a Sample Configuration
  6.3 Fault Tolerant VAXft Systems
  6.4 MIRA AS-Application Switch Systems
  6.5 VAXcluster Hardware Topologies
    6.5.1 Ethernet Interconnect (IEEE 802.3)
    6.5.2 Digital Storage Systems Interconnect (DSSI)
    6.5.3 Computer Interconnect (CI)
    6.5.4 Fiber Distributed Data Interface (FDDI)
    6.5.5 Mixed Interconnect VAXclusters
  6.6 Dependability Characteristics Summary
7 Dependability Characteristics of Communications Networks
  7.1 Degrees of Protections from Networking Faults
  7.2 Providing Multiple Paths to Ethernet and FDDI Segments
    7.2.1 Recommendations for High Availability of Local Area VAXclusters
    7.2.2 Sample Local Area VAXcluster Configurations with Multiadapter Connections to LAN Segments
    7.2.3 Ethernet and FDDI Options
    7.2.4 Allowing for LAN Bridge Failover
    7.2.5 Adjusting LRPSIZE for Configurations That Include FDDI Nodes
    7.2.6 Alternate Adapter Booting for Satellite Nodes
    7.2.7 Changing the LAN Address in the DECnet Database to Allow a Cluster Satellite to Boot with Any Adapter
  7.3 Troubleshooting with the VAXcluster Network Failure Analysis Program
    7.3.1 Summary of Using the Failure Analysis Program
    7.3.2 Summary of Subroutine Package
  7.4 Network Configurations Using FDDI as the VAXcluster Interconnect
  7.5 Providing Multiple WAN Connections for VAXft Systems
  7.6 Using a DECnet Cluster Alias to Promote Network Application Availability
  7.7 Using DNS to Support Network Dependability
  7.8 Using DFS for Transparent Network File Access
  7.9 Proactive Network Monitoring and Analysis Products
8 Building Dependable Software Applications
  8.1 VMS Dependability Features
    8.1.1 VMS Support of Redundant Functional Units
    8.1.2 VAXcluster Application Environment Topologies
      8.1.2.1 Application Scaling Considerations
      8.1.2.2 Resource Contention Considerations
      8.1.2.3 Independent Processes Paradigm
      8.1.2.4 Distributed Arbitration Paradigm
      8.1.2.5 Client/Server Computing Paradigm
      8.1.2.6 Synchronization Techniques
    8.1.3 DECdtm Services and Two-Phase Commit Protocol
      8.1.3.1 Characteristics of Distributed Transactions
      8.1.3.2 Transaction Processing System Model
        8.1.3.2.1 Resource Manager
        8.1.3.2.2 Transaction Manager
        8.1.3.2.3 Log Manager
    8.1.4 Disk Defragmentation Applications
    8.1.5 RMS Journaling
    8.1.6 VMS Queue Manager Failover Capabilities
  8.2 Writing Predictable, Dependable Code
    8.2.1 Avoiding Errors During Software Specification
    8.2.2 Avoiding Errors in Design and Implementation
    8.2.3 Predicting Future Software Requirements
    8.2.4 Surviving External Failures
    8.2.5 Providing for Software Evolution
    8.2.6 Managing Systems Integration
  8.3 Prototyping Applications to Build in Dependability
  8.4 Testing Applications to Verify Application Dependability
  8.5 Dependability Features of Application Software
    8.5.1 Building Dependable Database Applications
      8.5.1.1 Database Failover in VAXclusters
        8.5.1.1.1 How Rdb/VMS Databases Work in VAXclusters
        8.5.1.1.2 Where to Place Rdb/VMS Files in VAXclusters
        8.5.1.1.3 Minimizing Impact of Component Failure on Database Access
      8.5.1.2 Backing Up Active Rdb/VMS Databases
        8.5.1.2.1 Rdb/VMS Online Backup Operation
        8.5.1.2.2 Large Database Backup Strategies
      8.5.1.3 Rdb/VMS Use of Two-Phase Commit Protocol
      8.5.1.4 Automatic Cleanup of Rdb/VMS Databases
      8.5.1.5 Online Restructuring of Database Characteristics and Definitions
      8.5.1.6 Database Security Impact on Application Dependability
        8.5.1.6.1 Rdb/VMS Access Control for C2 Security
        8.5.1.6.2 Rdb/VMS Auditing Capabilities
      8.5.1.7 Using DECtrace and RdbExpert with Rdb/VMS Applications
    8.5.2 Dependability Aspects of Application Form and Function
    8.5.3 Dependability Characteristics of Transaction Processing Monitors
      8.5.3.1 How TP Monitors Can Assist Dependability Goals of Production Systems
      8.5.3.2 How VAX ACMS Can Assist the Dependability of Production Systems
        8.5.3.2.1 ACMS Balances Process Pools
        8.5.3.2.2 ACMS Fails Over Applications
        8.5.3.2.3 ACMS Provides Automatic Front-End Terminal Failover
        8.5.3.2.4 ACMS Uses Queues to Capture User Requests
  8.6 Managing Shared Information with Software Tools
    8.6.1 Using the Digital COHESION Environment
    8.6.2 Defining Symbols and Logical Names
    8.6.3 Using the DNS Namespace
9 Dependable Data Center Techniques
  9.1 Managing Complex Computing Environments
  9.2 Data Center Operations
    9.2.1 Using DCL Procedures to Minimize User Error
    9.2.2 Scheduling Preventative Maintenance
  9.3 Failures and Recovery
    9.3.1 Catastrophic Failures
    9.3.2 Intermittent Failures
    9.3.3 Multiple-Cause Failures
    9.3.4 False Failures
  9.4 Upgrades and Installations
    9.4.1 Continuing Service to Users During Upgrades and Installations
    9.4.2 Controlling Quotas and Privileges
  9.5 Backup Procedures
  9.6 Dependable Disk Devices
    9.6.1 Restoring Disk Devices Containing Databases
    9.6.2 Defragmenting Disks to Improve I/O Performance
  9.7 VMS Batch and Print Recovery Techniques
  9.8 Implementing the Security Policy of the Data Center
  9.9 Supporting a Distributed Environment
  9.10 Supporting VAXcluster System Environments
    9.10.1 VAXcluster Quorum Disk
    9.10.2 VAXcluster Common System Disks
    9.10.3 Multiple VMS Versions (Rolling Upgrade)
10 Dependable Services from Digital
  10.1 Application Characterization and Predictive Sizing Services
  10.2 Capacity Planning Service
  10.3 COHESION Support Services
  10.4 Contingency Planning Assistance Service
  10.5 Customer Training Advice Package
    10.5.1 Course Formats
    10.5.2 Purchase Options
    10.5.3 Comprehensive Training Solutions
  10.6 DECstart Services
  10.7 Digital Program Methodology Services
  10.8 DSNlink-Customer Access to Existing Knowledge Databases
  10.9 Enterprise Integration Centers Advice Package
  10.10 Enterprise Planning and Design Services
  10.11 Help Desk Service
  10.12 Migration Services
  10.13 Network Performance Consulting Services
  10.14 Packaged Application Software Solution Service
  10.15 Professional Services
  10.16 Recover-All Service
  10.17 RESTART Service
  10.18 Systems Integration Advice Package
  10.19 VAX Performance and Capacity Services
  10.20 VMS Security Enhancement Service
  10.21 VMS Security Review Service
11 Case Study: Lights Out Data Center
  11.1 The Customer Support Center Business
  11.2 Time for a Radical Change
  11.3 Customer Expectations
  11.4 Strategies for Achieving 100% Application Availability
    11.4.1 Process Strategy to Meet CSC Business Needs
    11.4.2 Staffing Strategy to Meet CSC Business Needs
    11.4.3 Hardware Strategy to Meet CSC Business Needs
    11.4.4 Software Strategy to Meet CSC Business Needs
    11.4.5 Environment Strategy to Meet CSC Business Needs
    11.4.6 Telecommunications Strategy to Meet CSC Business Needs
  11.5 Implementing the Data Center's Strategies
    11.5.1 Sensitive Implementation of a Refocused Staff
    11.5.2 Technology Planning and Utilization
    11.5.3 Protecting Against Environmental Factors
    11.5.4 Application Development Management and Implementation
    11.5.5 Overall Operations Support Implementation
  11.6 DECalert and Other Products Used to Manage CSC Data Center Operations
  11.7 Additional Benefits of the Lights Out Environment
A Data Center Evaluation Checklists
  A.1 General Planning Checklist
  A.2 Environmental Management Checklist
  A.3 Data Center Organization Checklist
  A.4 Security Checklist
  A.5 Application Software Checklist
  A.6 Digital Service and Support Checklist
  A.7 Compliance Summary
B ALL-IN-1 System Monitoring Checklist
  B.1 Monitoring Electronic Messaging
    B.1.1 Monitoring the Size of the Sender Queue
      B.1.1.1 How to Check for Deferred Messages
      B.1.1.2 Recommendations
    B.1.2 Monitoring the Size of the Fetcher Queue
      B.1.2.1 How to Check the ALL-IN-1 Fetcher Queue
      B.1.2.2 How to Check the Message Router ALL-IN-1 Mailbox
      B.1.2.3 Recommendations
    B.1.3 Checking the Mail Log Files
      B.1.3.1 Recommendations
    B.1.4 Checking Message Router Links
      B.1.4.1 How to Check if Message Router Is Available
      B.1.4.2 How to Check Message Router on a Remote Node
      B.1.4.3 Recommendations
    B.1.5 Monitoring the Size of the Mail Areas
      B.1.5.1 Recommendations
  B.2 Checking That VMS Setup Is Appropriate for Your ALL-IN-1 System
    B.2.1 Checking the Modes of Logical Names
      B.2.1.1 Recommendations
    B.2.2 Checking SYSGEN Parameters
      B.2.2.1 How to Calculate the Value of GBLSECTIONS
      B.2.2.2 How to Calculate the Value of GBLPAGES
      B.2.2.3 How to Calculate the Value of GBLPAGFIL
      B.2.2.4 Recommendations
    B.2.3 Checking SYSUAF Parameters
      B.2.3.1 Recommendations
    B.2.4 Checking the Protections on Major Files
      B.2.4.1 Recommendations
  B.3 Ensuring the Integrity of an ALL-IN-1 System
    B.3.1 Testing and Repairing Users' File Cabinets
    B.3.2 Backing Up ALL-IN-1
    B.3.3 Recommendations
C Bibliography
  C.1 Digital Publications
  C.2 Other Publications
Glossary
  24x365 . . . disaster tolerant computing
  disk fragmentation . . . front-end
  hardware-based fault tolerance . . . redundant
  reliability . . . Y-connector
  zone . . . zone

FIGURES

1-1 Operational Conditions Metaphor
1-2 Characteristics of Dependable Computing Systems
1-3 Metaphor of Selected Automotive Components and Dependability Strategies
1-4 Building Blocks of Dependable Systems
2-1 Continuous Improvement Process
2-2 Worksheet for Collecting Dependability Requirements
2-3 Worksheet for Brainstorming First Steps
4-1 A System with Independently Recoverable Kernels
4-2 Individual Kernel Capacity Versus Time: Example State Diagrams
4-3 Total System Capacity Versus Time: Example State Diagram
5-1 Time Domain Behavior of VAX System Configurations
6-1 VAX System Configuration Enhancements Worksheet
6-2 Conventional VAX System: Example Configuration
6-3 Fault Tolerant VAX System: Example Configuration
6-4 MIRA AS: Example Configuration
6-5 Local Area (Ethernet) VAXcluster System: Example Configuration
6-6 DSSI VAXcluster System: Example Configuration
6-7 CI VAXcluster System: Example Configuration
6-8 FDDI VAXcluster System: Example Configuration
6-9 Wide Area Network: Example Configuration
7-1 Sample Configuration for a Local Area VAXcluster Connected to Two LAN Segments
7-2 Sample Configuration for Local Area Cluster Systems Connected to Three LAN Segments
7-3 FDDI in Conjunction with Ethernet in a VAXcluster System
7-4 Multiple-Site Data Center VAXcluster System
7-5 A DNS Namespace
8-1 Two-Phase Commit Protocol for a Distributed Transaction
8-2 Sample Debit/Credit Transaction Execution
8-3 Participants in a Distributed Transaction Example
8-4 Failover and Recovery Process for Rdb/VMS Users in VAXcluster
8-5 Coordination of Rdb/VMS Online Backups
8-6 Two-Phase Commit for Funds Transfer Example
8-7 ACMS Application Failover
8-8 ACMS Front-End Terminal Failover
8-9 ACMS Request Capture
11-1 CSC Data Center Operations Management
11-2 DECalert Sensors and Alert Notifications
11-3 DIANA Modules and DISPLAY

TABLES

1-1 Primary Strategies to Enhance Dependability
1-2 Applying Primary Dependability Strategies to Building Blocks
2-1 Dependability Requirements for the Sample Company
2-2 The Sample Company's Proposed Enhancements
4-1 Dependability Constraints: Sample Questions
6-1 VAX System Configuration Enhancements: Sample Worksheet
6-2 VAXcluster Interconnect Characteristics Summary
6-3 Dependability Characteristics Summary
7-1 Ethernet and FDDI Adapters
9-1 Data Center Management Portfolio
A-1 Compliance Summary