Building Dependable Systems: The VMS Approach

  CONTENTS

  Title Page

  Copyright Page

  Preface

  1      Introduction to Dependable Computing

  1.1     Levels of System Dependability Requirements

  1.2     How Do You Build Dependable Systems?

  1.3     Dependability Terms

  1.4     Basic Concepts of Dependability
    1.4.1      Primary Dependability Strategies
    1.4.2      Redundant Functional Units
      1.4.2.1      Manual Redundancy
      1.4.2.2      Automatic Redundancy
      1.4.2.3      Capacity-Related Redundancy
      1.4.2.4      Redundancy in Computing Systems

  1.5     How a Dependable System Supports the Business

  1.6     Balance Is Critical to a Dependable System

  2      Analyzing Dependable System Requirements

  2.1     Dependability Is a Journey, Not a Destination

  2.2     Determining Your Dependability Requirements
    2.2.1      Collecting Requirements
    2.2.2      Priority Requirements
    2.2.3      Analyzing the Sample Company's Dependability Requirements
    2.2.4      Generating the Sample Company's First Steps to Dependability
    2.2.5      Mapping Options to Requirements for the Sample Company

  3      Dependability Options of the System Building Blocks

  3.1     Analyzing Environmental Options
    3.1.1      Utilities
    3.1.2      Structures
    3.1.3      Networks

  3.2     Analyzing Hardware Options
    3.2.1      Starting with New Equipment
    3.2.2      Making Do with Currently Installed Equipment
    3.2.3      Investing a Little to Gain a Lot

  3.3     Analyzing Communications Options
    3.3.1      Restricting Access to the Computer Room
    3.3.2      Providing Corporate Data to Local Personal Computers
    3.3.3      Weaving a Tapestry of Computing Resources
    3.3.4      Providing a System Network

  3.4     Analyzing Software Options
    3.4.1      Writing Custom Applications
    3.4.2      Acquiring Software Packages
    3.4.3      Selecting a Systems Integrator

  3.5     Analyzing Operational Procedures Options
    3.5.1      Avoiding User Errors
    3.5.2      Training, Testing, and Drills
    3.5.3      Extended Hours of Operator Coverage
    3.5.4      Beepers for System Managers and Programmers
    3.5.5      Lights Out Computer Facilities
    3.5.6      Policies Regarding System Privileges

  3.6     Analyzing Personnel Options
    3.6.1      Teamwork Makes the System Work
    3.6.2      Robots, We Are Not!

  4      Balancing Dependability with Other Business Considerations

  4.1     Identifying Constraints to Achieving a Dependable System
    4.1.1      Performance Tradeoffs
      4.1.1.1      Circuit Level Redundancy
      4.1.1.2      Subsystem Level Redundancy
      4.1.1.3      System Kernel Level Redundancy
      4.1.1.4      Independently Recoverable System Kernels
      4.1.1.5      Network Level Redundancy
      4.1.1.6      Operating System Performance
      4.1.1.7      Application Software Performance
      4.1.1.8      Personnel Productivity and System Performance
    4.1.2      Implementation Tradeoffs
      4.1.2.1      Feasibility Considerations
      4.1.2.2      Timing Considerations
      4.1.2.3      Learning Curve Considerations
    4.1.3      Staffing Tradeoffs and Considerations
    4.1.4      Vendor Tradeoffs
      4.1.4.1      Reliable Products
      4.1.4.2      Adequate Coverage
      4.1.4.3      Committed Response
      4.1.4.4      Location Flexibility
      4.1.4.5      Flexible Roles
      4.1.4.6      One-Stop Service
    4.1.5      Cost Tradeoffs

  5      Maintaining a Dependable Environment

  5.1     Electrical Power

  5.2     Air Conditioning

  5.3     Water Supplies

  5.4     Site Security

  5.5     Desktop Environments

  5.6     Dealing with Personnel
    5.6.1      Comprehensive Training
    5.6.2      Suitable Tools
    5.6.3      Order and Neatness Contributing to Safety
    5.6.4      People-Proof Covers
    5.6.5      Operational Zones
    5.6.6      Lights Out Computing

  5.7     Coping with Disasters
    5.7.1      Time Domain Considerations
    5.7.2      Hot Standby Sites
    5.7.3      Business Considerations

  5.8     Overall System Considerations

  6      Dependable Hardware Configurations

  6.1     Eliminating Single Points of Failure

  6.2     Conventional VAX Systems
    6.2.1      Dependability Enhancements to a Sample Configuration

  6.3     Fault Tolerant VAXft Systems

  6.4     MIRA AS-Application Switch Systems

  6.5     VAXcluster Hardware Topologies
    6.5.1      Ethernet Interconnect (IEEE 802.3)
    6.5.2      Digital Storage Systems Interconnect (DSSI)
    6.5.3      Computer Interconnect (CI)
    6.5.4      Fiber Distributed Data Interface (FDDI)
    6.5.5      Mixed Interconnect VAXclusters

  6.6     Dependability Characteristics Summary

  7      Dependability Characteristics of Communications Networks

  7.1     Degrees of Protection from Networking Faults

  7.2     Providing Multiple Paths to Ethernet and FDDI Segments
    7.2.1      Recommendations for High Availability of Local Area VAXclusters
    7.2.2      Sample Local Area VAXcluster Configurations with Multiadapter Connections to LAN Segments
    7.2.3      Ethernet and FDDI Options
    7.2.4      Allowing for LAN Bridge Failover
    7.2.5      Adjusting LRPSIZE for Configurations That Include FDDI Nodes
    7.2.6      Alternate Adapter Booting for Satellite Nodes
    7.2.7      Changing the LAN Address in the DECnet Database to Allow a Cluster Satellite to Boot with Any Adapter

  7.3     Troubleshooting with the VAXcluster Network Failure Analysis Program
    7.3.1      Summary of Using the Failure Analysis Program
    7.3.2      Summary of Subroutine Package

  7.4     Network Configurations Using FDDI as the VAXcluster Interconnect

  7.5     Providing Multiple WAN Connections for VAXft Systems

  7.6     Using a DECnet Cluster Alias to Promote Network Application Availability

  7.7     Using DNS to Support Network Dependability

  7.8     Using DFS for Transparent Network File Access

  7.9     Proactive Network Monitoring and Analysis Products

  8      Building Dependable Software Applications

  8.1     VMS Dependability Features
    8.1.1      VMS Support of Redundant Functional Units
    8.1.2      VAXcluster Application Environment Topologies
      8.1.2.1      Application Scaling Considerations
      8.1.2.2      Resource Contention Considerations
      8.1.2.3      Independent Processes Paradigm
      8.1.2.4      Distributed Arbitration Paradigm
      8.1.2.5      Client/Server Computing Paradigm
      8.1.2.6      Synchronization Techniques
    8.1.3      DECdtm Services and Two-Phase Commit Protocol
      8.1.3.1      Characteristics of Distributed Transactions
      8.1.3.2      Transaction Processing System Model
        8.1.3.2.1       Resource Manager
        8.1.3.2.2       Transaction Manager
        8.1.3.2.3       Log Manager
    8.1.4      Disk Defragmentation Applications
    8.1.5      RMS Journaling
    8.1.6      VMS Queue Manager Failover Capabilities

  8.2     Writing Predictable, Dependable Code
    8.2.1      Avoiding Errors During Software Specification
    8.2.2      Avoiding Errors in Design and Implementation
    8.2.3      Predicting Future Software Requirements
    8.2.4      Surviving External Failures
    8.2.5      Providing for Software Evolution
    8.2.6      Managing Systems Integration

  8.3     Prototyping Applications to Build in Dependability

  8.4     Testing Applications to Verify Application Dependability

  8.5     Dependability Features of Application Software
    8.5.1      Building Dependable Database Applications
      8.5.1.1      Database Failover in VAXclusters
        8.5.1.1.1       How Rdb/VMS Databases Work in VAXclusters
        8.5.1.1.2       Where to Place Rdb/VMS Files in VAXclusters
        8.5.1.1.3       Minimizing Impact of Component Failure on Database Access
      8.5.1.2      Backing Up Active Rdb/VMS Databases
        8.5.1.2.1       Rdb/VMS Online Backup Operation
        8.5.1.2.2       Large Database Backup Strategies
      8.5.1.3      Rdb/VMS Use of Two-Phase Commit Protocol
      8.5.1.4      Automatic Cleanup of Rdb/VMS Databases
      8.5.1.5      Online Restructuring of Database Characteristics and Definitions
      8.5.1.6      Database Security Impact on Application Dependability
        8.5.1.6.1       Rdb/VMS Access Control for C2 Security
        8.5.1.6.2       Rdb/VMS Auditing Capabilities
      8.5.1.7      Using DECtrace and RdbExpert with Rdb/VMS Applications
    8.5.2      Dependability Aspects of Application Form and Function
    8.5.3      Dependability Characteristics of Transaction Processing Monitors
      8.5.3.1      How TP Monitors Can Assist Dependability Goals of Production Systems
      8.5.3.2      How VAX ACMS Can Assist the Dependability of Production Systems
        8.5.3.2.1       ACMS Balances Process Pools
        8.5.3.2.2       ACMS Fails Over Applications
        8.5.3.2.3       ACMS Provides Automatic Front-End Terminal Failover
        8.5.3.2.4       ACMS Uses Queues to Capture User Requests

  8.6     Managing Shared Information with Software Tools
    8.6.1      Using the Digital COHESION Environment
    8.6.2      Defining Symbols and Logical Names
    8.6.3      Using the DNS Namespace

  9      Dependable Data Center Techniques

  9.1     Managing Complex Computing Environments

  9.2     Data Center Operations
    9.2.1      Using DCL Procedures to Minimize User Error
    9.2.2      Scheduling Preventative Maintenance

  9.3     Failures and Recovery
    9.3.1      Catastrophic Failures
    9.3.2      Intermittent Failures
    9.3.3      Multiple-Cause Failures
    9.3.4      False Failures

  9.4     Upgrades and Installations
    9.4.1      Continuing Service to Users During Upgrades and Installations
    9.4.2      Controlling Quotas and Privileges

  9.5     Backup Procedures

  9.6     Dependable Disk Devices
    9.6.1      Restoring Disk Devices Containing Databases
    9.6.2      Defragmenting Disks to Improve I/O Performance

  9.7     VMS Batch and Print Recovery Techniques

  9.8     Implementing the Security Policy of the Data Center

  9.9     Supporting a Distributed Environment

  9.10    Supporting VAXcluster System Environments
    9.10.1     VAXcluster Quorum Disk
    9.10.2     VAXcluster Common System Disks
    9.10.3     Multiple VMS Versions (Rolling Upgrade)

  10     Dependable Services from Digital

  10.1    Application Characterization and Predictive Sizing Services

  10.2    Capacity Planning Service

  10.3    COHESION Support Services

  10.4    Contingency Planning Assistance Service

  10.5    Customer Training Advice Package
    10.5.1     Course Formats
    10.5.2     Purchase Options
    10.5.3     Comprehensive Training Solutions

  10.6    DECstart Services

  10.7    Digital Program Methodology Services

  10.8    DSNlink-Customer Access to Existing Knowledge Databases

  10.9    Enterprise Integration Centers Advice Package

  10.10   Enterprise Planning and Design Services

  10.11   Help Desk Service

  10.12   Migration Services

  10.13   Network Performance Consulting Services

  10.14   Packaged Application Software Solution Service

  10.15   Professional Services

  10.16   Recover-All Service

  10.17   RESTART Service

  10.18   Systems Integration Advice Package

  10.19   VAX Performance and Capacity Services

  10.20   VMS Security Enhancement Service

  10.21   VMS Security Review Service

  11     Case Study: Lights Out Data Center

  11.1    The Customer Support Center Business

  11.2    Time for a Radical Change

  11.3    Customer Expectations

  11.4    Strategies for Achieving 100% Application Availability
    11.4.1     Process Strategy to Meet CSC Business Needs
    11.4.2     Staffing Strategy to Meet CSC Business Needs
    11.4.3     Hardware Strategy to Meet CSC Business Needs
    11.4.4     Software Strategy to Meet CSC Business Needs
    11.4.5     Environment Strategy to Meet CSC Business Needs
    11.4.6     Telecommunications Strategy to Meet CSC Business Needs

  11.5    Implementing the Data Center's Strategies
    11.5.1     Sensitive Implementation of a Refocused Staff
    11.5.2     Technology Planning and Utilization
    11.5.3     Protecting Against Environmental Factors
    11.5.4     Application Development Management and Implementation
    11.5.5     Overall Operations Support Implementation

  11.6    DECalert and Other Products Used to Manage CSC Data Center Operations

  11.7    Additional Benefits of the Lights Out Environment

  A   Data Center Evaluation Checklists

  A.1     General Planning Checklist

  A.2     Environmental Management Checklist

  A.3     Data Center Organization Checklist

  A.4     Security Checklist

  A.5     Application Software Checklist

  A.6     Digital Service and Support Checklist

  A.7     Compliance Summary

  B   ALL-IN-1 System Monitoring Checklist

  B.1     Monitoring Electronic Messaging
    B.1.1      Monitoring the Size of the Sender Queue
      B.1.1.1      How to Check for Deferred Messages
      B.1.1.2      Recommendations
    B.1.2      Monitoring the Size of the Fetcher Queue
      B.1.2.1      How to Check the ALL-IN-1 Fetcher Queue
      B.1.2.2      How to Check the Message Router ALL-IN-1 Mailbox
      B.1.2.3      Recommendations
    B.1.3      Checking the Mail Log Files
      B.1.3.1      Recommendations
    B.1.4      Checking Message Router Links
      B.1.4.1      How to Check if Message Router Is Available
      B.1.4.2      How to Check Message Router on a Remote Node
      B.1.4.3      Recommendations
    B.1.5      Monitoring the Size of the Mail Areas
      B.1.5.1      Recommendations

  B.2     Checking That VMS Setup Is Appropriate for Your ALL-IN-1 System
    B.2.1      Checking the Modes of Logical Names
      B.2.1.1      Recommendations
    B.2.2      Checking SYSGEN Parameters
      B.2.2.1      How to Calculate the Value of GBLSECTIONS
      B.2.2.2      How to Calculate the Value of GBLPAGES
      B.2.2.3      How to Calculate the Value of GBLPAGFIL
      B.2.2.4      Recommendations
    B.2.3      Checking SYSUAF Parameters
      B.2.3.1      Recommendations
    B.2.4      Checking the Protections on Major Files
      B.2.4.1      Recommendations

  B.3     Ensuring the Integrity of an ALL-IN-1 System
    B.3.1      Testing and Repairing Users' File Cabinets
    B.3.2      Backing Up ALL-IN-1
    B.3.3      Recommendations

  C   Bibliography

  C.1     Digital Publications

  C.2     Other Publications

  Glossary
    24x365 . . . disaster tolerant computing
    disk fragmentation . . . front-end
    hardware-based fault tolerance . . . redundant
    reliability . . . Y-connector
    zone . . . zone

  FIGURES

  1-1        Operational Conditions Metaphor

  1-2        Characteristics of Dependable Computing Systems

  1-3        Metaphor of Selected Automotive Components and Dependability Strategies

  1-4        Building Blocks of Dependable Systems

  2-1        Continuous Improvement Process

  2-2        Worksheet for Collecting Dependability Requirements

  2-3        Worksheet for Brainstorming First Steps

  4-1        A System with Independently Recoverable Kernels

  4-2        Individual Kernel Capacity Versus Time: Example State Diagrams

  4-3        Total System Capacity Versus Time: Example State Diagram

  5-1        Time Domain Behavior of VAX System Configurations

  6-1        VAX System Configuration Enhancements Worksheet

  6-2        Conventional VAX System: Example Configuration

  6-3        Fault Tolerant VAX System: Example Configuration

  6-4        MIRA AS: Example Configuration

  6-5        Local Area (Ethernet) VAXcluster System: Example Configuration

  6-6        DSSI VAXcluster System: Example Configuration

  6-7        CI VAXcluster System: Example Configuration

  6-8        FDDI VAXcluster System: Example Configuration

  6-9        Wide Area Network: Example Configuration

  7-1        Sample Configuration for a Local Area VAXcluster Connected to Two LAN Segments

  7-2        Sample Configuration for Local Area Cluster Systems Connected to Three LAN Segments

  7-3        FDDI in Conjunction with Ethernet in a VAXcluster System

  7-4        Multiple-Site Data Center VAXcluster System

  7-5        A DNS Namespace

  8-1        Two-Phase Commit Protocol for a Distributed Transaction

  8-2        Sample Debit/Credit Transaction Execution

  8-3        Participants in a Distributed Transaction Example

  8-4        Failover and Recovery Process for Rdb/VMS Users in VAXcluster

  8-5        Coordination of Rdb/VMS Online Backups

  8-6        Two-Phase Commit for Funds Transfer Example

  8-7        ACMS Application Failover

  8-8        ACMS Front-End Terminal Failover

  8-9        ACMS Request Capture

  11-1       CSC Data Center Operations Management

  11-2       DECalert Sensors and Alert Notifications

  11-3       DIANA Modules and DISPLAY

  TABLES

  1-1        Primary Strategies to Enhance Dependability

  1-2        Applying Primary Dependability Strategies to Building Blocks

  2-1        Dependability Requirements for the Sample Company

  2-2        The Sample Company's Proposed Enhancements

  4-1        Dependability Constraints: Sample Questions

  6-1        VAX System Configuration Enhancements: Sample Worksheet

  6-2        VAXcluster Interconnect Characteristics Summary

  6-3        Dependability Characteristics Summary

  7-1        Ethernet and FDDI Adapters

  9-1        Data Center Management Portfolio

  A-1        Compliance Summary