L [This is the text of the message I posted to Info-VAX in late April of 1988.J  It's included here for historical perspective about the motivation behind  RAMDRIVER.]          P -------------------------------------------------------------------------------- Folks,  H Before I get started let me apologize about the length -- it's long.  ToN those of you without in depth knowledge of VMS internals let me also apologizeK for the amount of VMS internal mumbo-jumbo this contains.  For those of you N with lots of in depth knowledge of VMS internals let me apologize for glossing over a few points.  O Here's some background on what I've been wasting my "spare" time with lately... H A while ago a friend and I got to talking about DEC's PDDRIVER, which isL essentially a RAM disk driver for VMS.  It was being used for a demo to helpI speed up the login process by putting popular files (LOGINOUT, DCLTABLES, N SYLOGIN, etc.) on it.  We theorized that perhaps PDDRIVER wasn't significantly1 faster than a real disk in certain circumstances.   I DEC wrote PDDRIVER as a way to get standalone backup booted from TK50s in M order to boot VMS.  The reason for this you might have to accept on faith ... I VMS requires a random access device as the system disk, and TK50s are not I random access devices.  So, if you set bit 18 (%x20000) in R5 when you're H booting a VAX, VMB will set things up such that the system disk and loadK device are different -- the stuff will be loaded from TK50s, and the system ) disk will be PDA0 - using DEC's PDDRIVER.   K PDDRIVER differs from standard disk drivers in a few interesting ways.  The L IO$_FORMAT I/O function is used to cause the driver to assign non-paged poolL to be the "blocks" which comprise the "disk".  Normal disk type I/O functionK codes go thru the standard ACP FDT routines in SYSACPFDT, and are queued to M the driver's start I/O routine.  
The start I/O routine uses two VMS routines,
IOC$MOVFRUSER and IOC$MOVTOUSER, to accomplish "writes" and "reads".  Those
two routines operate by double-mapping the user's buffer a page at a time
in system virtual address space and moving data a byte at a time, remapping
the one-page window when the buffer crosses a page boundary.  All of this
occurs at driver fork IPL, which happens to match IPL$_SYNCH.  This
double-mapping is necessary because fork processes do not have process
context, and therefore user page tables are not current.

So, when PDDRIVER is copying data all other driver fork activity stops, your
scheduler stops, software timers stop, cluster lock management stops, etc.,
since you're spending perhaps significant amounts of time at IPL$_SYNCH.
It's because of this that response time (e.g., elapsed time for I/O) may
actually be slower for PDDRIVER than it is for a real disk.  A real disk
driver sets up a DMA transfer and then places itself in a wait state, waiting
for the device to interrupt.  When the device interrupts a fork process is
queued, which is used to do device-dependent post-processing and eventually
hand the IRP off to IOC$IOPOST for device-independent post-processing.  While
the driver is waiting for the interrupt your system is doing other things,
like dealing with other driver fork processes, processing the software timer
interrupts, doing cluster lock management, and processing scheduler
interrupts.

In theory a RAM disk *ought* to be faster than a real disk, since there's no
rotational latency and memory is faster than a spinning oxide platter.  One
obvious way to make a faster RAM disk is to use a MOVC3 instruction rather
than moving data a byte at a time.  The other obvious way to improve response
is to not spend so much time at IPL$_SYNCH.

VMS device drivers operate at a variety of IPLs in order to synchronize their
activity at various phases of processing an I/O request.
One of those phases,
FDT processing, happens at IPL$_ASTDEL (2).  Response time could be improved
if we could do the data transfer in the FDT routines and avoid the start I/O
routine (which must execute at driver fork IPL, almost universally
IPL$_SYNCH) altogether.  This turns out to be next to impossible, since the
FDT routines for disk devices do lots of processing related to manipulating
the file system.  It's best to let DEC do the code for those, and DEC's FDT
routines exit by queueing the IRP to the driver's start I/O routine.

So, what you can do in the start I/O routine is turn around and queue a
kernel-mode AST back to the requestor process.  The kernel-mode AST will
execute at IPL$_ASTDEL in the context of the issuing process.  This has two
advantages:  one is that we're no longer at IPL$_SYNCH, and the other is that
we have process context, and process virtual address space is mapped (which
is not true for the start I/O routine, which executes in reduced fork
context).  Thus the user's buffer is visible without having to double-map it.

A further performance optimization you can make once your data transfers take
place without disabling the scheduler is to allow multiple simultaneous read
requests.  The MOVC3 instruction is interruptible, and processes can compete
for the CPU in the "middle" of executing that instruction.  Thus, we can
protect the RAM disk using a MUTEX, which allows multiple read lockers and a
single write locker.  Write requests are still single-threaded (one writer at
a time), which maintains proper data integrity.

As you've probably guessed by now, I've written a replacement for DEC's
PDDRIVER.  It does everything PDDRIVER does except support system booting,
which mine wasn't designed to do (that's all DEC's was designed to do).  Mine
is supposed to be a faster disk.
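For readers without VMS internals, here is a conceptual sketch of that MUTEX
discipline -- any number of concurrent read lockers, but a writer gets the
disk to itself.  This is Python, purely illustrative (the class and names are
my own invention; the real driver uses the VMS MUTEX primitives, not code
like this):

```python
import threading

class ReadWriteLock:
    """Illustrative analogue of the multiple-reader/single-writer
    MUTEX scheme: reads may overlap, writes are exclusive."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0       # count of active read lockers
        self._writing = False   # True while the single write locker holds it

    def acquire_read(self):
        with self._cond:
            while self._writing:            # readers wait out a writer
                self._cond.wait()
            self._readers += 1              # many readers may be active at once

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:          # last reader out wakes any writer
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writing or self._readers > 0:
                self._cond.wait()           # writer waits for exclusive access
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()         # wake waiting readers and writers
```

The point is the same one made above: because reads don't exclude each other,
several processes can be copying data out of the RAM disk simultaneously,
while writes remain single-threaded to preserve data integrity.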
To give you some idea of the results of all of this, I wrote a program which
uses block-mode RMS to read and subsequently write 64-block chunks to a
128-block contiguous file, rewinding when it hits EOF.  I put the file it
operates on on three different devices:  an RA81 connected to an HSC-50, a
RAM disk controlled by DEC's PDDRIVER, and a RAM disk controlled by my
driver.  The program makes 2000 passes through the loop reading and writing
blocks.  Here are the results for a single job mashing on the file:

RA81/HSC50: ELAPSED: 0 00:06:06.84  CPU: 0:00:33.54  BUFIO: 2  DIRIO: 5000  FAULTS: 13
PDDRIVER:   ELAPSED: 0 00:08:47.22  CPU: 0:08:10.49  BUFIO: 3  DIRIO: 5002  FAULTS: 6
RAMDRIVER:  ELAPSED: 0 00:01:35.64  CPU: 0:01:12.65  BUFIO: 3  DIRIO: 5000  FAULTS: 6

Since the program uses UPI (user process interlocking), you can start
multiple copies bashing on the same file.  Since this is just a response-time
test it doesn't really matter what this does to the file; all that matters is
that you get some idea of the scaling for multiple processes accessing the
disk.  I took the same file and started 5 copies of the program running on
it.
Here are the results:

RA81/HSC50:

 ELAPSED:    0 00:20:37.14  CPU: 0:00:32.75  BUFIO: 2  DIRIO: 5000  FAULTS: 13
 ELAPSED:    0 00:20:40.20  CPU: 0:00:34.49  BUFIO: 2  DIRIO: 5000  FAULTS: 13
 ELAPSED:    0 00:20:40.82  CPU: 0:00:31.80  BUFIO: 2  DIRIO: 5000  FAULTS: 13
 ELAPSED:    0 00:20:42.59  CPU: 0:00:33.30  BUFIO: 2  DIRIO: 5000  FAULTS: 13
 ELAPSED:    0 00:20:39.12  CPU: 0:00:32.14  BUFIO: 2  DIRIO: 5000  FAULTS: 13

PDDRIVER:

 ELAPSED:    0 00:41:39.63  CPU: 0:08:03.59  BUFIO: 3  DIRIO: 5002  FAULTS: 13
 ELAPSED:    0 00:42:16.22  CPU: 0:08:02.16  BUFIO: 3  DIRIO: 5000  FAULTS: 13
 ELAPSED:    0 00:43:06.47  CPU: 0:08:03.88  BUFIO: 3  DIRIO: 5000  FAULTS: 13
 ELAPSED:    0 00:42:29.57  CPU: 0:08:03.66  BUFIO: 3  DIRIO: 5000  FAULTS: 13
 ELAPSED:    0 00:41:36.60  CPU: 0:08:02.55  BUFIO: 3  DIRIO: 5000  FAULTS: 13
RAMDRIVER:

 ELAPSED:    0 00:05:18.13  CPU: 0:01:12.33  BUFIO: 3  DIRIO: 5002  FAULTS: 13
 ELAPSED:    0 00:06:03.99  CPU: 0:01:11.81  BUFIO: 3  DIRIO: 5000  FAULTS: 13
 ELAPSED:    0 00:06:19.82  CPU: 0:01:12.32  BUFIO: 3  DIRIO: 5002  FAULTS: 13
 ELAPSED:    0 00:05:28.92  CPU: 0:01:11.66  BUFIO: 3  DIRIO: 5000  FAULTS: 13
 ELAPSED:    0 00:06:06.50  CPU: 0:01:12.75  BUFIO: 3  DIRIO: 5000  FAULTS: 13

Note that in both cases DEC's RAM disk driver is slower than a real disk, and
when there are multiple accessors it's *significantly* slower.  These numbers
do not suggest that using PDDRIVER to speed your system up is a win.

I'm not satisfied that RAMDRIVER is 100% safe yet; I intend to bash it here
at SDSC until I'm quite satisfied that it works.  I'll keep y'all posted.

gkn

----------------------------------------
Internet: GKN@SDS.SDSC.EDU
Bitnet:   GKN@SDSC
Span:     SDSC::GKN (27.1)
MFEnet:   GKN@SDS
USPS:     Gerard K. Newman
          San Diego Supercomputer Center
          P.O. Box 85608
          San Diego, CA 92138-5608
Phone:    619.534.5076