Document revision date: 19 July 1999

OpenVMS VAX System Dump Analyzer Utility Manual



8.2.2 Illegal Page Faults

The system signals a PGFIPLHI bugcheck when a page fault occurs while the interrupt priority level (IPL) is greater than 2 (IPL$_ASTDEL). When the system fails because of an illegal page fault, the following message appears on the console terminal:


PGFIPLHI, page fault with IPL too high 

When an illegal page fault occurs, the stack appears as shown in Figure SDA-4.

Figure SDA-4 Stack Following an Illegal Page-Fault Error


Six longwords describe the error:
Longword          Contents
--------          --------
R4                Contents of R4 at the time of the bugcheck.
R5                Contents of R5 at the time of the bugcheck.
Reason mask       Longword mask. If bit 0 of this longword is set, the failing instruction (at the PC saved below) caused a length violation. If bit 1 is set, it referred to a location whose page table entry is in a "no access" page. Bit 2 indicates the type of access used by the failing instruction: it is set for write and modify operations and clear for read operations.
Virtual address   Virtual address being referenced by the instruction that caused the page fault.
PC                PC containing the address of the instruction that caused the page fault.
PSL               PSL at the time of the page fault.
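
The three reason-mask bits described in the table can be tested individually. The following is a minimal Python sketch, not an SDA facility; the function name and the dictionary keys are illustrative assumptions.

```python
def decode_reason_mask(mask: int) -> dict:
    """Split a reason-mask longword into the three bits described in the table."""
    return {
        "length_violation": bool(mask & 0x1),  # bit 0
        "pte_reference": bool(mask & 0x2),     # bit 1: concerns the page table entry
        "write_or_modify": bool(mask & 0x4),   # bit 2: clear means a read access
    }


# Example: only bit 2 is set, so the failing access was a write or modify.
print(decode_reason_mask(0x00000004))
```

A mask of 00000000 therefore describes a plain read access with no length violation.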

If the operating system detects a page fault while the IPL is higher than IPL$_ASTDEL, you can obtain the address of the instruction that caused the fault by examining the PC pushed onto the current operating stack. Follow the steps outlined in Section 9.3 to determine which module issued the instruction.

9 A Sample System Failure

This section steps through the analysis of a system failure using, as an example, a printer driver. Three events lead up to this failure:

  1. The line printer goes off line for 3 hours.
  2. The line printer comes back on line.
  3. The operating system signals a bugcheck, writes information to the system dump file, and shuts itself down.

The following sections describe the actions to take in investigating the causes of this system crash.

9.1 Identifying the Bugcheck

First, invoke SDA to analyze the system dump file. The initialization message indicates the type of bugcheck that occurred as follows:


 
Dump taken on 31-JAN-1993 16:34:31.23 
INVEXCEPTN, Exception while above ASTDEL or on interrupt stack 
 
SDA> 

An exception occurred that caused the system to signal a bugcheck, and signal and mechanism arrays have been created on the current operating stack.

9.2 Identifying the Exception

Use the SHOW STACK command to display the current operating stack. In this case, it is the interrupt stack. The following example shows the interrupt stack and the signal and mechanism arrays. See the SHOW STACK command for a complete description of the format of the stack display.


CPU 01 Processor stack 
---------------------- 
Current operating stack (INTERRUPT) 
 
        8006A378    8000844B    ACP$WRITEBLK+0A0 
   .
   .
   .
  SP => 8006A398    7FFDC340 
        8006A39C    8006A3A0 
        8006A3A0    80004E7D    EXE$REFLECT+0D4 
        8006A3A4    04080009 
        8006A3A8    00000004 
        8006A3AC    7FFDC368 
        8006A3B0    FFFFFFFD 
        8006A3B4    8001774E 
        8006A3B8    0000074F 
        8006A3BC    00000001 
        8006A3C0    00000005 
        8006A3C4    0000000C 
        8006A3C8    00000000 
        8006A3CC    80069E00 
        8006A3D0    8005D003 
        8006A3D4    04080000 
        8006A3D8    80009604    EXE$FORKDSPTH+01C 
   .
   .
   .

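The mechanism array begins at address 8006A3A8₁₆ and ends at address 8006A3B8₁₆. Its first longword contains 00000004₁₆. The signal array begins at address 8006A3C0₁₆ and ends at 8006A3D4₁₆. Its first longword contains 00000005₁₆ and its second longword contains 0000000C₁₆. Examination of the signal array shows the following:

  1. The exception code, 0000000C₁₆, identifies the exception as an access violation.
  2. The reason mask is 00000000₁₆: the failing instruction did not cause a length violation, and it was attempting to read the location.
  3. The virtual address that the instruction attempted to reference is 80069E00₁₆.
  4. The PC of the failing instruction is 8005D003₁₆.
  5. The PSL at the time of the exception is 04080000₁₆.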
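
Both arrays are counted: the first longword gives the number of longwords that follow. As a rough illustration, the Python sketch below reads the two arrays out of the longwords copied from the stack display above. The dictionary and the helper name `read_counted_array` are assumptions made for this example only.

```python
# Longwords copied from the SHOW STACK display (address -> contents).
STACK = {
    0x8006A3A8: 0x00000004, 0x8006A3AC: 0x7FFDC368, 0x8006A3B0: 0xFFFFFFFD,
    0x8006A3B4: 0x8001774E, 0x8006A3B8: 0x0000074F, 0x8006A3BC: 0x00000001,
    0x8006A3C0: 0x00000005, 0x8006A3C4: 0x0000000C, 0x8006A3C8: 0x00000000,
    0x8006A3CC: 0x80069E00, 0x8006A3D0: 0x8005D003, 0x8006A3D4: 0x04080000,
}


def read_counted_array(stack: dict, start: int) -> list:
    """Return the longwords that follow the count stored at 'start'."""
    count = stack[start]
    return [stack[start + 4 * i] for i in range(1, count + 1)]


mechanism = read_counted_array(STACK, 0x8006A3A8)
signal = read_counted_array(STACK, 0x8006A3C0)

# For an access violation the signal arguments are, in order:
# exception code, reason mask, virtual address, PC, PSL.
code, reason_mask, virtual_address, pc, psl = signal
print(f"code={code:08X} va={virtual_address:08X} pc={pc:08X} psl={psl:08X}")
```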

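Issuing the SDA command EVALUATE/PSL 04080000 makes the following information apparent:

  1. The processor was executing on the interrupt stack.
  2. The current and previous access modes were both kernel.
  3. The IPL was 8.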
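
The same fields can be picked out of the PSL by hand. The sketch below assumes the standard VAX PSL layout (interrupt-stack flag in bit 26, current mode in bits 25:24, previous mode in bits 23:22, IPL in bits 20:16); it does not reproduce the format of the EVALUATE/PSL display, and the function name is illustrative.

```python
MODES = ("KERNEL", "EXECUTIVE", "SUPERVISOR", "USER")


def decode_psl(psl: int) -> dict:
    """Extract the fields of a VAX PSL that matter for crash analysis."""
    return {
        "interrupt_stack": bool((psl >> 26) & 0x1),
        "current_mode": MODES[(psl >> 24) & 0x3],
        "previous_mode": MODES[(psl >> 22) & 0x3],
        "ipl": (psl >> 16) & 0x1F,
    }


print(decode_psl(0x04080000))
```

For 04080000 this yields the interrupt stack, kernel mode for both current and previous mode, and IPL 8, which is above IPL$_ASTDEL (2).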

Use the SHOW PAGE_TABLE command to display the system page table, as shown in the following example. The page containing location 80069E00₁₆ is not available to any access mode (a null page); thus, the virtual address is not valid.


SDA> SHOW PAGE_TABLE
 
 
System page table
-----------------
  
ADDRESS   SVAPTE   PTE       TYPE  PROT  BITS PAGTYP  LOC STATE TYPE REFCNT   BAK       SVAPTE  FLINK  BLINK
   .
   .
   .
80068400  80777B08 7C40FFC8  STX   UR       K
80068600  80777B0C 7C40FFC8  STX   UR       K
80068800  80777B10 7C40FFC8  STX   UR       K
80068A00  80777B14 7C40FFC8  STX   UR       K
80068C00  80777B18 7C40FFC8  STX   UR       K
80068E00  80777B1C 7C40FFC8  STX   UR       K
80069000  80777B20 7C40FFC8  STX   UR       K
80069200  80777B24 7C40FFC8  STX   UR       K
80069400  80777B28 7C40FFC8  STX   UR       K
80069600  80777B2C 7C40FFC8  STX   UR       K
80069800  80777B30 7C40FFC8  STX   UR       K
80069A00  80777B34 780016C9  TRANS UR       K SYSTEM FREELST 00   01    0   0040FFC8   80777B34  03AF  0E15
80069C00  80777B38 78000E15  TRANS UR       K SYSTEM FREELST 00   01    0   0040FFC8   80777B38  16C9  2592
-------- 40 NULL PAGES
   .
   .
   .
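
To see where the failing address sits relative to the entries shown, note that each entry in the display covers 200 (hexadecimal) bytes and that consecutive SVAPTE values differ by 4. The Python sketch below uses those two facts, read from the display, to compute the page that contains 80069E00 and the address its page table entry would have. The function names are illustrative assumptions.

```python
PAGE_SIZE = 0x200  # bytes per page, as seen in the ADDRESS column
PTE_SIZE = 4       # bytes per page table entry, as seen in the SVAPTE column


def page_base(va: int) -> int:
    """Round a virtual address down to the start of its page."""
    return va & ~(PAGE_SIZE - 1)


def svapte_for(va: int, known_va: int, known_svapte: int) -> int:
    """Extrapolate a PTE address from one known (page address, SVAPTE) pair."""
    pages = (page_base(va) - known_va) // PAGE_SIZE
    return known_svapte + pages * PTE_SIZE


# Last entry shown in the display: page 80069C00, SVAPTE 80777B38.
print(f"{page_base(0x80069E00):08X}")
print(f"{svapte_for(0x80069E00, 0x80069C00, 0x80777B38):08X}")
```

The failing address is the first byte of the page that immediately follows the last entry shown, which is where the display reports null pages.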

9.3 Locating the Source of the Exception

Because the printer went off line and then came back on line, as shown on the console listing in Section 9.2, the problem might exist in the driver code. SDA can help you determine which driver might contain the faulty code.

9.3.1 Finding the Driver by Using the Program Counter

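The first step in determining whether the failing instruction is within a driver is to examine the PC in the signal array using the EXAMINE/INSTRUCTION command. This has two results:

  1. SDA displays the instruction located at that address.
  2. SDA displays the address symbolically, as an offset from a symbol, if it can find one close enough to the address. The symbol indicates the module that contains the instruction.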

In the following example, the instruction that caused the exception is located within the printer driver.


SDA> EXAMINE/INSTRUCTION 8005D003
LPDRIVER+2B3   MOVB    (R3)+,(R0)

If SDA is unable to find a symbol within FFF₁₆ bytes of the memory location you specify, it displays the location as an absolute address. This often, but not always, means the instruction that caused the exception is not part of a device driver.

To determine whether an instruction is part of a driver, use the SHOW DEVICE command to display the starting addresses and lengths of all the drivers in the system. If the address of the failing instruction falls within the range of addresses shown for a given driver, the failing instruction is a part of that driver. The following example shows a partial list of the drivers in the display generated by the SHOW DEVICE command.


I/O data structures 
 
                           DDB list 
                           -------- 
 
    Address    Controller     ACP       Driver      DPT   DPT size 
    -------    ----------     ---       ------      ---   -------- 
 
    80000ECC    HELIUM$DBA    F11XQP    DBDRIVER   800F7AD0  08FD 
    80001040    OPA                     OPERATOR   80001622  0061 
    8000126C    MBA                     MBDRIVER   800015B0  0578 
    80001460    NLA                     NLDRIVER   800015E9  05A3 
    801E2800    HELIUM$DMA    F11XQP    DMDRIVER   800B5CB0  0AA0 
    801E2980    HELIUM$DLA    F11XQP    DLDRIVER   800B6A50  08D0 
   .
   .
   .
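
The range check described above can be sketched in a few lines of Python. The example assumes that the DPT size column is a hexadecimal byte count and that a driver occupies the addresses from its DPT up to, but not including, DPT plus size. Only three entries from the partial display are copied, and the function name is illustrative.

```python
# (driver, DPT address, DPT size) copied from the partial display above.
DRIVERS = [
    ("DBDRIVER", 0x800F7AD0, 0x08FD),
    ("DMDRIVER", 0x800B5CB0, 0x0AA0),
    ("DLDRIVER", 0x800B6A50, 0x08D0),
]


def find_driver(address: int, drivers=DRIVERS):
    """Return (driver name, offset from DPT) or None if no range matches."""
    for name, dpt, size in drivers:
        if dpt <= address < dpt + size:
            return name, address - dpt
    return None


print(find_driver(0x800B6A60))  # an address inside DLDRIVER
print(find_driver(0x8005D003))  # not in the three entries copied here
```

Because the display is partial and does not include the printer driver, the lookup for 8005D003 finds nothing in this reduced table; against the full display it would be expected to match the printer driver's range.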

9.3.2 Calculating the Offset into the Driver's Program Section

The offsets that SDA displays from nnDRIVER are actually offsets from the DPT. As such, these offsets do not exactly correspond to the offsets shown in driver listings, which represent offsets from the beginning of the program section (PSECT) in which a given instruction appears. Because a driver usually contains more than one PSECT, you must use the driver's map to determine the location of the failing instruction within the driver listing.

To calculate the location of the instruction within the driver listing, refer to the "Program Section Synopsis" section of the driver's map. Determine in which PSECT the offset given by SDA occurs and subtract the base of the PSECT from the offset. You can then use the resulting figure as an index into the driver listing.

If SDA does not display the address as an offset from nnDRIVER, but the address is within the address range of a driver in the SHOW DEVICE display, you must first subtract the address of the DPT from the failing address. Using the result as the offset, you can then follow the steps previously outlined for determining the index of the instruction into a driver listing.
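
The arithmetic in this section can be illustrated as follows. The DPT address used here, 8005CD50, is derived from the SDA display (8005D003 minus 2B3). The PSECT names, bases, and lengths are hypothetical stand-ins for values you would read from the "Program Section Synopsis" of the driver's map; they are not taken from a real map.

```python
# Hypothetical PSECT layout: (name, base offset from DPT, length).
PSECTS = [
    ("PROLOGUE", 0x000, 0x060),
    ("DRIVER_CODE", 0x060, 0x400),
]


def listing_offset(address: int, dpt_address: int, psects=PSECTS):
    """Convert an absolute address into (PSECT name, offset within PSECT)."""
    offset_from_dpt = address - dpt_address
    for name, base, length in psects:
        if base <= offset_from_dpt < base + length:
            return name, offset_from_dpt - base
    return None


dpt = 0x8005D003 - 0x2B3  # LPDRIVER+2B3 was reported for 8005D003
print(f"{dpt:08X}")
print(listing_offset(0x8005D003, dpt))
```

With this made-up layout the failing instruction would appear at offset 253 (hexadecimal) in the second PSECT; a real map supplies different bases.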

9.4 Finding the Problem Within the Routine

To find the problem within the routine, examine the printer's driver code. In the system failure discussed in this example, the instruction that caused the exception is MOVB (R3)+,(R0). To check the contents of R3, use the EXAMINE command as follows:


SDA> EXAMINE R3
R3: 80069E00 "...."

The invalid virtual address, as recorded in the signal array, is stored in R3. In the following driver code excerpt, the instruction in question appears at line 599. It is likely that the contents of R3 have been incremented too many times.


581 STARTIO: 
582      MOVL    UCB$L_IRP(R5),R3     ;Retrieve address of I/O packet 
583      MOVW    IRP$L_MEDIA+2(R3),- 
584              UCB$W_BOFF(R5)       ;Set number of characters to print 
585      MOVL    UCB$L_SVAPTE(R5),R3  ;Get address of system buffer 
586      MOVAB   12(R3),R3            ;Get address of data area 
587      MOVL    UCB$L_CRB(R5),R4     ;Get address of CRB 
588      MOVL    @CRB$L_INTD+VEC$L_IDB(R4),R4 ;Get device CSR address 
589 ; 
590 ; START NEXT OUTPUT SEQUENCE 
591 ; 
592 
593 10$: ADDL3   #LP_DBR,R4,R0        ;Calculate address of data buffer register 
594      MOVZWL  UCB$W_BOFF(R5),R1    ;Get number of characters remaining 
595      MOVW    #^X8080,R2           ;Get control register test mask 
596      BRB     25$                  ;Start output 
597 20$: BITW    R2,(R4) (1)           ;Printer ready or have paper problem? 
598      BLEQ    30$                  ;If LEQ not ready or paper problem 
599      MOVB    (R3)+,(R0) (2)        ;Output next character 
600      ASHL    #1,G^EXE$GL_UBDELAY,-(SP)    ;Delay 3*2 u-seconds 
601 24$: SOBGEQ  (SP),24$             ;Delay loop calibrated to machine speed 
602      ADDL    #4,SP                ;Pop extra longword off stack 
603 25$: SOBGEQ  R1,20$ (3)            ;Any more characters to output? 
604      BRW     70$                  ;All done, BRW to set return status 

Explanations of the circled numbers in the example are in Section 9.4.1.

9.4.1 Examining the Routine

The MOVB instruction is part of a routine that reads characters from a buffer and writes them to the printer. The routine contains the loop of instructions that starts at the label 20$ and ends at 25$. This loop executes once for each character in the buffer, performing these steps:

  1. The driver checks the printer's status register to see if the printer is ready.
  2. If the printer is ready, the driver gets a character from the buffer and moves it to the printer's data register, to which R0 points.
  3. It then decrements R1, which contains the count of characters left to print. If the decremented value is greater than or equal to 0, control is passed back to the instruction at 20$, and the loop begins again.

Steps 1 through 3 are repeated until the count in R1 is exhausted or the printer signals that it is not ready.

If the printer signals that it is not ready, the BLEQ instruction at line 598 transfers control to 30$, the beginning of a routine that waits for an interrupt from the printer. When the printer becomes ready, it interrupts the driver and execution of the loop resumes.
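
The control flow of lines 593 through 604 can be modeled in Python to show how the initial count in R1 determines how far R3 advances. This is a simplified sketch: it assumes the printer is always ready, ignores the delay loop, and uses illustrative names that do not appear in the driver.

```python
def run_output_loop(chars_remaining: int, buffer_address: int):
    """Model the 20$/25$ loop; return (bytes moved, final value of R3)."""
    r1 = chars_remaining  # line 594: MOVZWL UCB$W_BOFF(R5),R1
    r3 = buffer_address   # set up in lines 585-586
    moved = 0
    while True:
        # Line 596 branches to 25$, so the SOBGEQ test runs before the first MOVB.
        r1 -= 1           # line 603: SOBGEQ subtracts 1 from R1...
        if r1 < 0:        # ...and falls through when the result is negative
            break
        r3 += 1           # line 599: MOVB (R3)+,(R0) advances R3 by one byte
        moved += 1
    return moved, r3


print(run_output_loop(3, 0x1000))
```

In this model the number of bytes moved always equals the initial count, so a count larger than the buffer carries R3 past the end of the buffer.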

Examine the code to determine which variables control the loop.

The byte count (BCNT) is the number of characters in the buffer; a function decision table (FDT) routine sets it. In line 586, the starting address of a buffer that is BCNT bytes in size is moved into R3.

Note also that the number of characters left to be printed is represented by the byte offset (BOFF), the offset into the buffer at which the driver finds the next character to be printed. This value controls the number of times the loop is executed.

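Because the exception is an access violation, either R3 or R0 must contain an incorrect value. You can determine that R0 is probably valid by the following logic:

  1. R0 is calculated at line 593 by adding the constant LP_DBR to the contents of R4, the device CSR address.
  2. The BITW instruction at line 597 references the location to which R4 points immediately before the MOVB instruction, and it does so without error.
  3. The invalid virtual address recorded in the signal array, 80069E00₁₆, matches the contents of R3, not R0.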

Thus, the contents of R3 seem to be the cause of the failure.

The most likely reason that the contents of R3 are wrong is that the MOVB instruction at line 599 executes too many times. You can check this by comparing the contents of UCB$W_BOFF and UCB$W_BCNT. If UCB$W_BOFF contains a larger value than that in UCB$W_BCNT, then R3 contains a value that is too large, indicating that the MOVB instruction has incremented the contents of R3 too many times.

9.4.2 Checking the Values of Key Variables

Because the start-I/O routine requires that R5 contain the address of the printer's UCB, and because several other instructions reference R5 without error before any instruction in the loop does, you can assume that R5 contains the address of the right UCB. To compare BOFF and BCNT, use the command FORMAT @R5 to display the contents of the UCB, as shown in the following session.


SDA> READ SYS$SYSTEM:SYSDEF.STB
SDA> FORMAT @R5


8005D160    UCB$L_FQFL      800039A8 
            UCB$L_RQFL 
            UCB$W_MB_SEED 
            UCB$W_UNIT_SEED 
8005D164    UCB$L_FQBL      800039A8 
            UCB$L_RQBL 
8005D168    UCB$W_SIZE          0122 
8005D16A    UCB$B_TYPE        10 
8005D16B    UCB$B_FIPL      34 
            UCB$B_FLCK 
   .
   .
   .
8005D1C8    UCB$L_SVAPTE    80062720 
8005D1CC    UCB$W_BOFF          0795 
8005D1CE    UCB$W_BCNT      006D 
8005D1D0    UCB$B_ERTCNT          00 
8005D1D1    UCB$B_ERTMAX        00 
8005D1D2    UCB$W_ERRCNT    0000 
   .
   .
   .
SDA> 

If you have only one printer in your system configuration, you do not need to use the FORMAT command. Instead, you can use the command SHOW DEVICE LP. Because only one printer is connected to the processor, only one UCB is associated with a printer for SDA to display.

The output produced by the FORMAT @R5 command shows that UCB$W_BOFF contains a value greater than that in UCB$W_BCNT; it should be smaller. Therefore, the value stored in BOFF is incorrect.

Thus, the value of BOFF is not the number of characters that remain in the buffer. This value is used in calculating an address that is referenced at an elevated IPL. When this address is within a null page (unreadable in all access modes), an attempt to reference it causes the system to fail.
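
The comparison can be made concrete with the values from the FORMAT display and the contents of R3. The Python sketch below assumes that UCB$L_SVAPTE still holds the system buffer address that line 585 loaded, and that the data area starts 12 bytes into that buffer (line 586). The function name is illustrative.

```python
UCB_SVAPTE = 0x80062720  # UCB$L_SVAPTE from the FORMAT display
UCB_BOFF = 0x0795        # UCB$W_BOFF
UCB_BCNT = 0x006D        # UCB$W_BCNT
R3 = 0x80069E00          # contents of R3 at the time of the failure


def bytes_past_buffer(svapte: int, bcnt: int, pointer: int) -> int:
    """Return how far 'pointer' lies beyond the end of the data area."""
    data_start = svapte + 12       # line 586: MOVAB 12(R3),R3
    data_end = data_start + bcnt   # first address after the last character
    return pointer - data_end


print(UCB_BOFF > UCB_BCNT)
print(hex(bytes_past_buffer(UCB_SVAPTE, UCB_BCNT, R3)))
```

Under these assumptions BOFF exceeds BCNT and R3 points well beyond the end of the data area, which is consistent with the MOVB instruction having executed too many times.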

9.4.3 Identifying and Correcting the Defective Code

Examine the printer driver code to locate all instructions that modify UCB$W_BOFF. The value changes in two circumstances:

When the printer times out, the driver should not modify UCB$W_BOFF. It does so, however, in line 631. The driver should modify the contents of UCB$W_BOFF only when it is certain that the printer printed the character. When the printer times out, this is not the case. Furthermore, the wait-for-interrupt routine preserves only registers R3, R4, and R5, so that only those registers can be used unmodified after the execution of the wait-for-interrupt routine. Thus, the use of R1 in line 631 is an error.

To correct the problem, change the WFIKPCH argument (line 616) so that, when the printer times out, the WFIKPCH macro transfers control to 50$ rather than to 40$.


607 
608 30$: BNEQ    40$                  ;If NEQ paper problem 
609      ADDW3   #1,R1,UCB$W_BOFF(R5) ;Save number of characters remaining 
610      DEVICELOCK - 
611              LOCKADDR=UCB$L_DLCK(R5),-  ;Lock device interrupts 
612              SAVIPL=-(SP)         ;Save current IPL      
613      BITW    #^X80,LP_CSR(R4)     ;Is it ready now? 
614      BNEQ    35$                  ;If NEQ, yes, it's ready 
615      BISB    #^X40,LP_CSR(R4)     ;Set interrupt enable 
616      WFIKPCH 40$,#12              ;Wait for ready interrupt 
617      IOFORK                       ;Create a fork process 
618      BRB     10$                  ;  ...and start next output 
619 
620 35$: 
621      DEVICEUNLOCK - 
622              LOCKADDR=UCB$L_DLCK(R5),-  ;Unlock device interrupts 
623              NEWIPL=(SP)+         ;Restore IPL 
624      CLRW    LP_CSR(R4)           ;Disable device interrupts 
625      BRB     10$                  ;Go transfer more characters 
626 ; 
627 ; PRINTER HAS PAPER PROBLEM 
628 ; 
629 
630 40$: CLRL    UCB$L_LP_OFLCNT(R5)  ;Clear offline counter 
631      ADDW3   #1,R1,UCB$W_BOFF(R5) ;Save number of characters remaining 
632 50$: CLRW    LP_CSR(R4)           ;Disable printer interrupt 
633      IOFORK                       ;Lower to fork level 
634      BBS     #UCB$V_CANCEL,UCB$W_STS(R5),80$  ;If set, cancel I/O operation 
635      TSTW    LP_CSR(R4)           ;Printer still have paper problem? 
636      BLSS    55$                  ;If LSS yes 
637      MOVL    #15,UCB$L_LP_TIMEOUT(R5)  ;Set timeout value 
638      BRB     10$                  ; ...and start next output 

10 Inducing a System Failure

If the operating system is not performing well and you want to create a dump you can examine, you must induce a system failure. Occasionally, a device driver or other user-written, kernel-mode code can cause the system to execute a loop of code at a high priority, interfering with normal system operation. This can occur even though you have set a breakpoint in the code if the loop is encountered before the breakpoint. To gain control of the system in such circumstances, you must cause the system to fail and then reboot it.

If the system has suspended all noticeable activity (if it is "hung"), see the examples of causing system failures in Section 10.2.

If you are generating a system crash in response to a system hang, be sure to record the PC at the time of the system halt as well as the contents of the general registers. Submit this information to Digital, along with the Software Performance Report (SPR) and a copy of the generated system dump file.

10.1 Meeting Crash Dump Requirements

The following requirements must be met before the system can write a complete crash dump:

