| Document revision date: 19 July 1999 | |
A PGFIPLHI bugcheck occurs when the system takes a page fault while the interrupt priority level (IPL) is greater than 2 (IPL$_ASTDEL). When the system fails because of an illegal page fault, the following message appears on the console terminal:
| PGFIPLHI, page fault with IPL too high | 
When an illegal page fault occurs, the stack appears as shown in Figure SDA-4.
Figure SDA-4 Stack Following an Illegal Page-Fault Error
 
Six longwords describe the error:
| Longword | Contents | 
|---|---|
| R4 | Contents of R4 at the time of the bugcheck. | 
| R5 | Contents of R5 at the time of the bugcheck. | 
| Reason mask | Longword mask. If bit 0 of this longword is set, the failing instruction (at the PC saved below) caused a length violation. If bit 1 is set, it referred to a location whose page table entry is in an "access" page. Bit 2 indicates the type of access used by the failing instruction: it is set for write and modify operations and clear for read operations. | 
| Virtual address | Virtual address being referenced by the instruction that caused the page fault. | 
| PC | PC containing the address of the instruction that caused the page fault. | 
| PSL | PSL at the time of the page fault. | 
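The reason mask in the table above can be decoded bit by bit. The following Python sketch is illustrative only: the function name and dictionary keys are made up for this example, and the bit meanings are taken directly from the table.

```python
def decode_reason_mask(mask: int) -> dict:
    """Decode the low three bits of the reason-mask longword.

    Key names are hypothetical; bit meanings follow the table above.
    """
    return {
        # Bit 0: the failing instruction caused a length violation.
        "length_violation": bool(mask & 0x1),
        # Bit 1: the reference involved the page table entry (see the table text).
        "pte_reference": bool(mask & 0x2),
        # Bit 2: set for write and modify operations, clear for reads.
        "access": "write/modify" if mask & 0x4 else "read",
    }


print(decode_reason_mask(0x4))
```

A mask of 0 therefore describes a plain read reference with neither of the first two conditions set.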
If the operating system detects a page fault while the IPL is higher 
than IPL$_ASTDEL, you can obtain the address of the instruction that 
caused the fault by examining the PC pushed onto the current operating 
stack. Follow the steps outlined in Section 9.3 to determine which 
module issued the instruction.
9 A Sample System Failure
This section steps through the analysis of a system failure using, as an example, a printer driver. Three events lead up to this failure:
The following sections describe the actions to take in investigating 
the causes of this system crash.
9.1 Identifying the Bugcheck
First, invoke SDA to analyze the system dump file. The initialization message identifies the type of bugcheck that occurred:
| 
Dump taken on 31-JAN-1993 16:34:31.23 
INVEXCEPTN, Exception while above ASTDEL or on interrupt stack 
 
SDA> 
 | 
An exception occurred that caused the system to signal a bugcheck, and 
signal and mechanism arrays have been created on the current operating 
stack.
9.2 Identifying the Exception
Use the SHOW STACK command to display the current operating stack. In this case, it is the interrupt stack. The following example shows the interrupt stack and the signal and mechanism arrays. See the SHOW STACK command for a complete description of the format of the stack display.
| 
CPU 01 Processor stack 
---------------------- 
Current operating stack (INTERRUPT) 
 
        8006A378    8000844B    ACP$WRITEBLK+0A0 
   .
   .
   .
  SP => 8006A398    7FFDC340 
        8006A39C    8006A3A0 
        8006A3A0    80004E7D    EXE$REFLECT+0D4 
        8006A3A4    04080009 
        8006A3A8    00000004 
        8006A3AC    7FFDC368 
        8006A3B0    FFFFFFFD 
        8006A3B4    8001774E 
        8006A3B8    0000074F 
        8006A3BC    00000001 
        8006A3C0    00000005 
        8006A3C4    0000000C 
        8006A3C8    00000000 
        8006A3CC    80069E00 
        8006A3D0    8005D003 
        8006A3D4    04080000 
        8006A3D8    80009604    EXE$FORKDSPTH+01C 
   .
   .
   .
 | 
The mechanism array begins at address 8006A3A8₁₆ and ends at address 8006A3B8₁₆. Its first longword contains 00000004₁₆. The signal array begins at address 8006A3C0₁₆ and ends at 8006A3D4₁₆. Its first longword contains 00000005₁₆ and its second longword contains 0000000C₁₆. Examination of the signal array shows the following:
Issuing the SDA command EVALUATE/PSL 04080000 makes the following information apparent:
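As a rough cross-check of what EVALUATE/PSL reports, the main PSL fields can be extracted by hand. The sketch below assumes the standard VAX PSL layout (IPL in bits 16 through 20, previous mode in bits 22 and 23, current mode in bits 24 and 25, interrupt-stack flag in bit 26); that layout comes from the VAX architecture, not from this section, and the function name is hypothetical.

```python
MODES = ("kernel", "executive", "supervisor", "user")


def decode_psl(psl: int) -> dict:
    """Extract a few fields from a VAX PSL (layout assumed, see lead-in)."""
    return {
        "ipl": (psl >> 16) & 0x1F,                   # bits 16-20
        "previous_mode": MODES[(psl >> 22) & 0x3],   # bits 22-23
        "current_mode": MODES[(psl >> 24) & 0x3],    # bits 24-25
        "interrupt_stack": bool((psl >> 26) & 0x1),  # bit 26
    }


print(decode_psl(0x04080000))
```

For the PSL in the signal array this yields IPL 8 in kernel mode on the interrupt stack, which agrees with the stack display identifying the current operating stack as INTERRUPT.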
Use the SHOW PAGE_TABLE command to display the system page table, as shown in the following example. The page containing location 80069E00₁₆ is not available to any access mode (a null page); thus, the virtual address is not valid.
| 
SDA> SHOW PAGE_TABLE 
System page table 
----------------- 
ADDRESS   SVAPTE    PTE      TYPE  PROT BITS PAGTYP LOC     STATE TYPE REFCNT BAK      SVAPTE   FLINK BLINK 
   .
   .
   .
80068400  80777B08  7C40FFC8 STX   UR   K 
80068600  80777B0C  7C40FFC8 STX   UR   K 
80068800  80777B10  7C40FFC8 STX   UR   K 
80068A00  80777B14  7C40FFC8 STX   UR   K 
80068C00  80777B18  7C40FFC8 STX   UR   K 
80068E00  80777B1C  7C40FFC8 STX   UR   K 
80069000  80777B20  7C40FFC8 STX   UR   K 
80069200  80777B24  7C40FFC8 STX   UR   K 
80069400  80777B28  7C40FFC8 STX   UR   K 
80069600  80777B2C  7C40FFC8 STX   UR   K 
80069800  80777B30  7C40FFC8 STX   UR   K 
80069A00  80777B34  780016C9 TRANS UR   K    SYSTEM FREELST 00    01   0      0040FFC8 80777B34 03AF  0E15 
80069C00  80777B38  78000E15 TRANS UR   K    SYSTEM FREELST 00    01   0      0040FFC8 80777B38 16C9  2592 
-------- 40 NULL PAGES 
   .
   .
   .
 | 
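The null-page conclusion can be checked with simple address arithmetic. This Python sketch assumes 512-byte pages (matching the 200₁₆ step between listed addresses) and assumes that the "40 NULL PAGES" line describes the pages immediately following the last listed entry, 80069C00₁₆. The function name is hypothetical.

```python
PAGE_SIZE = 0x200  # 512-byte pages, inferred from the address step in the listing


def null_page_range(last_mapped: int, null_pages: int) -> range:
    """Return the address range covered by a run of null pages that
    starts directly after the page at last_mapped (an assumption)."""
    start = last_mapped + PAGE_SIZE
    return range(start, start + null_pages * PAGE_SIZE)


nulls = null_page_range(0x80069C00, 40)
print(f"{nulls.start:08X} .. {nulls.stop - 1:08X}")
print(0x80069E00 in nulls)
```

Under those assumptions the faulting address is the first byte of the first null page, which is what you would expect from a pointer that walked one byte at a time off the end of mapped memory.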
Because the printer went off line and then came back on line, as shown 
on the console listing in Section 9.2, the problem might exist in the 
driver code. SDA can help you determine which driver might contain the 
faulty code.
9.3.1 Finding the Driver by Using the Program Counter
The first step in determining whether the failing instruction is within a driver is to examine the PC in the signal array using the EXAMINE/INSTRUCTION command. This has two results:
In the following example, the instruction that caused the exception is located within the printer driver.
| 
SDA> EXAMINE/INSTRUCTION 8005D003 
LPDRIVER+2B3   MOVB   (R3)+,(R0) 
 | 
If SDA is unable to find a symbol within FFF₁₆ bytes of the memory location you specify, it displays the location as an absolute address. This often, but not always, means the instruction that caused the exception is not part of a device driver.
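The symbol-plus-offset display can be modeled as a nearest-symbol lookup. The Python sketch below is only an approximation of the behavior described above, not SDA's actual algorithm: it assumes the closest symbol at or below the address is chosen and that an offset of up to FFF₁₆ is accepted. The LPDRIVER value is derived from the example (8005D003 minus 2B3); the function name is hypothetical.

```python
MAX_OFFSET = 0xFFF  # largest offset for which a symbol is still used


def symbolize(address: int, symbols: dict) -> str:
    """Return 'SYMBOL+offset' for the closest symbol at or below address,
    or the absolute address when no symbol is within MAX_OFFSET bytes."""
    best = None
    for name, value in symbols.items():
        offset = address - value
        if 0 <= offset <= MAX_OFFSET and (best is None or offset < best[1]):
            best = (name, offset)
    if best is None:
        return f"{address:08X}"
    name, offset = best
    return f"{name}+{offset:03X}" if offset else name


SYMBOLS = {"LPDRIVER": 0x8005CD50}  # derived from LPDRIVER+2B3 = 8005D003
print(symbolize(0x8005D003, SYMBOLS))
print(symbolize(0x8005DD50, SYMBOLS))
```

The second call is 1000₁₆ bytes past the symbol, so it falls back to an absolute address.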
To determine whether an instruction is part of a driver, use the SHOW DEVICE command to display the starting addresses and lengths of all the drivers in the system. If the address of the failing instruction falls within the range of addresses shown for a given driver, the failing instruction is a part of that driver. The following example shows a partial list of the drivers in the display generated by the SHOW DEVICE command.
| 
I/O data structures 
 
                           DDB list 
                           -------- 
 
    Address    Controller     ACP       Driver      DPT   DPT size 
    -------    ----------     ---       ------      ---   -------- 
 
    80000ECC    HELIUM$DBA    F11XQP    DBDRIVER   800F7AD0  08FD 
    80001040    OPA                     OPERATOR   80001622  0061 
    8000126C    MBA                     MBDRIVER   800015B0  0578 
    80001460    NLA                     NLDRIVER   800015E9  05A3 
    801E2800    HELIUM$DMA    F11XQP    DMDRIVER   800B5CB0  0AA0 
    801E2980    HELIUM$DLA    F11XQP    DLDRIVER   800B6A50  08D0 
   .
   .
   .
 | 
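The range test described above amounts to checking whether the failing address lies between a driver's DPT address and that address plus the DPT size. This Python sketch uses three rows from the sample display; it assumes the DPT size column gives the full length of the driver starting at the DPT, and the function name is hypothetical.

```python
# (driver name, DPT address, DPT size) copied from the sample SHOW DEVICE display
DRIVERS = [
    ("DBDRIVER", 0x800F7AD0, 0x08FD),
    ("DMDRIVER", 0x800B5CB0, 0x0AA0),
    ("DLDRIVER", 0x800B6A50, 0x08D0),
]


def drivers_containing(address: int, drivers=DRIVERS) -> list:
    """Return the names of all drivers whose address range covers address."""
    return [name for name, dpt, size in drivers if dpt <= address < dpt + size]


print(drivers_containing(0x800B6000))
print(drivers_containing(0x8005D003))
```

The failing PC from this example, 8005D003₁₆, matches none of these three rows because the printer driver is not among them; an empty result here only means the partial list does not cover the address.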
The offsets that SDA displays from nnDRIVER are actually offsets from the DPT. As such, these offsets do not exactly correspond to the offsets shown in driver listings, which represent offsets from the beginning of the program section (PSECT) in which a given instruction appears. Because a driver usually contains more than one PSECT, you must use the driver's map to determine the location of the failing instruction within the driver listing.
To calculate the location of the instruction within the driver listing, refer to the "Program Section Synopsis" section of the driver's map. Determine in which PSECT the offset given by SDA occurs and subtract the base of the PSECT from the offset. You can then use the resulting figure as an index into the driver listing.
If SDA does not display the address as an offset from 
nnDRIVER, but the address is within the address range 
of a driver in the SHOW DEVICE display, you must first subtract the 
address of the DPT from the failing address. Using the result as the 
offset, you can then follow the steps previously outlined for 
determining the index of the instruction into a driver listing.
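The subtraction steps above can be sketched in Python as follows. The DPT address is derived from this example (8005D003 minus 2B3), but the PSECT names, bases, and lengths are invented for illustration; real values come from the "Program Section Synopsis" of the driver's map.

```python
def listing_offset(failing_address: int, dpt_address: int, psects: list) -> tuple:
    """Convert a failing address into (PSECT name, offset within that PSECT).

    psects is a list of (name, base, length), with base relative to the DPT.
    """
    offset = failing_address - dpt_address  # offset from the DPT, as SDA shows it
    for name, base, length in psects:
        if base <= offset < base + length:
            return name, offset - base      # index into the driver listing
    raise ValueError(f"offset {offset:X} is not inside any listed PSECT")


# Hypothetical PSECT layout, NOT taken from a real driver map.
PSECTS = [("PSECT_A", 0x000, 0x0A0), ("PSECT_B", 0x0A0, 0x400)]

name, index = listing_offset(0x8005D003, 0x8005CD50, PSECTS)
print(name, f"{index:X}")
```

With this made-up layout, offset 2B3₁₆ from the DPT lands in the second PSECT, so the listing index is 2B3₁₆ minus that PSECT's base.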
9.4 Finding the Problem Within the Routine
To find the problem within the routine, examine the printer's driver code. In the system failure discussed in this example, the instruction that caused the exception is MOVB (R3)+,(R0). To check the contents of R3, use the EXAMINE command as follows:
| 
SDA> EXAMINE R3 
R3:  80069E00   "...." 
 | 
The invalid virtual address, as recorded in the signal array, is stored in R3. In the following driver code excerpt, the instruction in question appears at line 599. It is likely that the contents of R3 have been incremented too many times.
| 
581 STARTIO: 
582         MOVL    UCB$L_IRP(R5),R3        ;Retrieve address of I/O packet 
583         MOVW    IRP$L_MEDIA+2(R3),- 
584                 UCB$W_BOFF(R5)          ;Set number of characters to print 
585         MOVL    UCB$L_SVAPTE(R5),R3     ;Get address of system buffer 
586         MOVAB   12(R3),R3               ;Get address of data area 
587         MOVL    UCB$L_CRB(R5),R4        ;Get address of CRB 
588         MOVL    @CRB$L_INTD+VEC$L_IDB(R4),R4 ;Get device CSR address 
589 ; 
590 ; START NEXT OUTPUT SEQUENCE 
591 ; 
592 
593 10$:    ADDL3   #LP_DBR,R4,R0           ;Calculate address of data buffer register 
594         MOVZWL  UCB$W_BOFF(R5),R1       ;Get number of characters remaining 
595         MOVW    #^X8080,R2              ;Get control register test mask 
596         BRB     25$                     ;Start output 
597 20$:    BITW    R2,(R4)            (1)  ;Printer ready or have paper problem? 
598         BLEQ    30$                     ;If LEQ not ready or paper problem 
599         MOVB    (R3)+,(R0)         (2)  ;Output next character 
600         ASHL    #1,G^EXE$GL_UBDELAY,-(SP) ;Delay 3*2 u-seconds 
601 24$:    SOBGEQ  (SP),24$                ;Delay loop calibrated to machine speed 
602         ADDL    #4,SP                   ;Pop extra longword off stack 
603 25$:    SOBGEQ  R1,20$             (3)  ;Any more characters to output? 
604         BRW     70$                     ;All done, BRW to set return status 
 | 
Explanations of the circled numbers in the example are in Section 9.4.1.
9.4.1 Examining the Routine
The MOVB instruction is part of a routine that reads characters from a buffer and writes them to the printer. The routine contains the loop of instructions that starts at the label 20$ and ends at 25$. This loop executes once for each character in the buffer, performing these steps:
Steps 1 and 2 are repeated until the contents of R1 are 0 or the printer signals that it is not ready.
If the printer signals that it is not ready, the driver transfers control to 30$ (line 598), the beginning of a routine that waits for an interrupt from the printer. When the printer becomes ready, it interrupts the driver and execution of the loop resumes.
Examine the code to determine which variables control the loop.
The byte count (BCNT) is the number of characters in the buffer; a function decision table (FDT) routine sets it. In line 586, the starting address of a buffer that is BCNT bytes in size is moved into R3.
Note also that the number of characters left to be printed is represented by the byte offset (BOFF), the offset into the buffer at which the driver finds the next character to be printed. This value controls the number of times the loop is executed.
Because the exception is an access violation, either R3 or R0 must contain an incorrect value. You can determine that R0 is probably valid by the following logic:
Thus, the contents of R3 seem to be the cause of the failure.
The most likely reason that the contents of R3 are wrong is that the 
MOVB instruction at line 599 executes too many times. You can check 
this by comparing the contents of UCB$W_BOFF and UCB$W_BCNT. If 
UCB$W_BOFF contains a larger value than that in UCB$W_BCNT, then R3 
contains a value that is too large, indicating that the MOVB 
instruction has incremented the contents of R3 too many times.
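The effect of an oversized BOFF on the loop can be seen in a small model. The following Python sketch mirrors only the control flow of lines 593 through 603 and assumes the printer is always ready, so the branch to 30$ never occurs. R3 is modeled as an index into a Python bytes object, and an IndexError stands in for the access violation; on the real system the fault occurs only when R3 reaches an unmapped page, not at the first byte past the buffer.

```python
def output_loop(buffer: bytes, boff: int) -> tuple:
    """Model of the 20$..25$ loop: returns (bytes sent, final R3 index)."""
    r1 = boff            # 594: MOVZWL UCB$W_BOFF(R5),R1
    r3 = 0               # R3 modeled as an index into the data area
    sent = bytearray()
    while True:
        r1 -= 1          # 603: SOBGEQ R1,20$ (decrement, loop while R1 >= 0)
        if r1 < 0:
            break
        sent.append(buffer[r3])  # 599: MOVB (R3)+,(R0)
        r3 += 1
    return bytes(sent), r3


print(output_loop(b"abc", 3))   # BOFF matches the buffer length

try:
    output_loop(b"abc", 5)      # BOFF larger than the buffer
except IndexError:
    print("R3 ran past the end of the buffer")
```

The loop body runs exactly BOFF times, so any BOFF larger than the number of bytes actually in the buffer drives R3 beyond it.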
9.4.2 Checking the Values of Key Variables
Because the start-I/O routine requires that R5 contain the address of the printer's UCB, and because several other instructions reference R5 without error before any instruction in the loop does, you can assume that R5 contains the address of the right UCB. To compare BOFF and BCNT, use the command FORMAT @R5 to display the contents of the UCB, as shown in the following session.
| 
SDA> READ SYS$SYSTEM:SYSDEF.STB 
SDA> FORMAT @R5 
 | 
| 
8005D160    UCB$L_FQFL      800039A8 
            UCB$L_RQFL 
            UCB$W_MB_SEED 
            UCB$W_UNIT_SEED 
8005D164    UCB$L_FQBL      800039A8 
            UCB$L_RQBL 
8005D168    UCB$W_SIZE          0122 
8005D16A    UCB$B_TYPE        10 
8005D16B    UCB$B_FIPL      34 
            UCB$B_FLCK 
   .
   .
   .
8005D1C8    UCB$L_SVAPTE    80062720 
8005D1CC    UCB$W_BOFF          0795 
8005D1CE    UCB$W_BCNT      006D 
8005D1D0    UCB$B_ERTCNT          00 
8005D1D1    UCB$B_ERTMAX        00 
8005D1D2    UCB$W_ERRCNT    0000 
   .
   .
   .
SDA> 
 | 
If you have only one printer in your system configuration, you do not need to use the FORMAT command. Instead, you can use the command SHOW DEVICE LP. Because only one printer is connected to the processor, only one UCB is associated with a printer for SDA to display.
The output produced by the FORMAT @R5 command shows that UCB$W_BOFF contains a value greater than that in UCB$W_BCNT; it should be smaller. Therefore, the value stored in BOFF is incorrect.
Thus, the value of BOFF is not the number of characters that remain in 
the buffer. This value is used in calculating an address that is 
referenced at an elevated IPL. When this address is within a null page 
(unreadable in all access modes), an attempt to reference it causes the 
system to fail.
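The values in the FORMAT display can be compared numerically. This Python sketch takes BOFF, BCNT, and SVAPTE from the display and R3 from the EXAMINE output. It assumes that UCB$L_SVAPTE still holds the system buffer address for the failing request and that the data area starts 12 bytes into that buffer (line 586); both the helper name and the overrun estimate are illustrative.

```python
UCB_SVAPTE = 0x80062720   # UCB$L_SVAPTE from the FORMAT display
UCB_BOFF = 0x0795         # UCB$W_BOFF
UCB_BCNT = 0x006D         # UCB$W_BCNT
R3_AT_FAULT = 0x80069E00  # from EXAMINE R3
DATA_OFFSET = 12          # line 586: MOVAB 12(R3),R3


def boff_is_plausible(boff: int, bcnt: int) -> bool:
    """Characters remaining can never exceed the size of the buffer."""
    return 0 <= boff <= bcnt


# Address of the first byte past the data area, under the stated assumptions.
buffer_end = UCB_SVAPTE + DATA_OFFSET + UCB_BCNT
overrun = R3_AT_FAULT - buffer_end

print(boff_is_plausible(UCB_BOFF, UCB_BCNT))
print(f"R3 is {overrun} bytes past the end of the data area")
```

Under these assumptions, R3 had advanced tens of thousands of bytes beyond a 109-byte buffer before it reached the first null page.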
9.4.3 Identifying and Correcting the Defective Code
Examine the printer driver code to locate all instructions that modify UCB$W_BOFF. The value changes in two circumstances:
When the printer times out, the driver should not modify UCB$W_BOFF. It does so, however, in line 631. The driver should modify the contents of UCB$W_BOFF only when it is certain that the printer printed the character. When the printer times out, this is not the case. Furthermore, the wait-for-interrupt routine preserves only registers R3, R4, and R5, so that only those registers can be used unmodified after the execution of the wait-for-interrupt routine. Thus, the use of R1 in line 631 is an error.
To correct the problem, change the WFIKPCH argument (line 616) so that, when the printer times out, the WFIKPCH macro transfers control to 50$ rather than to 40$.
| 
607 
608 30$:    BNEQ    40$                     ;If NEQ paper problem 
609         ADDW3   #1,R1,UCB$W_BOFF(R5)    ;Save number of characters remaining 
610         DEVICELOCK - 
611                 LOCKADDR=UCB$L_DLCK(R5),- ;Lock device interrupts 
612                 SAVIPL=-(SP)            ;Save current IPL 
613         BITW    #^X80,LP_CSR(R4)        ;Is it ready now? 
614         BNEQ    35$                     ;If NEQ, yes, it's ready 
615         BISB    #^X40,LP_CSR(R4)        ;Set interrupt enable 
616         WFIKPCH 40$,#12                 ;Wait for ready interrupt 
617         IOFORK                          ;Create a fork process 
618         BRB     10$                     ; ...and start next output 
619 
620 35$: 
621         DEVICEUNLOCK - 
622                 LOCKADDR=UCB$L_DLCK(R5),- ;Unlock device interrupts 
623                 NEWIPL=(SP)+            ;Restore IPL 
624         CLRW    LP_CSR(R4)              ;Disable device interrupts 
625         BRB     10$                     ;Go transfer more characters 
626 ; 
627 ; PRINTER HAS PAPER PROBLEM 
628 ; 
629 
630 40$:    CLRL    UCB$L_LP_OFLCNT(R5)     ;Clear offline counter 
631         ADDW3   #1,R1,UCB$W_BOFF(R5)    ;Save number of characters remaining 
632 50$:    CLRW    LP_CSR(R4)              ;Disable printer interrupt 
633         IOFORK                          ;Lower to fork level 
634         BBS     #UCB$V_CANCEL,UCB$W_STS(R5),80$ ;If set, cancel I/O operation 
635         TSTW    LP_CSR(R4)              ;Printer still have paper problem? 
636         BLSS    55$                     ;If LSS yes 
637         MOVL    #15,UCB$L_LP_TIMEOUT(R5) ;Set timeout value 
638         BRB     10$                     ; ...and start next output 
 | 
If the operating system is not performing well and you want to create a dump you can examine, you must induce a system failure. Occasionally, a device driver or other user-written, kernel-mode code can cause the system to execute a loop of code at a high priority, interfering with normal system operation. This can occur even though you have set a breakpoint in the code if the loop is encountered before the breakpoint. To gain control of the system in such circumstances, you must cause the system to fail and then reboot it.
If the system has suspended all noticeable activity (if it is "hung"), see the examples of causing system failures in Section 10.2.
If you are generating a system crash in response to a system hang, be 
sure to record the PC at the time of the system halt as well as the 
contents of the general registers. Submit this information to Digital, 
along with the Software Performance Report (SPR) and a copy of the 
generated system dump file.
10.1 Meeting Crash Dump Requirements
The following requirements must be met before the system can write a complete crash dump: