Document revision date: 19 July 1999
A PGFIPLHI bugcheck occurs when a page fault occurs while the interrupt priority level (IPL) is greater than 2 (IPL$_ASTDEL). When the system fails because of an illegal page fault, the following message appears on the console terminal:
    PGFIPLHI, page fault with IPL too high
When an illegal page fault occurs, the stack appears as shown in Figure SDA-4.
Figure SDA-4 Stack Following an Illegal Page-Fault Error
Six longwords describe the error:
Longword | Contents |
---|---|
R4 | Contents of R4 at the time of the bugcheck. |
R5 | Contents of R5 at the time of the bugcheck. |
Reason mask | Longword mask. If bit 0 of this longword is set, the failing instruction (at the PC saved below) caused a length violation. If bit 1 is set, it referred to a location whose page table entry is in a "no access" page. Bit 2 indicates the type of access used by the failing instruction: it is set for write and modify operations and clear for read operations. |
Virtual address | Virtual address being referenced by the instruction that caused the page fault. |
PC | Address of the instruction that caused the page fault. |
PSL | PSL at the time of the page fault. |
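The bit tests described for the reason mask can be sketched in a few lines of Python. This is only an illustration for use outside SDA: the function name and dictionary keys are hypothetical, and the bit meanings are taken directly from the table above.

```python
def decode_reason_mask(mask: int) -> dict:
    """Split the reason-mask longword into the three bits the table describes."""
    return {
        # Bit 0: the failing instruction caused a length violation.
        "length_violation": bool(mask & 0x1),
        # Bit 1: the page-table-entry case described in the table.
        "pte_condition": bool(mask & 0x2),
        # Bit 2: set for write and modify operations, clear for reads.
        "write_or_modify": bool(mask & 0x4),
    }


if __name__ == "__main__":
    # Example value chosen for illustration only.
    print(decode_reason_mask(0x4))
```

A mask of 0 therefore describes a read access with neither of the other two conditions set.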
If the operating system detects a page fault while the IPL is higher
than IPL$_ASTDEL, you can obtain the address of the instruction that
caused the fault by examining the PC pushed onto the current operating
stack. Follow the steps outlined in Section 9.3 to determine which
module issued the instruction.
9 A Sample System Failure
This section steps through the analysis of a system failure using, as an example, a printer driver. Three events lead up to this failure:
The following sections describe the actions to take in investigating
the causes of this system crash.
9.1 Identifying the Bugcheck
First, invoke SDA to analyze the system dump file. The initialization message indicates the type of bugcheck that occurred as follows:
    Dump taken on 31-JAN-1993 16:34:31.23
    INVEXCEPTN, Exception while above ASTDEL or on interrupt stack
    SDA>
An exception occurred that caused the system to signal a bugcheck, and
signal and mechanism arrays have been created on the current operating
stack.
9.2 Identifying the Exception
Use the SHOW STACK command to display the current operating stack. In this case, it is the interrupt stack. The following example shows the interrupt stack and the signal and mechanism arrays. See the SHOW STACK command for a complete description of the format of the stack display.
    CPU 01 Processor stack
    ----------------------
    Current operating stack (INTERRUPT)

                8006A378  8000844B    ACP$WRITEBLK+0A0
                    .
                    .
                    .
         SP =>  8006A398  7FFDC340
                8006A39C  8006A3A0
                8006A3A0  80004E7D    EXE$REFLECT+0D4
                8006A3A4  04080009
                8006A3A8  00000004
                8006A3AC  7FFDC368
                8006A3B0  FFFFFFFD
                8006A3B4  8001774E
                8006A3B8  0000074F
                8006A3BC  00000001
                8006A3C0  00000005
                8006A3C4  0000000C
                8006A3C8  00000000
                8006A3CC  80069E00
                8006A3D0  8005D003
                8006A3D4  04080000
                8006A3D8  80009604    EXE$FORKDSPTH+01C
                    .
                    .
                    .
The mechanism array begins at address 8006A3A8₁₆ and ends at address 8006A3B8₁₆. Its first longword contains 00000004₁₆. The signal array begins at address 8006A3C0₁₆ and ends at 8006A3D4₁₆. Its first longword contains 00000005₁₆ and its second longword contains 0000000C₁₆. Examination of the signal array shows the following:
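As a rough aid, the six longwords of the signal array can be labeled with a short Python sketch. The values are copied from the stack display above. The field order (argument count, condition value, reason mask, virtual address, PC, PSL) is an assumption based on the usual layout of an access-violation signal array; the names `SIGNAL_ARRAY`, `FIELDS`, and `decode_signal_array` are hypothetical.

```python
# Longwords at addresses 8006A3C0 through 8006A3D4 in the stack display.
SIGNAL_ARRAY = [
    0x00000005,  # number of longwords that follow
    0x0000000C,  # condition value (the access violation in this example)
    0x00000000,  # reason mask
    0x80069E00,  # virtual address that was referenced
    0x8005D003,  # PC of the failing instruction
    0x04080000,  # PSL at the time of the exception
]

# Assumed field order for an access-violation signal array.
FIELDS = ["arg_count", "condition", "reason_mask", "virtual_address", "pc", "psl"]


def decode_signal_array(longwords):
    """Pair each longword with its assumed field name."""
    if longwords[0] != len(longwords) - 1:
        raise ValueError("argument count does not match the array length")
    return dict(zip(FIELDS, longwords))


if __name__ == "__main__":
    for name, value in decode_signal_array(SIGNAL_ARRAY).items():
        print(f"{name:16} {value:08X}")
```

The virtual address, PC, and PSL labeled here are the same values examined in the steps that follow.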
Issuing the SDA command EVALUATE/PSL 04080000 makes the following information apparent:
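The same fields can be extracted by hand. The following Python sketch is not a reproduction of the EVALUATE/PSL display; it assumes the standard VAX PSL layout (condition codes in bits 0 through 3, IPL in bits 16 through 20, previous mode in bits 22 and 23, current mode in bits 24 and 25, and the interrupt-stack flag in bit 26), which this section does not itself spell out. The function and key names are hypothetical.

```python
ACCESS_MODES = ["KERNEL", "EXECUTIVE", "SUPERVISOR", "USER"]


def decode_psl(psl: int) -> dict:
    """Extract the commonly inspected PSL fields (assumed VAX bit layout)."""
    return {
        "C": bool(psl & 0x1),
        "V": bool(psl & 0x2),
        "Z": bool(psl & 0x4),
        "N": bool(psl & 0x8),
        "ipl": (psl >> 16) & 0x1F,
        "previous_mode": ACCESS_MODES[(psl >> 22) & 0x3],
        "current_mode": ACCESS_MODES[(psl >> 24) & 0x3],
        "interrupt_stack": bool(psl & (1 << 26)),
    }


if __name__ == "__main__":
    for name, value in decode_psl(0x04080000).items():
        print(f"{name:16} {value}")
```

Under that assumed layout, 04080000 decodes to kernel mode on the interrupt stack at IPL 8, which is consistent with the INVEXCEPTN message (above ASTDEL or on the interrupt stack).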
Use the SHOW PAGE_TABLE command to display the system page table, as shown in the following example. The page containing location 80069E00₁₆ is not available to any access mode (a null page); thus, the virtual address is not valid.
    SDA> SHOW PAGE_TABLE
    System page table
    -----------------
    ADDRESS   SVAPTE    PTE       TYPE   PROT  BITS  PAGTYP  LOC      STATE  TYPE  REFCNT  BAK       SVAPTE    FLINK  BLINK
        .
        .
        .
    80068400  80777B08  7C40FFC8  STX    UR    K
    80068600  80777B0C  7C40FFC8  STX    UR    K
    80068800  80777B10  7C40FFC8  STX    UR    K
    80068A00  80777B14  7C40FFC8  STX    UR    K
    80068C00  80777B18  7C40FFC8  STX    UR    K
    80068E00  80777B1C  7C40FFC8  STX    UR    K
    80069000  80777B20  7C40FFC8  STX    UR    K
    80069200  80777B24  7C40FFC8  STX    UR    K
    80069400  80777B28  7C40FFC8  STX    UR    K
    80069600  80777B2C  7C40FFC8  STX    UR    K
    80069800  80777B30  7C40FFC8  STX    UR    K
    80069A00  80777B34  780016C9  TRANS  UR    K     SYSTEM  FREELST  00     01    0       0040FFC8  80777B34  03AF   0E15
    80069C00  80777B38  78000E15  TRANS  UR    K     SYSTEM  FREELST  00     01    0       0040FFC8  80777B38  16C9   2592
    -------- 40 NULL PAGES
        .
        .
        .
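To confirm by arithmetic that the failing address lies in the run of null pages, you can round it down to a page boundary and compare it with the last mapped page in the display. The Python sketch below assumes 512-byte (200₁₆) pages, which matches the spacing of the ADDRESS column, and treats the count in "40 NULL PAGES" as decimal, which is a guess. All names are hypothetical.

```python
PAGE_SIZE = 0x200              # spacing between rows in the ADDRESS column
LAST_MAPPED_PAGE = 0x80069C00  # last row shown before the null-page run
NULL_PAGE_COUNT = 40           # from "40 NULL PAGES"; radix is an assumption


def page_base(va: int) -> int:
    """Round a virtual address down to the start of its page."""
    return va & ~(PAGE_SIZE - 1)


def in_null_run(va: int) -> bool:
    """True if va falls inside the null pages that follow the last mapped page."""
    first_null = LAST_MAPPED_PAGE + PAGE_SIZE
    return first_null <= va < first_null + NULL_PAGE_COUNT * PAGE_SIZE


if __name__ == "__main__":
    va = 0x80069E00
    print(f"page base {page_base(va):08X}, in null run: {in_null_run(va)}")
```

Whatever the radix of the count, 80069E00 is the first byte of the first null page, one byte past the end of the page that starts at 80069C00.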
Because the printer went off line and then came back on line, as shown
on the console listing in Section 9.2, the problem might exist in the
driver code. SDA can help you determine which driver might contain the
faulty code.
9.3.1 Finding the Driver by Using the Program Counter
The first step in determining whether the failing instruction is within a driver is to examine the PC in the signal array using the EXAMINE/INSTRUCTION command. This has two results:
In the following example, the instruction that caused the exception is located within the printer driver.
    SDA> EXAMINE/INSTRUCTION 8005D003
    LPDRIVER+2B3    MOVB    (R3)+,(R0)
If SDA is unable to find a symbol within FFF₁₆ bytes of the memory location you specify, it displays the location as an absolute address. This often, but not always, means the instruction that caused the exception is not part of a device driver.
To determine whether an instruction is part of a driver, use the SHOW DEVICE command to display the starting addresses and lengths of all the drivers in the system. If the address of the failing instruction falls within the range of addresses shown for a given driver, the failing instruction is a part of that driver. The following example shows a partial list of the drivers in the display generated by the SHOW DEVICE command.
    I/O data structures
    DDB list
    --------
    Address    Controller    ACP       Driver      DPT        DPT size
    -------    ----------    ---       ------      ---        --------
    80000ECC   HELIUM$DBA    F11XQP    DBDRIVER    800F7AD0   08FD
    80001040   OPA                     OPERATOR    80001622   0061
    8000126C   MBA                     MBDRIVER    800015B0   0578
    80001460   NLA                     NLDRIVER    800015E9   05A3
    801E2800   HELIUM$DMA    F11XQP    DMDRIVER    800B5CB0   0AA0
    801E2980   HELIUM$DLA    F11XQP    DLDRIVER    800B6A50   08D0
        .
        .
        .
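The range check described above amounts to testing whether an address lies between a driver's DPT address and that address plus the DPT size. A minimal Python sketch, using only the rows of the partial listing and hypothetical names, might look like this:

```python
# (driver, DPT address, DPT size) copied from the partial SHOW DEVICE listing.
DRIVERS = [
    ("DBDRIVER", 0x800F7AD0, 0x08FD),
    ("OPERATOR", 0x80001622, 0x0061),
    ("MBDRIVER", 0x800015B0, 0x0578),
    ("NLDRIVER", 0x800015E9, 0x05A3),
    ("DMDRIVER", 0x800B5CB0, 0x0AA0),
    ("DLDRIVER", 0x800B6A50, 0x08D0),
]


def drivers_containing(address: int) -> list:
    """Return every listed driver whose DPT range includes the address."""
    return [
        name
        for name, dpt, size in DRIVERS
        if dpt <= address < dpt + size
    ]


if __name__ == "__main__":
    print(drivers_containing(0x800F7B00))
    print(drivers_containing(0x8005D003))
```

The function returns a list rather than a single name because some ranges in this listing overlap. The failing PC, 8005D003, matches none of the six rows, which is expected: the listing is partial and does not include the printer driver.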
The offsets that SDA displays from nnDRIVER are actually offsets from the DPT. As such, these offsets do not exactly correspond to the offsets shown in driver listings, which represent offsets from the beginning of the program section (PSECT) in which a given instruction appears. Because a driver usually contains more than one PSECT, you must use the driver's map to determine the location of the failing instruction within the driver listing.
To calculate the location of the instruction within the driver listing, refer to the "Program Section Synopsis" section of the driver's map. Determine in which PSECT the offset given by SDA occurs and subtract the base of the PSECT from the offset. You can then use the resulting figure as an index into the driver listing.
If SDA does not display the address as an offset from
nnDRIVER, but the address is within the address range
of a driver in the SHOW DEVICE display, you must first subtract the
address of the DPT from the failing address. Using the result as the
offset, you can then follow the steps previously outlined for
determining the index of the instruction into a driver listing.
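The two subtractions can be sketched as follows. The DPT address used here is derived from the example itself (8005D003 minus the displayed offset 2B3), and the PSECT names, bases, and lengths are entirely hypothetical; take the real ones from the "Program Section Synopsis" of your driver's map.

```python
def dpt_offset(failing_address: int, dpt_address: int) -> int:
    """Offset of the failing instruction from the start of the DPT."""
    return failing_address - dpt_address


def listing_offset(offset: int, psects: list) -> tuple:
    """Find the PSECT containing the offset and return (name, offset within it)."""
    for name, base, length in psects:
        if base <= offset < base + length:
            return name, offset - base
    raise LookupError(f"offset {offset:X} is not inside any listed PSECT")


if __name__ == "__main__":
    # Hypothetical PSECT layout, for illustration only.
    psects = [("PSECT_A", 0x000, 0x038), ("PSECT_B", 0x038, 0x400)]

    offset = dpt_offset(0x8005D003, 0x8005CD50)
    name, index = listing_offset(offset, psects)
    print(f"DPT offset {offset:X} -> {name}+{index:X}")
```

With the hypothetical layout above, offset 2B3 would fall in the second PSECT at index 27B; with a real map, the resulting index is the value to look up in the driver listing.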
9.4 Finding the Problem Within the Routine
To find the problem within the routine, examine the printer's driver code. In the system failure discussed in this example, the instruction that caused the exception is MOVB (R3)+,(R0). To check the contents of R3, use the EXAMINE command as follows:
    SDA> EXAMINE R3
    R3:  80069E00   "...."
The invalid virtual address, as recorded in the signal array, is stored in R3. In the following driver code excerpt, the instruction in question appears at line 599. It is likely that the contents of R3 have been incremented too many times.
    581 STARTIO:
    582         MOVL    UCB$L_IRP(R5),R3                ;Retrieve address of I/O packet
    583         MOVW    IRP$L_MEDIA+2(R3),-
    584                 UCB$W_BOFF(R5)                  ;Set number of characters to print
    585         MOVL    UCB$L_SVAPTE(R5),R3             ;Get address of system buffer
    586         MOVAB   12(R3),R3                       ;Get address of data area
    587         MOVL    UCB$L_CRB(R5),R4                ;Get address of CRB
    588         MOVL    @CRB$L_INTD+VEC$L_IDB(R4),R4    ;Get device CSR address
    589 ;
    590 ; START NEXT OUTPUT SEQUENCE
    591 ;
    592
    593 10$:    ADDL3   #LP_DBR,R4,R0                   ;Calculate address of data buffer register
    594         MOVZWL  UCB$W_BOFF(R5),R1               ;Get number of characters remaining
    595         MOVW    #^X8080,R2                      ;Get control register test mask
    596         BRB     25$                             ;Start output
    597 20$:    BITW    R2,(R4)                  (1)    ;Printer ready or have paper problem?
    598         BLEQ    30$                             ;If LEQ not ready or paper problem
    599         MOVB    (R3)+,(R0)               (2)    ;Output next character
    600         ASHL    #1,G^EXE$GL_UBDELAY,-(SP)       ;Delay 3*2 u-seconds
    601 24$:    SOBGEQ  (SP),24$                        ;Delay loop calibrated to machine speed
    602         ADDL    #4,SP                           ;Pop extra longword off stack
    603 25$:    SOBGEQ  R1,20$                   (3)    ;Any more characters to output?
    604         BRW     70$                             ;All done, BRW to set return status
Explanations of the circled numbers in the example are in Section 9.4.1.
9.4.1 Examining the Routine
The MOVB instruction is part of a routine that reads characters from a buffer and writes them to the printer. The routine contains the loop of instructions that starts at the label 20$ and ends at 25$. This loop executes once for each character in the buffer, performing these steps:
Steps 1 and 2 are repeated until the contents of R1 are 0 or the printer signals that it is not ready.
If the printer signals that it is not ready, the driver transfers control to 30$ (line 598), the beginning of a routine that waits for an interrupt from the printer. When the printer becomes ready, it interrupts the driver and execution of the loop resumes.
Examine the code to determine which variables control the loop.
The byte count (BCNT) is the number of characters in the buffer. Note that BCNT is set by a function decision table (FDT) routine and that this routine sets the value of BCNT to the number of characters in the buffer. In line 586, the starting address of a buffer that is BCNT bytes in size is moved into R3.
Note also that the number of characters left to be printed is represented by the byte offset (BOFF), the offset into the buffer at which the driver finds the next character to be printed. This value controls the number of times the loop is executed.
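To see how BOFF controls the loop, it can help to model lines 594 through 603 outside the driver. The Python sketch below is a simplified model, not the driver's code: the `ready` callback stands in for the BITW/BLEQ test at lines 597 and 598, the delay loop is omitted, and all names are hypothetical. It returns the number of characters output and, if the printer was found not ready, the value that line 609 would save back in BOFF.

```python
def run_output_loop(boff: int, ready) -> tuple:
    """Model the character loop; return (characters output, BOFF saved at 30$)."""
    r1 = boff                  # line 594: R1 = characters remaining
    printed = 0
    while True:
        r1 -= 1                # line 603: SOBGEQ decrements R1 ...
        if r1 < 0:             # ... and falls through when R1 goes negative
            return printed, None
        if not ready(printed):             # lines 597-598: branch to 30$
            return printed, (r1 + 1) & 0xFFFF  # line 609: ADDW3 #1,R1,BOFF
        printed += 1           # line 599: MOVB (R3)+,(R0) advances R3 by one


if __name__ == "__main__":
    # Printer always ready: every remaining character is output.
    print(run_output_loop(3, lambda n: True))
    # Printer not ready after one character: the rest are saved in BOFF.
    print(run_output_loop(3, lambda n: n < 1))
```

In this model, the number of times the MOVB executes in one pass equals the value loaded from BOFF unless the printer becomes not ready, so a BOFF value larger than the buffer advances R3 past the end of the buffer.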
Because the exception is an access violation, either R3 or R0 must contain an incorrect value. You can determine that R0 is probably valid by the following logic:
Thus, the contents of R3 seem to be the cause of the failure.
The most likely reason that the contents of R3 are wrong is that the
MOVB instruction at line 599 executes too many times. You can check
this by comparing the contents of UCB$W_BOFF and UCB$W_BCNT. If
UCB$W_BOFF contains a larger value than that in UCB$W_BCNT, then R3
contains a value that is too large, indicating that the MOVB
instruction has incremented the contents of R3 too many times.
9.4.2 Checking the Values of Key Variables
Because the start-I/O routine requires that R5 contain the address of the printer's UCB, and because several other instructions reference R5 without error before any instruction in the loop does, you can assume that R5 contains the address of the right UCB. To compare BOFF and BCNT, use the command FORMAT @R5 to display the contents of the UCB, as shown in the following session.
    SDA> READ SYS$SYSTEM:SYSDEF.STB
    SDA> FORMAT @R5
    8005D160   UCB$L_FQFL        800039A8
               UCB$L_RQFL
               UCB$W_MB_SEED
               UCB$W_UNIT_SEED
    8005D164   UCB$L_FQBL        800039A8
               UCB$L_RQBL
    8005D168   UCB$W_SIZE        0122
    8005D16A   UCB$B_TYPE        10
    8005D16B   UCB$B_FIPL        34
               UCB$B_FLCK
        .
        .
        .
    8005D1C8   UCB$L_SVAPTE      80062720
    8005D1CC   UCB$W_BOFF        0795
    8005D1CE   UCB$W_BCNT        006D
    8005D1D0   UCB$B_ERTCNT      00
    8005D1D1   UCB$B_ERTMAX      00
    8005D1D2   UCB$W_ERRCNT      0000
        .
        .
        .
    SDA>
If you have only one printer in your system configuration, you do not need to use the FORMAT command. Instead, you can use the command SHOW DEVICE LP. Because only one printer is connected to the processor, only one UCB is associated with a printer for SDA to display.
The output produced by the FORMAT @R5 command shows that UCB$W_BOFF contains a value greater than that in UCB$W_BCNT; it should be smaller. Therefore, the value stored in BOFF is incorrect.
Thus, the value of BOFF is not the number of characters that remain in
the buffer. This value is used in calculating an address that is
referenced at an elevated IPL. When this address is within a null page
(unreadable in all access modes), an attempt to reference it causes the
system to fail.
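The comparison can also be checked numerically from the values in the FORMAT display and the contents of R3. This Python sketch assumes that UCB$L_SVAPTE still describes the buffer that was being printed when the system failed, and that the data area begins 12 bytes into that buffer, as line 586 of the driver listing indicates. The names are hypothetical.

```python
SVAPTE = 0x80062720   # UCB$L_SVAPTE from the FORMAT display
BOFF = 0x0795         # UCB$W_BOFF
BCNT = 0x006D         # UCB$W_BCNT
R3 = 0x80069E00       # contents of R3 at the time of the failure

# Line 586: the data area starts 12 bytes past the buffer address.
BUFFER_START = SVAPTE + 12


def overrun(r3: int, start: int, bcnt: int) -> int:
    """Bytes by which R3 lies beyond the end of a bcnt-byte buffer."""
    return r3 - (start + bcnt)


if __name__ == "__main__":
    print(f"BOFF exceeds BCNT: {BOFF > BCNT}")
    print(f"data area starts at {BUFFER_START:08X}")
    print(f"R3 is {overrun(R3, BUFFER_START, BCNT)} bytes past the buffer end")
```

Under those assumptions, R3 points tens of thousands of bytes beyond a buffer that holds only 6D₁₆ characters, which supports the conclusion that the MOVB instruction ran far too many times.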
9.4.3 Identifying and Correcting the Defective Code
Examine the printer driver code to locate all instructions that modify UCB$W_BOFF. The value changes in two circumstances:
When the printer times out, the driver should not modify UCB$W_BOFF. It does so, however, in line 631. The driver should modify the contents of UCB$W_BOFF only when it is certain that the printer printed the character. When the printer times out, this is not the case. Furthermore, the wait-for-interrupt routine preserves only registers R3, R4, and R5, so that only those registers can be used unmodified after the execution of the wait-for-interrupt routine. Thus, the use of R1 in line 631 is an error.
To correct the problem, change the WFIKPCH argument (line 616) so that, when the printer times out, the WFIKPCH macro transfers control to 50$ rather than to 40$.
    607
    608 30$:    BNEQ    40$                             ;If NEQ paper problem
    609         ADDW3   #1,R1,UCB$W_BOFF(R5)            ;Save number of characters remaining
    610         DEVICELOCK -
    611                 LOCKADDR=UCB$L_DLCK(R5),-       ;Lock device interrupts
    612                 SAVIPL=-(SP)                    ;Save current IPL
    613         BITW    #^X80,LP_CSR(R4)                ;Is it ready now?
    614         BNEQ    35$                             ;If NEQ, yes, it's ready
    615         BISB    #^X40,LP_CSR(R4)                ;Set interrupt enable
    616         WFIKPCH 40$,#12                         ;Wait for ready interrupt
    617         IOFORK                                  ;Create a fork process
    618         BRB     10$                             ; ...and start next output
    619
    620 35$:
    621         DEVICEUNLOCK -
    622                 LOCKADDR=UCB$L_DLCK(R5),-       ;Unlock device interrupts
    623                 NEWIPL=(SP)+                    ;Restore IPL
    624         CLRW    LP_CSR(R4)                      ;Disable device interrupts
    625         BRB     10$                             ;Go transfer more characters
    626 ;
    627 ; PRINTER HAS PAPER PROBLEM
    628 ;
    629
    630 40$:    CLRL    UCB$L_LP_OFLCNT(R5)             ;Clear offline counter
    631         ADDW3   #1,R1,UCB$W_BOFF(R5)            ;Save number of characters remaining
    632 50$:    CLRW    LP_CSR(R4)                      ;Disable printer interrupt
    633         IOFORK                                  ;Lower to fork level
    634         BBS     #UCB$V_CANCEL,UCB$W_STS(R5),80$ ;If set, cancel I/O operation
    635         TSTW    LP_CSR(R4)                      ;Printer still have paper problem?
    636         BLSS    55$                             ;If LSS yes
    637         MOVL    #15,UCB$L_LP_TIMEOUT(R5)        ;Set timeout value
    638         BRB     10$                             ; ...and start next output
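The effect of the timeout path on BOFF can be modeled in a few lines. This Python sketch is a simplification with hypothetical names: it compares entering at 40$, where line 631 recomputes BOFF from an R1 that the wait-for-interrupt routine did not preserve, with entering at 50$, where the value saved at line 609 is left alone. The R1 value used in the example is invented for illustration.

```python
def boff_after_timeout(saved_boff: int, r1_after_wait: int, timeout_label: str) -> int:
    """Return the BOFF value left in the UCB after the timeout path runs."""
    if timeout_label == "40$":
        # Line 631: ADDW3 #1,R1,UCB$W_BOFF(R5) with whatever R1 now holds.
        return (r1_after_wait + 1) & 0xFFFF
    if timeout_label == "50$":
        # Corrected path: BOFF keeps the value saved at line 609.
        return saved_boff
    raise ValueError(f"unexpected label {timeout_label!r}")


if __name__ == "__main__":
    saved = 0x0005        # hypothetical count saved before the wait
    garbage_r1 = 0x0794   # hypothetical contents of R1 after the wait
    print(f"timeout to 40$: BOFF = {boff_after_timeout(saved, garbage_r1, '40$'):04X}")
    print(f"timeout to 50$: BOFF = {boff_after_timeout(saved, garbage_r1, '50$'):04X}")
```

In this model, the original timeout target overwrites a valid count with an arbitrary one, while the corrected target leaves the count untouched.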
If the operating system is not performing well and you want to create a dump you can examine, you must induce a system failure. Occasionally, a device driver or other user-written, kernel-mode code can cause the system to execute a loop of code at a high priority, interfering with normal system operation. This can occur even though you have set a breakpoint in the code if the loop is encountered before the breakpoint. To gain control of the system in such circumstances, you must cause the system to fail and then reboot it.
If the system has suspended all noticeable activity (if it is "hung"), see the examples of causing system failures in Section 10.2.
If you are generating a system crash in response to a system hang, be
sure to record the PC at the time of the system halt as well as the
contents of the general registers. Submit this information to Digital,
along with the Software Performance Report (SPR) and a copy of the
generated system dump file.
10.1 Meeting Crash Dump Requirements
The following requirements must be met before the system can write a complete crash dump: