Updated: 11 December 1998
OpenVMS VAX System Dump Analyzer Utility Manual
This section discusses how the operating system handles internal errors and suggests procedures that can aid you in determining the causes of these errors. To conclude, it illustrates, through detailed analysis of a sample system failure, how SDA helps you find the causes of operating system problems.
For a complete description of the commands discussed in the sections
that follow, refer to the SDA Commands section.
8.1 General Procedure for Analyzing System Failures
When the operating system detects an internal error so severe that normal operation cannot continue, it signals a condition known as a fatal bugcheck and shuts itself down. A specific bugcheck code describes each such error.
To resolve the problem, you must find the reason for the bugcheck. Most failures are caused by errors in user-written device drivers or other privileged code not supplied by Digital. To identify and correct these errors, you need a listing of the code in question.
Occasionally, a system failure is the result of a hardware failure or an error in code supplied by Digital. A hardware failure requires the attention of Digital Services. To diagnose an error in code supplied by Digital, you need listings of that code, which are available from Digital on CDROM.
Following are the steps you can take to diagnose an error:
    SDA> SHOW EXECUTIVE
    SDA> READ/EXECUTIVE SYS$LOADABLE_IMAGES:
    SDA> EXAMINE @PC
Several conditions result in a bugcheck. These conditions are normally
rare. When one does occur, it is most likely a fatal exception or an
illegal page fault within privileged code. This section describes the
symptoms of these bugchecks. A discussion of other exceptions and of
condition handling in general appears in the OpenVMS System Services
Reference Manual.
8.2.1 Fatal Exceptions
An exception is fatal when it occurs while the following conditions exist:
When the system fails, the operating system reports the approximate cause of the failure on the console terminal. SDA displays a similar message when you issue a SHOW CRASH command. For instance, for a fatal exception, SDA can display one of these messages:
    FATALEXCPT, Fatal executive or kernel mode exception
    INVEXCEPTN, Exception while above ASTDEL or on interrupt stack
    SSRVEXCEPT, Unexpected system service exception
Although several exception conditions are possible, access violations are the most common. When the hardware detects an access violation, information useful in finding the cause of the violation is pushed onto either the kernel stack or the interrupt stack. If the access violation occurs when the hardware is using the interrupt stack, this information appears on the interrupt stack.
The INVEXCEPTN, SSRVEXCEPT, and FATALEXCPT bugchecks place two argument lists, known as the mechanism and signal arrays, on the stack.
The SSRVEXCEPT and FATALEXCPT bugchecks push an additional argument list onto the stack above these arrays; INVEXCEPTN does not. This pointer array (see Figure SDA-1) contains the number 2 in its first longword, indicating that the following two longwords complete the array. The second longword contains the stack address of the signal array; the third contains the stack address of the mechanism array.
Figure SDA-1 Pointer Argument List on the Stack
The first longword of the mechanism array (see Figure SDA-2) contains a 4, indicating that the four subsequent longwords complete the array. These four longwords are used by the procedures that search for a condition handler and report exceptions.
Figure SDA-2 Mechanism Array
The values in the mechanism array are the following:
Value | Meaning |
---|---|
00000004 | Number of longwords that follow. In a mechanism array, this value is always 4. |
Frame | Address of the FP (frame pointer) of the establisher's call frame. |
Depth | Depth of the search for a condition handler. |
R0 | Contents of R0 at the time of the exception. |
R1 | Contents of R1 at the time of the exception. |
The signal array (see Figure SDA-3) appears somewhat further down the stack. A signal array contains the exception code, zero or more exception parameters, the PC, and the PSL. The size of a signal array can thus vary from exception to exception.
Figure SDA-3 Signal Array
For access violations, the signal array is set up as follows:
Value | Meaning |
---|---|
00000005 | Number of longwords that follow. For access violations, this value is always 5. |
0000000C | Exception code. The value 0C (hexadecimal) represents an access violation. You can identify the exception code by using the SDA command EVALUATE/CONDITION. |
Reason mask | Longword mask. If bit 0 of this longword is set, the failing instruction (at the PC saved below) caused a length violation. If bit 1 is set, it referred to a location whose page table entry is in a "no access" page. Bit 2 indicates the type of access used by the failing instruction: it is set for write and modify operations and clear for read operations. |
Virtual address | Virtual address that the failing instruction tried to reference. |
PC | PC whose execution resulted in the exception. |
PSL | PSL at the time of the exception. |
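The table above can be read mechanically. The following Python sketch is only an illustration, not part of SDA: it decodes the six longwords of an access-violation signal array and splits the reason mask into its three bits. The function and field names are made up for this example, and the sample values are the ones that appear in the stack display in Section 9.2.

```python
from typing import NamedTuple

SS_ACCVIO = 0x0C  # exception code for an access violation, per the table above


class AccessViolation(NamedTuple):
    length_violation: bool   # reason mask bit 0
    pte_no_access: bool      # reason mask bit 1
    write_or_modify: bool    # reason mask bit 2 (clear means a read)
    virtual_address: int
    pc: int
    psl: int


def decode_accvio_signal_array(longwords):
    """Decode the six longwords of an access-violation signal array."""
    count, code, reason, va, pc, psl = longwords
    if count != 5 or code != SS_ACCVIO:
        raise ValueError("not an access-violation signal array")
    return AccessViolation(
        length_violation=bool(reason & 0x1),
        pte_no_access=bool(reason & 0x2),
        write_or_modify=bool(reason & 0x4),
        virtual_address=va,
        pc=pc,
        psl=psl,
    )


# Sample values taken from the stack display in Section 9.2.
sample = [0x00000005, 0x0000000C, 0x00000000, 0x80069E00, 0x8005D003, 0x04080000]
info = decode_accvio_signal_array(sample)
print(f"VA={info.virtual_address:08X} PC={info.pc:08X} write={info.write_or_modify}")
```

With a reason mask of zero, all three flags are clear, so the failing access was a read that was not flagged as a length violation.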
In the case of a fatal exception, you can find the code that signaled
it by examining the PC in the signal array. Use the SHOW STACK command
to display the stack in use when the failure occurred and then locate
the mechanism and signal arrays. Once you obtain the PC, which points
to the instruction that signaled the exception, you can identify the
module where the instruction is located by following the instructions
in Section 9.3.
8.2.2 Illegal Page Faults
A PGFIPLHI bugcheck is signaled when a page fault occurs while the interrupt priority level (IPL) is greater than 2 (IPL$_ASTDEL). When the system fails because of an illegal page fault, the following message appears on the console terminal:
    PGFIPLHI, page fault with IPL too high
When an illegal page fault occurs, the stack appears as shown in Figure SDA-4.
Figure SDA-4 Stack Following an Illegal Page-Fault Error
Six longwords describe the error:
Longword | Contents |
---|---|
R4 | Contents of R4 at the time of the bugcheck. |
R5 | Contents of R5 at the time of the bugcheck. |
Reason mask | Longword mask. If bit 0 of this longword is set, the failing instruction (at the PC saved below) caused a length violation. If bit 1 is set, it referred to a location whose page table entry is in a "no access" page. Bit 2 indicates the type of access used by the failing instruction: it is set for write and modify operations and clear for read operations. |
Virtual address | Virtual address being referenced by the instruction that caused the page fault. |
PC | PC containing the address of the instruction that caused the page fault. |
PSL | PSL at the time of the page fault. |
If the operating system detects a page fault while the IPL is higher
than IPL$_ASTDEL, you can obtain the address of the instruction that
caused the fault by examining the PC pushed onto the current operating
stack. Follow the steps outlined in Section 9.3 to determine which
module issued the instruction.
9 A Sample System Failure
This section steps through the analysis of a system failure using, as an example, a printer driver. Three events lead up to this failure:
The following sections describe the actions to take in investigating
the causes of this system crash.
9.1 Identifying the Bugcheck
First, invoke SDA to analyze the system dump file. The initialization message indicates the type of bugcheck that occurred as follows:
    Dump taken on 31-JAN-1993 16:34:31.23
    INVEXCEPTN, Exception while above ASTDEL or on interrupt stack
    SDA>
An exception occurred that caused the system to signal a bugcheck, and
signal and mechanism arrays have been created on the current operating
stack.
9.2 Identifying the Exception
Use the SHOW STACK command to display the current operating stack. In this case, it is the interrupt stack. The following example shows the interrupt stack and the signal and mechanism arrays. See the SHOW STACK command for a complete description of the format of the stack display.
    CPU 01 Processor stack
    ----------------------
    Current operating stack (INTERRUPT)

              8006A378  8000844B    ACP$WRITEBLK+0A0
                 .
                 .
                 .
     SP =>    8006A398  7FFDC340
              8006A39C  8006A3A0
              8006A3A0  80004E7D    EXE$REFLECT+0D4
              8006A3A4  04080009
              8006A3A8  00000004
              8006A3AC  7FFDC368
              8006A3B0  FFFFFFFD
              8006A3B4  8001774E
              8006A3B8  0000074F
              8006A3BC  00000001
              8006A3C0  00000005
              8006A3C4  0000000C
              8006A3C8  00000000
              8006A3CC  80069E00
              8006A3D0  8005D003
              8006A3D4  04080000
              8006A3D8  80009604    EXE$FORKDSPTH+01C
                 .
                 .
                 .
The mechanism array begins at address 8006A3A8 and ends at address 8006A3B8 (all values here are hexadecimal). Its first longword contains 00000004. The signal array begins at address 8006A3C0 and ends at 8006A3D4. Its first longword contains 00000005 and its second longword contains 0000000C. Examination of the signal array shows the following:
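The array boundaries quoted above can be cross-checked mechanically. The following Python sketch copies the longwords from the SHOW STACK display into a dictionary and searches for the two arrays. The search rules are heuristics chosen for this example (a count of 5 followed by exception code 0C, and the nearest preceding count of 4), not something SDA itself does, and the function names are hypothetical.

```python
# Longwords copied from the SHOW STACK display above (address -> contents).
STACK = {
    0x8006A398: 0x7FFDC340,
    0x8006A39C: 0x8006A3A0,
    0x8006A3A0: 0x80004E7D,
    0x8006A3A4: 0x04080009,
    0x8006A3A8: 0x00000004,
    0x8006A3AC: 0x7FFDC368,
    0x8006A3B0: 0xFFFFFFFD,
    0x8006A3B4: 0x8001774E,
    0x8006A3B8: 0x0000074F,
    0x8006A3BC: 0x00000001,
    0x8006A3C0: 0x00000005,
    0x8006A3C4: 0x0000000C,
    0x8006A3C8: 0x00000000,
    0x8006A3CC: 0x80069E00,
    0x8006A3D0: 0x8005D003,
    0x8006A3D4: 0x04080000,
    0x8006A3D8: 0x80009604,
}


def find_signal_array(stack):
    """Heuristic: first address holding 5 whose next longword is 0C."""
    for addr in sorted(stack):
        if stack[addr] == 0x5 and stack.get(addr + 4) == 0x0C:
            return addr
    return None


def find_mechanism_array(stack, signal_addr):
    """Heuristic: nearest lower address holding 4 with room for five longwords."""
    for addr in sorted(stack, reverse=True):
        if addr + 5 * 4 <= signal_addr and stack[addr] == 0x4:
            return addr
    return None


signal = find_signal_array(STACK)
mechanism = find_mechanism_array(STACK, signal)
# Each range runs from the count longword to the last longword of the array.
print(f"mechanism array: {mechanism:08X}..{mechanism + 4 * 4:08X}")
print(f"signal array:    {signal:08X}..{signal + 5 * 4:08X}")
```

Run against this dump, the sketch reports the same two address ranges given in the text. On a different dump the heuristics could match unrelated data, so the result should always be checked against the stack display by eye.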
Issuing the SDA command EVALUATE/PSL 04080000 makes the following information apparent:
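The same fields can be recovered by hand from the PSL longword. The following Python sketch assumes the standard VAX PSL layout (IPL in bits 16 through 20, previous mode in bits 22 and 23, current mode in bits 24 and 25, and the interrupt-stack flag in bit 26), which this section does not itself spell out. The function name and dictionary keys are made up for the example, and the output is not in SDA's display format.

```python
MODES = ("KERNEL", "EXECUTIVE", "SUPERVISOR", "USER")


def decode_psl(psl):
    """Extract the fields most useful in crash analysis from a VAX PSL."""
    return {
        "ipl": (psl >> 16) & 0x1F,                    # bits 16-20
        "previous_mode": MODES[(psl >> 22) & 0x3],    # bits 22-23
        "current_mode": MODES[(psl >> 24) & 0x3],     # bits 24-25
        "interrupt_stack": bool((psl >> 26) & 0x1),   # bit 26
    }


fields = decode_psl(0x04080000)
for name, value in fields.items():
    print(f"{name}: {value}")
```

For 04080000 this yields an IPL of 8, kernel mode for both the current and previous modes, and the interrupt-stack flag set. That agrees with the INVEXCEPTN bugcheck, which is signaled for exceptions above IPL$_ASTDEL or on the interrupt stack.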
Use the SHOW PAGE_TABLE command to display the system page table, as shown in the following example. The page containing location 80069E00 (hexadecimal) is not available to any access mode (a null page); thus, the virtual address is not valid.

    SDA> SHOW PAGE_TABLE
    System page table
    -----------------
    ADDRESS   SVAPTE    PTE       TYPE   PROT  BITS  PAGTYP  LOC      STATE  TYPE  REFCNT  BAK       SVAPTE    FLINK  BLINK
       .
       .
       .
    80068400  80777B08  7C40FFC8  STX    UR    K
    80068600  80777B0C  7C40FFC8  STX    UR    K
    80068800  80777B10  7C40FFC8  STX    UR    K
    80068A00  80777B14  7C40FFC8  STX    UR    K
    80068C00  80777B18  7C40FFC8  STX    UR    K
    80068E00  80777B1C  7C40FFC8  STX    UR    K
    80069000  80777B20  7C40FFC8  STX    UR    K
    80069200  80777B24  7C40FFC8  STX    UR    K
    80069400  80777B28  7C40FFC8  STX    UR    K
    80069600  80777B2C  7C40FFC8  STX    UR    K
    80069800  80777B30  7C40FFC8  STX    UR    K
    80069A00  80777B34  780016C9  TRANS  UR    K     SYSTEM  FREELST  00     01    0       0040FFC8  80777B34  03AF   0E15
    80069C00  80777B38  78000E15  TRANS  UR    K     SYSTEM  FREELST  00     01    0       0040FFC8  80777B38  16C9   2592
    -------- 40 NULL PAGES
       .
       .
       .
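To see why the failing address lands in the run of null pages, it helps to round it down to a page boundary and compare it with the ADDRESS column. The following Python sketch does that, assuming the 512-byte VAX page size implied by the 200 (hexadecimal) step between rows. The helper names are hypothetical.

```python
PAGE_SIZE = 0x200  # 512-byte VAX page; matches the 200 (hex) step in the ADDRESS column

# Base addresses of the pages listed in the display, 80068400 through 80069C00.
LISTED_PAGES = [0x80068400 + i * PAGE_SIZE for i in range(13)]


def page_base(virtual_address):
    """Round a virtual address down to the start of its page."""
    return virtual_address & ~(PAGE_SIZE - 1)


def is_listed(virtual_address):
    """True if the page holding the address has its own row in the display."""
    return page_base(virtual_address) in LISTED_PAGES


failing_va = 0x80069E00
print(f"page base: {page_base(failing_va):08X}, listed: {is_listed(failing_va)}")
```

The page base of the failing address is one page past the last listed entry (80069C00), so it is the first page of the null-page run reported by SDA.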
Because the printer went off line and then came back on line, as shown
on the console listing in Section 9.2, the problem might exist in the
driver code. SDA can help you determine which driver might contain the
faulty code.
9.3.1 Finding the Driver by Using the Program Counter
The first step in determining whether the failing instruction is within a driver is to examine the PC in the signal array using the EXAMINE/INSTRUCTION command. This has two results:
In the following example, the instruction that caused the exception is located within the printer driver.
    SDA> EXAMINE/INSTRUCTION 8005D003
    LPDRIVER+2B3         MOVB    (R3)+,(R0)

If SDA is unable to find a symbol within FFF (hexadecimal) bytes of the memory location you specify, it displays the location as an absolute address. This often, but not always, means the instruction that caused the exception is not part of a device driver.
To determine whether an instruction is part of a driver, use the SHOW DEVICE command to display the starting addresses and lengths of all the drivers in the system. If the address of the failing instruction falls within the range of addresses shown for a given driver, the failing instruction is a part of that driver. The following example shows a partial list of the drivers in the display generated by the SHOW DEVICE command.
    I/O data structures
    DDB list
    --------
    Address   Controller   ACP      Driver     DPT       DPT size
    -------   ----------   ---      ------     ---       --------
    80000ECC  HELIUM$DBA   F11XQP   DBDRIVER   800F7AD0  08FD
    80001040  OPA                   OPERATOR   80001622  0061
    8000126C  MBA                   MBDRIVER   800015B0  0578
    80001460  NLA                   NLDRIVER   800015E9  05A3
    801E2800  HELIUM$DMA   F11XQP   DMDRIVER   800B5CB0  0AA0
    801E2980  HELIUM$DLA   F11XQP   DLDRIVER   800B6A50  08D0
       .
       .
       .
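The range check described above is simple arithmetic on the DPT and DPT size columns. The following Python sketch applies it to the six drivers in the partial list. The function name is hypothetical, and the table holds only the rows shown, so a result of None means "not in this partial list" rather than "not in any driver".

```python
# (driver name, DPT address, DPT size) copied from the partial display above.
DRIVERS = [
    ("DBDRIVER", 0x800F7AD0, 0x08FD),
    ("OPERATOR", 0x80001622, 0x0061),
    ("MBDRIVER", 0x800015B0, 0x0578),
    ("NLDRIVER", 0x800015E9, 0x05A3),
    ("DMDRIVER", 0x800B5CB0, 0x0AA0),
    ("DLDRIVER", 0x800B6A50, 0x08D0),
]


def find_driver(address, drivers=DRIVERS):
    """Return (name, offset from DPT) for the first driver containing address."""
    for name, dpt, size in drivers:
        if dpt <= address < dpt + size:
            return name, address - dpt
    return None


# 800B5D00 is an arbitrary address chosen for illustration.
print(find_driver(0x800B5D00))
# The failing PC from the sample crash; LPDRIVER is not in the partial list.
print(find_driver(0x8005D003))
```

The first call reports DMDRIVER with an offset of 50 (hexadecimal, 80 decimal) from its DPT. The second returns None because the printer driver's row falls in the elided part of the display.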
The offsets that SDA displays from nnDRIVER are actually offsets from the DPT. As such, these offsets do not exactly correspond to the offsets shown in driver listings, which represent offsets from the beginning of the program section (PSECT) in which a given instruction appears. Because a driver usually contains more than one PSECT, you must use the driver's map to determine the location of the failing instruction within the driver listing.
To calculate the location of the instruction within the driver listing, refer to the "Program Section Synopsis" section of the driver's map. Determine in which PSECT the offset given by SDA occurs and subtract the base of the PSECT from the offset. You can then use the resulting figure as an index into the driver listing.
If SDA does not display the address as an offset from
nnDRIVER, but the address is within the address range
of a driver in the SHOW DEVICE display, you must first subtract the
address of the DPT from the failing address. Using the result as the
offset, you can then follow the steps previously outlined for
determining the index of the instruction into a driver listing.
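Both subtractions can be sketched in a few lines of Python. The PSECT names, bases, and end offsets below are hypothetical stand-ins for the "Program Section Synopsis" of a real driver map, and the function names are made up for the example. The DPT address 8005CD50 is derived from the sample display, where LPDRIVER+2B3 corresponds to address 8005D003.

```python
# Hypothetical PSECT table: (name, base, end), as offsets from the DPT.
# Real values come from the "Program Section Synopsis" of the driver's map.
PSECTS = [
    ("PROLOGUE", 0x000, 0x03F),
    ("DATA", 0x040, 0x0FF),
    ("CODE", 0x100, 0x4FF),
]


def offset_from_address(address, dpt_address):
    """Convert an absolute address into an offset from the driver's DPT."""
    return address - dpt_address


def listing_index(offset_from_dpt, psects):
    """Return (PSECT name, offset within that PSECT) for a DPT-relative offset."""
    for name, base, end in psects:
        if base <= offset_from_dpt <= end:
            return name, offset_from_dpt - base
    raise ValueError("offset is outside every PSECT in the map")


offset = offset_from_address(0x8005D003, 0x8005CD50)
name, index = listing_index(offset, PSECTS)
print(f"offset from DPT: {offset:X}, PSECT: {name}, listing offset: {index:X}")
```

With this made-up map, offset 2B3 falls in the PSECT that starts at 100, giving a listing offset of 1B3. With a real map the same two subtractions apply, but the PSECT bases will differ.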
9.4 Finding the Problem Within the Routine
To find the problem within the routine, examine the printer's driver code. In the system failure discussed in this example, the instruction that caused the exception is MOVB (R3)+,(R0). To check the contents of R3, use the EXAMINE command as follows:
    SDA> EXAMINE R3
    R3: 80069E00 "...."
The invalid virtual address, as recorded in the signal array, is stored in R3. In the following driver code excerpt, the instruction in question appears at line 599. It is likely that the contents of R3 have been incremented too many times.
    581 STARTIO:
    582         MOVL    UCB$L_IRP(R5),R3             ;Retrieve address of I/O packet
    583         MOVW    IRP$L_MEDIA+2(R3),-
    584                 UCB$W_BOFF(R5)               ;Set number of characters to print
    585         MOVL    UCB$L_SVAPTE(R5),R3          ;Get address of system buffer
    586         MOVAB   12(R3),R3                    ;Get address of data area
    587         MOVL    UCB$L_CRB(R5),R4             ;Get address of CRB
    588         MOVL    @CRB$L_INTD+VEC$L_IDB(R4),R4 ;Get device CSR address
    589 ;
    590 ; START NEXT OUTPUT SEQUENCE
    591 ;
    592
    593 10$:    ADDL3   #LP_DBR,R4,R0                ;Calculate address of data buffer register
    594         MOVZWL  UCB$W_BOFF(R5),R1            ;Get number of characters remaining
    595         MOVW    #^X8080,R2                   ;Get control register test mask
    596         BRB     25$                          ;Start output
    597 20$:    BITW    R2,(R4)                  (1) ;Printer ready or have paper problem?
    598         BLEQ    30$                          ;If LEQ not ready or paper problem
    599         MOVB    (R3)+,(R0)               (2) ;Output next character
    600         ASHL    #1,G^EXE$GL_UBDELAY,-(SP)    ;Delay 3*2 u-seconds
    601 24$:    SOBGEQ  (SP),24$                     ;Delay loop calibrated to machine speed
    602         ADDL    #4,SP                        ;Pop extra longword off stack
    603 25$:    SOBGEQ  R1,20$                   (3) ;Any more characters to output?
    604         BRW     70$                          ;All done, BRW to set return status
Explanations of the circled numbers in the example are in Section 9.4.1.
Copyright © Compaq Computer Corporation 1998. All rights reserved.