Updated: 11 December 1998

OpenVMS VAX System Dump Analyzer Utility Manual



8 Investigating System Failures

This section discusses how the operating system handles internal errors and suggests procedures that can aid you in determining the causes of these errors. To conclude, it illustrates, through detailed analysis of a sample system failure, how SDA helps you find the causes of operating system problems.

For a complete description of the commands discussed in the sections that follow, refer to the SDA Commands section.

8.1 General Procedure for Analyzing System Failures

When the operating system detects an internal error so severe that normal operation cannot continue, it signals a condition known as a fatal bugcheck and shuts itself down. A specific bugcheck code describes each such error.

To resolve the problem, you must find the reason for the bugcheck. Most failures are caused by errors in user-written device drivers or other privileged code not supplied by Digital. To identify and correct these errors, you need a listing of the code in question.

Occasionally, a system failure is the result of a hardware failure or an error in code supplied by Digital. A hardware failure requires the attention of Digital Services. To diagnose an error in code supplied by Digital, you need listings of that code, which are available from Digital on CD-ROM.

Following are the steps you can take to diagnose an error:

  1. Start the search for the error by locating the line of code that signaled the bugcheck. Invoke SDA and use the SHOW CRASH command to display the contents of the program counter (PC). The PC contains the address of the instruction immediately following the instruction that signaled the bugcheck.
  2. Use the SHOW STACK command to display the contents of the stack. The PC often contains an address in the exception handler. Such an address identifies the instruction that signaled the bugcheck, not the instruction that caused it. In this case, the address of the instruction that caused the bugcheck is located on the stack. See Section 8.2 for information about how to proceed for several types of bugchecks.
  3. Once you have found the address of the instruction that caused the bugcheck, you need to find the module in which the failing instruction resides. Use the SHOW DEVICE command to determine whether the instruction is part of a device driver.
  4. To determine the general cause of the system failure, examine the code that signaled the bugcheck.

8.2 Fatal Bugcheck Conditions

Several conditions result in a bugcheck, although bugchecks are normally rare. When one does occur, it is most likely the result of a fatal exception or an illegal page fault within privileged code. This section describes the symptoms of these bugchecks. A discussion of other exceptions and condition handling in general appears in the OpenVMS System Services Reference Manual.

8.2.1 Fatal Exceptions

An exception is fatal when it occurs while the following conditions exist:

When the system fails, the operating system reports the approximate cause of the failure on the console terminal. SDA displays a similar message when you issue a SHOW CRASH command. For instance, for a fatal exception, SDA can display one of these messages:


FATALEXCPT, Fatal executive or kernel mode exception 
 
INVEXCEPTN, Exception while above ASTDEL or on interrupt stack 
 
SSRVEXCEPT, Unexpected system service exception 

Although several exception conditions are possible, access violations are the most common. When the hardware detects an access violation, information useful in finding the cause of the violation is pushed onto either the kernel stack or the interrupt stack. If the access violation occurs when the hardware is using the interrupt stack, this information appears on the interrupt stack.

The INVEXCEPTN, SSRVEXCEPT, and FATALEXCPT bugchecks place two argument lists, known as the mechanism and signal arrays, on the stack.

The SSRVEXCEPT and FATALEXCPT bugchecks push an additional argument list onto the stack above these arrays; INVEXCEPTN does not. This pointer array (see Figure SDA-1) contains the number 2 in its first longword, indicating that the following two longwords complete the array. The second longword contains the stack address of the signal array; the third contains the stack address of the mechanism array.

Figure SDA-1 Pointer Argument List on the Stack


The first longword of the mechanism array (see Figure SDA-2) contains a 4, indicating that the four subsequent longwords complete the array. These four longwords are used by the procedures that search for a condition handler and report exceptions.

Figure SDA-2 Mechanism Array


The values in the mechanism array are the following:

Value       Meaning
00000004    Number of longwords that follow. In a mechanism array, this value is always 4.
Frame       Address of the FP (frame pointer) of the establisher's call frame.
Depth       Depth of the search for a condition handler.
R0          Contents of R0 at the time of the exception.
R1          Contents of R1 at the time of the exception.
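
As a minimal sketch of the layout just described, the following Python fragment labels the five longwords of a mechanism array. The names `MechanismArray` and `parse_mechanism_array` are illustrative only and are not part of SDA. The sample values are the longwords that appear at addresses 8006A3A8 through 8006A3B8 in the stack display in Section 9.2.

```python
from typing import NamedTuple, Sequence


class MechanismArray(NamedTuple):
    frame: int  # FP of the establisher's call frame
    depth: int  # depth of the search for a condition handler
    r0: int     # contents of R0 at the time of the exception
    r1: int     # contents of R1 at the time of the exception


def parse_mechanism_array(longwords: Sequence[int]) -> MechanismArray:
    """Label the longwords of a mechanism array read from the stack."""
    count = longwords[0]
    if count != 4:
        raise ValueError(f"not a mechanism array: first longword is {count:08X}")
    frame, depth, r0, r1 = longwords[1:5]
    return MechanismArray(frame, depth, r0, r1)


# Values copied from the sample stack display (hypothetical helper, real data).
mech = parse_mechanism_array(
    [0x00000004, 0x7FFDC368, 0xFFFFFFFD, 0x8001774E, 0x0000074F]
)
print(f"Frame {mech.frame:08X}  Depth {mech.depth:08X}  "
      f"R0 {mech.r0:08X}  R1 {mech.r1:08X}")
```

The check on the first longword is a cheap way to confirm that you have located the right place on the stack before trusting the fields that follow.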

The signal array (see Figure SDA-3) appears somewhat further down the stack. A signal array contains the exception code, zero or more exception parameters, the PC, and the PSL. The size of a signal array can thus vary from exception to exception.

Figure SDA-3 Signal Array


For access violations, the signal array is set up as follows:

Value              Meaning
00000005           Number of longwords that follow. For access violations, this value is always 5.
0000000C           Exception code. The value 0C₁₆ represents an access violation. You can identify the exception code by using the SDA command EVALUATE/CONDITION.
Reason mask        Longword mask. If bit 0 of this longword is set, the failing instruction (at the PC saved below) caused a length violation. If bit 1 is set, it referred to a location whose page table entry is in a "no access" page. Bit 2 indicates the type of access used by the failing instruction: it is set for write and modify operations and clear for read operations.
Virtual address    Virtual address that the failing instruction tried to reference.
PC                 PC whose execution resulted in the exception.
PSL                PSL at the time of the exception.
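
The bit tests on the reason mask can be sketched in a few lines of Python. This is an illustration of the table above, not SDA functionality; the function names and the dictionary keys are invented for the example. The sample signal array is the one shown in the stack display in Section 9.2.

```python
def decode_reason_mask(mask: int) -> dict:
    """Interpret the three low bits of an access-violation reason mask."""
    return {
        "length_violation": bool(mask & 0x1),                # bit 0
        "pte_no_access": bool(mask & 0x2),                   # bit 1
        "access": "write/modify" if mask & 0x4 else "read",  # bit 2
    }


def decode_access_violation(signal) -> dict:
    """Label the six longwords of an access-violation signal array."""
    count, code, mask, va, pc, psl = signal
    if count != 5 or code != 0x0C:
        raise ValueError("not an access-violation signal array")
    return {
        "reason": decode_reason_mask(mask),
        "virtual_address": va,
        "pc": pc,
        "psl": psl,
    }


# Signal array copied from the sample stack display.
info = decode_access_violation(
    [0x00000005, 0x0000000C, 0x00000000, 0x80069E00, 0x8005D003, 0x04080000]
)
print(info["reason"])
print(f"VA {info['virtual_address']:08X}  PC {info['pc']:08X}  PSL {info['psl']:08X}")
```

For the sample failure, all three mask bits are clear, so the failing reference was a read.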

In the case of a fatal exception, you can find the code that signaled it by examining the PC in the signal array. Use the SHOW STACK command to display the stack in use when the failure occurred and then locate the mechanism and signal arrays. Once you obtain the PC, which points to the instruction that signaled the exception, you can identify the module where the instruction is located by following the instructions in Section 9.3.

8.2.2 Illegal Page Faults

A PGFIPLHI bugcheck occurs when a page fault occurs while the interrupt priority level (IPL) is greater than 2 (IPL$_ASTDEL). When the system fails because of an illegal page fault, the following message appears on the console terminal:


PGFIPLHI, page fault with IPL too high 

When an illegal page fault occurs, the stack appears as shown in Figure SDA-4.

Figure SDA-4 Stack Following an Illegal Page-Fault Error


Six longwords describe the error:

Longword           Contents
R4                 Contents of R4 at the time of the bugcheck.
R5                 Contents of R5 at the time of the bugcheck.
Reason mask        Longword mask. If bit 0 of this longword is set, the failing instruction (at the PC saved below) caused a length violation. If bit 1 is set, it referred to a location whose page table entry is in a "no access" page. Bit 2 indicates the type of access used by the failing instruction: it is set for write and modify operations and clear for read operations.
Virtual address    Virtual address being referenced by the instruction that caused the page fault.
PC                 PC containing the address of the instruction that caused the page fault.
PSL                PSL at the time of the page fault.

If the operating system detects a page fault while the IPL is higher than IPL$_ASTDEL, you can obtain the address of the instruction that caused the fault by examining the PC pushed onto the current operating stack. Follow the steps outlined in Section 9.3 to determine which module issued the instruction.

9 A Sample System Failure

This section steps through the analysis of a system failure using, as an example, a printer driver. Three events lead up to this failure:

  1. The line printer goes off line for 3 hours.
  2. The line printer comes back on line.
  3. The operating system signals a bugcheck, writes information to the system dump file, and shuts itself down.

The following sections describe the actions to take in investigating the causes of this system crash.

9.1 Identifying the Bugcheck

First, invoke SDA to analyze the system dump file. The initialization message indicates the type of bugcheck that occurred as follows:


 
Dump taken on 31-JAN-1993 16:34:31.23 
INVEXCEPTN, Exception while above ASTDEL or on interrupt stack 
 
SDA> 

An exception occurred that caused the system to signal a bugcheck, and signal and mechanism arrays have been created on the current operating stack.

9.2 Identifying the Exception

Use the SHOW STACK command to display the current operating stack. In this case, it is the interrupt stack. The following example shows the interrupt stack and the signal and mechanism arrays. See the SHOW STACK command for a complete description of the format of the stack display.


CPU 01 Processor stack 
---------------------- 
Current operating stack (INTERRUPT) 
 
        8006A378    8000844B    ACP$WRITEBLK+0A0 
   .
   .
   .
  SP => 8006A398    7FFDC340 
        8006A39C    8006A3A0 
        8006A3A0    80004E7D    EXE$REFLECT+0D4 
        8006A3A4    04080009 
        8006A3A8    00000004 
        8006A3AC    7FFDC368 
        8006A3B0    FFFFFFFD 
        8006A3B4    8001774E 
        8006A3B8    0000074F 
        8006A3BC    00000001 
        8006A3C0    00000005 
        8006A3C4    0000000C 
        8006A3C8    00000000 
        8006A3CC    80069E00 
        8006A3D0    8005D003 
        8006A3D4    04080000 
        8006A3D8    80009604    EXE$FORKDSPTH+01C 
   .
   .
   .

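The mechanism array begins at address 8006A3A8₁₆ and ends at address 8006A3B8₁₆. Its first longword contains 00000004₁₆. The signal array begins at address 8006A3C0₁₆ and ends at 8006A3D4₁₆. Its first longword contains 00000005₁₆ and its second longword contains 0000000C₁₆. Examination of the signal array shows the following:

Value       Meaning
00000005    Argument count: five longwords follow.
0000000C    Exception code: access violation.
00000000    Reason mask: no bits are set. Because bit 2 is clear, the failing access was a read.
80069E00    Virtual address that the failing instruction tried to reference.
8005D003    PC of the instruction whose execution resulted in the exception.
04080000    PSL at the time of the exception.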
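
The array boundaries quoted above follow from the count in the first longword of each array: an array that starts at address A with a count of N ends at A + 4N. The following Python sketch checks that arithmetic against the stack display. The `STACK` dictionary and the `array_bounds` helper are illustrative names, not SDA features; the values are copied from the display.

```python
# Longwords copied from the sample stack display (address -> contents).
STACK = {
    0x8006A3A8: 0x00000004, 0x8006A3AC: 0x7FFDC368, 0x8006A3B0: 0xFFFFFFFD,
    0x8006A3B4: 0x8001774E, 0x8006A3B8: 0x0000074F, 0x8006A3BC: 0x00000001,
    0x8006A3C0: 0x00000005, 0x8006A3C4: 0x0000000C, 0x8006A3C8: 0x00000000,
    0x8006A3CC: 0x80069E00, 0x8006A3D0: 0x8005D003, 0x8006A3D4: 0x04080000,
}


def array_bounds(stack: dict, start: int) -> tuple:
    """Return (first, last) longword addresses of a counted argument list."""
    count = stack[start]
    return start, start + 4 * count


mech_start, mech_end = array_bounds(STACK, 0x8006A3A8)
sig_start, sig_end = array_bounds(STACK, 0x8006A3C0)
print(f"mechanism array {mech_start:08X}..{mech_end:08X}")
print(f"signal array    {sig_start:08X}..{sig_end:08X}")

# The last three longwords of an access-violation signal array
# are the virtual address, the PC, and the PSL.
va, pc, psl = STACK[sig_end - 8], STACK[sig_end - 4], STACK[sig_end]
print(f"VA {va:08X}  PC {pc:08X}  PSL {psl:08X}")
```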

Issuing the SDA command EVALUATE/PSL 04080000 makes the following information apparent:
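
The fields of interest can also be extracted by hand. The following is a minimal Python sketch, assuming the standard VAX PSL layout (IPL in bits 20:16, previous mode in bits 23:22, current mode in bits 25:24, and the interrupt-stack flag in bit 26). The function name `decode_psl` and the dictionary keys are invented for the example and do not reproduce the format of the EVALUATE/PSL display.

```python
MODES = ("KERNEL", "EXEC", "SUPER", "USER")


def decode_psl(psl: int) -> dict:
    """Extract the mode, stack, and IPL fields from a VAX PSL value."""
    return {
        "ipl": (psl >> 16) & 0x1F,                   # bits 20:16
        "previous_mode": MODES[(psl >> 22) & 0x3],   # bits 23:22
        "current_mode": MODES[(psl >> 24) & 0x3],    # bits 25:24
        "interrupt_stack": bool((psl >> 26) & 0x1),  # bit 26
    }


fields = decode_psl(0x04080000)
print(fields)
```

For the PSL in the sample signal array, the sketch reports kernel mode, the interrupt stack in use, and IPL 8, which is consistent with the INVEXCEPTN message (exception while above ASTDEL or on interrupt stack).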

Use the SHOW PAGE_TABLE command to display the system page table, as shown in the following example. The page containing location 80069E00₁₆ is not available to any access mode (a null page); thus, the virtual address is not valid.


SDA> SHOW PAGE_TABLE
 
 
System page table
-----------------
  
ADDRESS   SVAPTE   PTE       TYPE  PROT  BITS PAGTYP  LOC STATE TYPE REFCNT   BAK       SVAPTE  FLINK  BLINK
   .
   .
   .
80068400  80777B08 7C40FFC8  STX   UR       K
80068600  80777B0C 7C40FFC8  STX   UR       K
80068800  80777B10 7C40FFC8  STX   UR       K
80068A00  80777B14 7C40FFC8  STX   UR       K
80068C00  80777B18 7C40FFC8  STX   UR       K
80068E00  80777B1C 7C40FFC8  STX   UR       K
80069000  80777B20 7C40FFC8  STX   UR       K
80069200  80777B24 7C40FFC8  STX   UR       K
80069400  80777B28 7C40FFC8  STX   UR       K
80069600  80777B2C 7C40FFC8  STX   UR       K
80069800  80777B30 7C40FFC8  STX   UR       K
80069A00  80777B34 780016C9  TRANS UR       K SYSTEM FREELST 00   01    0   0040FFC8   80777B34  03AF  0E15
80069C00  80777B38 78000E15  TRANS UR       K SYSTEM FREELST 00   01    0   0040FFC8   80777B38  16C9  2592
-------- 40 NULL PAGES
   .
   .
   .
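
To see where the failing address falls relative to the rows in this display, you can round it down to a page boundary. The following Python sketch assumes 200₁₆-byte pages and 4-byte page table entries, both inferred from the spacing of the ADDRESS and SVAPTE columns above. The helper names are illustrative only.

```python
PAGE_SIZE = 0x200  # bytes per page, inferred from the step between rows
PTE_SIZE = 4       # bytes per page table entry, inferred from the SVAPTE column

# Last row shown before the null pages in the sample display.
LAST_ADDRESS = 0x80069C00
LAST_SVAPTE = 0x80777B38


def page_base(va: int) -> int:
    """Round a virtual address down to the start of its page."""
    return va & ~(PAGE_SIZE - 1)


def pages_past_last_row(va: int) -> int:
    """Number of pages between the last listed row and the page holding va."""
    return (page_base(va) - LAST_ADDRESS) // PAGE_SIZE


def svapte_for(va: int) -> int:
    """Extrapolate the SVAPTE of the page holding va from the last listed row."""
    return LAST_SVAPTE + pages_past_last_row(va) * PTE_SIZE


failing_va = 0x80069E00
print(f"page base {page_base(failing_va):08X}")
print(f"pages past last listed row: {pages_past_last_row(failing_va)}")
print(f"SVAPTE {svapte_for(failing_va):08X}")
```

The failing address is the first byte of the page immediately after the last listed row, that is, the first of the null pages.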

9.3 Locating the Source of the Exception

Because the printer went off line and then came back on line, as shown on the console listing in Section 9.2, the problem might exist in the driver code. SDA can help you determine which driver might contain the faulty code.

9.3.1 Finding the Driver by Using the Program Counter

The first step in determining whether the failing instruction is within a driver is to examine the PC in the signal array using the EXAMINE/INSTRUCTION command. This has two results:

In the following example, the instruction that caused the exception is located within the printer driver.


SDA> EXAMINE/INSTRUCTION 8005D003
LPDRIVER+2B3   MOVB    (R3)+,(R0)

If SDA is unable to find a symbol within FFF₁₆ bytes of the memory location you specify, it displays the location as an absolute address. This often, but not always, means the instruction that caused the exception is not part of a device driver.

To determine whether an instruction is part of a driver, use the SHOW DEVICE command to display the starting addresses and lengths of all the drivers in the system. If the address of the failing instruction falls within the range of addresses shown for a given driver, the failing instruction is a part of that driver. The following example shows a partial list of the drivers in the display generated by the SHOW DEVICE command.


I/O data structures 
 
                           DDB list 
                           -------- 
 
    Address    Controller     ACP       Driver      DPT   DPT size 
    -------    ----------     ---       ------      ---   -------- 
 
    80000ECC    HELIUM$DBA    F11XQP    DBDRIVER   800F7AD0  08FD 
    80001040    OPA                     OPERATOR   80001622  0061 
    8000126C    MBA                     MBDRIVER   800015B0  0578 
    80001460    NLA                     NLDRIVER   800015E9  05A3 
    801E2800    HELIUM$DMA    F11XQP    DMDRIVER   800B5CB0  0AA0 
    801E2980    HELIUM$DLA    F11XQP    DLDRIVER   800B6A50  08D0 
   .
   .
   .
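
The range check described above can be sketched as follows. The Python fragment treats each driver as occupying the addresses from its DPT up to, but not including, DPT plus DPT size, which is an assumption about how the two columns combine. The `find_drivers` helper is illustrative, and the address 800B5D00 is a made-up test value. Only the rows visible in the partial display are included, so the failing PC from the example (8005D003, in LPDRIVER) matches none of them.

```python
# (driver, DPT address, DPT size) rows copied from the partial display above.
DRIVERS = [
    ("DBDRIVER", 0x800F7AD0, 0x08FD),
    ("OPERATOR", 0x80001622, 0x0061),
    ("MBDRIVER", 0x800015B0, 0x0578),
    ("NLDRIVER", 0x800015E9, 0x05A3),
    ("DMDRIVER", 0x800B5CB0, 0x0AA0),
    ("DLDRIVER", 0x800B6A50, 0x08D0),
]


def find_drivers(address: int, drivers=DRIVERS) -> list:
    """Return (driver, offset from DPT) for every driver whose range holds address."""
    return [
        (name, address - dpt)
        for name, dpt, size in drivers
        if dpt <= address < dpt + size
    ]


# Hypothetical address chosen to fall inside DMDRIVER.
for name, offset in find_drivers(0x800B5D00):
    print(f"{name}+{offset:X}")

# The failing PC from the example is not in the partial list shown.
print(find_drivers(0x8005D003))
```

The helper returns a list rather than a single match because the ranges computed from some rows of the display overlap.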

9.3.2 Calculating the Offset into the Driver's Program Section

The offsets that SDA displays from nnDRIVER are actually offsets from the DPT. As such, these offsets do not exactly correspond to the offsets shown in driver listings, which represent offsets from the beginning of the program section (PSECT) in which a given instruction appears. Because a driver usually contains more than one PSECT, you must use the driver's map to determine the location of the failing instruction within the driver listing.

To calculate the location of the instruction within the driver listing, refer to the "Program Section Synopsis" section of the driver's map. Determine in which PSECT the offset given by SDA occurs and subtract the base of the PSECT from the offset. You can then use the resulting figure as an index into the driver listing.

If SDA does not display the address as an offset from nnDRIVER, but the address is within the address range of a driver in the SHOW DEVICE display, you must first subtract the address of the DPT from the failing address. Using the result as the offset, you can then follow the steps previously outlined for determining the index of the instruction into a driver listing.
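
The two subtractions can be sketched in Python as follows. The PSECT bases and lengths in `PSECTS` are invented for illustration; take the real values from the "Program Section Synopsis" of your driver's map. The DPT address 8005CD50 is not shown in any display in this section; it is derived here from the example in Section 9.3.1, where LPDRIVER+2B3 corresponds to address 8005D003.

```python
# Hypothetical PSECT synopsis: (name, base offset from DPT, length).
PSECTS = [
    ("$$$105_PROLOGUE", 0x000, 0x040),
    ("$$$110_DATA",     0x040, 0x0C0),
    ("$$$115_DRIVER",   0x100, 0x400),
]


def listing_offset(dpt_offset: int, psects=PSECTS) -> tuple:
    """Convert an offset from the DPT into (PSECT name, offset within PSECT)."""
    for name, base, length in psects:
        if base <= dpt_offset < base + length:
            return name, dpt_offset - base
    raise ValueError(f"offset {dpt_offset:X} is outside every PSECT")


# Case 1: SDA displays the address as LPDRIVER+2B3.
print(listing_offset(0x2B3))

# Case 2: SDA displays only an absolute address, so subtract the DPT first.
dpt_address = 0x8005D003 - 0x2B3   # derived from the example: 8005CD50
failing_address = 0x8005D003
print(listing_offset(failing_address - dpt_address))
```

With the invented map, both cases yield the same PSECT and the same offset into the listing, as expected.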

9.4 Finding the Problem Within the Routine

To find the problem within the routine, examine the printer's driver code. In the system failure discussed in this example, the instruction that caused the exception is MOVB (R3)+,(R0). To check the contents of R3, use the EXAMINE command as follows:


SDA> EXAMINE R3
R3: 80069E00 "...."

The invalid virtual address, as recorded in the signal array, is stored in R3. In the following driver code excerpt, the instruction in question appears at line 599. It is likely that the contents of R3 have been incremented too many times.


581 STARTIO: 
582      MOVL    UCB$L_IRP(R5),R3     ;Retrieve address of I/O packet 
583      MOVW    IRP$L_MEDIA+2(R3),- 
584              UCB$W_BOFF(R5)       ;Set number of characters to print 
585      MOVL    UCB$L_SVAPTE(R5),R3  ;Get address of system buffer 
586      MOVAB   12(R3),R3            ;Get address of data area 
587      MOVL    UCB$L_CRB(R5),R4     ;Get address of CRB 
588      MOVL    @CRB$L_INTD+VEC$L_IDB(R4),R4 ;Get device CSR address 
589 ; 
590 ; START NEXT OUTPUT SEQUENCE 
591 ; 
592 
593 10$: ADDL3   #LP_DBR,R4,R0        ;Calculate address of data buffer register 
594      MOVZWL  UCB$W_BOFF(R5),R1    ;Get number of characters remaining 
595      MOVW    #^X8080,R2           ;Get control register test mask 
596      BRB     25$                  ;Start output 
597 20$: BITW    R2,(R4) (1)           ;Printer ready or have paper problem? 
598      BLEQ    30$                  ;If LEQ not ready or paper problem 
599      MOVB    (R3)+,(R0) (2)        ;Output next character 
600      ASHL    #1,G^EXE$GL_UBDELAY,-(SP)    ;Delay 3*2 u-seconds 
601 24$: SOBGEQ  (SP),24$             ;Delay loop calibrated to machine speed 
602      ADDL    #4,SP                ;Pop extra longword off stack 
603 25$: SOBGEQ  R1,20$ (3)            ;Any more characters to output? 
604      BRW     70$                  ;All done, BRW to set return status 

Explanations of the circled numbers in the example are in Section 9.4.1.



Copyright © Compaq Computer Corporation 1998. All rights reserved.
