Guide to DECthreads

Document revision date: 19 July 1999


3.6.4 Signaling a Condition Variable

When you are signaling a condition variable and that signal might cause the condition variable to be deleted, signal or broadcast the condition variable with the mutex locked.

The following C code fragment is executed by a releasing thread (thread A) to wake a blocked thread:

    pthread_mutex_lock (m);
    ... /* Change shared variables to allow another thread to proceed */
    predicate = TRUE;
    pthread_mutex_unlock (m);        (1)
    pthread_cond_signal (cv);        (2)

The following C code fragment is executed by a potentially blocking thread (thread B):

    pthread_mutex_lock (m);
    while (!predicate)
        pthread_cond_wait (cv, m);
    pthread_mutex_unlock (m);
    pthread_cond_destroy (cv);

(1) If thread B is allowed to run while thread A is at this point (after unlocking the mutex but before signaling), it finds the predicate true and continues without waiting on the condition variable. Thread B might then delete the condition variable with the pthread_cond_destroy() routine before thread A resumes execution.
(2) When thread A executes this statement, the condition variable no longer exists and the program fails.

These code fragments also demonstrate a race condition; that is, the routine, as coded, depends on a sequence of events among multiple threads, but does not enforce the desired sequence. Signaling the condition variable while still holding the associated mutex eliminates the race condition. Doing so prevents thread B from deleting the condition variable until after thread A has signaled it.

This problem can occur when the releasing thread is a worker thread and the waiting thread is a boss thread, and the last worker thread tells the boss thread to delete the variables that are being shared by boss and worker.

Code the signaling of a condition variable with the mutex locked as follows:

    pthread_mutex_lock (m);
    ... /* Change shared variables to allow some other thread to proceed */
    pthread_cond_signal (cv);
    pthread_mutex_unlock (m);
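The two sides of this exchange can be put together in a small, self-contained sketch. Only the POSIX pthread routines are real; the names `struct handoff`, `releasing_thread`, `run_handoff`, and the `CHECK` macro are assumptions made for illustration. The shared objects use static storage so that they outlive both threads.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical helper: stop on any unexpected status from a pthread routine. */
#define CHECK(call) \
    do { int status_ = (call); assert(status_ == 0); (void)status_; } while (0)

/* Hypothetical shared state, kept in static storage. */
static struct handoff {
    pthread_mutex_t m;
    pthread_cond_t  cv;
    int             predicate;
} h;

/* Thread A: change the predicate and signal while still holding the mutex. */
static void *releasing_thread(void *arg)
{
    (void)arg;
    CHECK(pthread_mutex_lock(&h.m));
    h.predicate = 1;
    CHECK(pthread_cond_signal(&h.cv));   /* signal first ... */
    CHECK(pthread_mutex_unlock(&h.m));   /* ... then unlock  */
    return NULL;
}

/* Thread B: wait for the predicate, then delete the condition variable.
 * Returns the status of pthread_cond_destroy(). */
int run_handoff(void)
{
    pthread_t a;
    int status;

    h.predicate = 0;
    CHECK(pthread_mutex_init(&h.m, NULL));
    CHECK(pthread_cond_init(&h.cv, NULL));
    CHECK(pthread_create(&a, NULL, releasing_thread, NULL));

    CHECK(pthread_mutex_lock(&h.m));
    while (!h.predicate)
        CHECK(pthread_cond_wait(&h.cv, &h.m));
    CHECK(pthread_mutex_unlock(&h.m));

    /* Thread B could not own the mutex again until thread A had already
     * signaled, so thread A no longer touches cv at this point. */
    status = pthread_cond_destroy(&h.cv);

    CHECK(pthread_join(a, NULL));        /* thread A is done with m as well */
    CHECK(pthread_mutex_destroy(&h.m));
    return status;
}
```

Calling `run_handoff()` from the initial thread plays the role of the boss thread; the mutex is destroyed only after the join, because thread A still uses it after signaling.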

3.6.5 Static Initialization Inappropriate for Stack-Based Synchronization Objects

Although it is acceptable to the compiler, it is inappropriate to use the following POSIX.1c standard macros to initialize DECthreads synchronization objects that are allocated on the stack:

PTHREAD_MUTEX_INITIALIZER
PTHREAD_COND_INITIALIZER
PTHREAD_RWLOCK_INITIALIZER

Each thread synchronization object is intended to be shared among the program's threads. If such an object is allocated on the stack, its address can asynchronously become invalid when the thread returns or otherwise terminates. For this reason, Compaq does not recommend allocating any thread synchronization object on the stack.

DECthreads detects some cases of misuse of static initialization of automatically allocated (stack-based) thread synchronization objects. For instance, if the thread on whose stack a statically initialized mutex is allocated attempts to access that mutex, the operation fails and returns [EINVAL]. If the application does not check status returns from DECthreads routines, this failure can remain unidentified. Further, if the operation was a call to pthread_mutex_lock(), the program can encounter a thread synchronization failure, which in turn can result in unexpected program behavior including memory corruption. (For performance reasons, DECthreads does not currently detect this error when a statically initialized mutex is accessed by a thread other than the one on whose stack the object was automatically allocated.)

If your application must allocate a thread synchronization object on the stack, the application must initialize the object before it is used by calling one of the routines pthread_mutex_init(), pthread_cond_init(), or pthread_rwlock_init(), as appropriate for the object. Your application must also destroy the thread synchronization object before it goes out of scope (for instance, due to the routine's returning control or raising an exception) by calling one of the routines pthread_mutex_destroy(), pthread_cond_destroy(), or pthread_rwlock_destroy(), as appropriate for the object.
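As a rough sketch of that rule, the routine below keeps a mutex in automatic storage, initializes it with pthread_mutex_init(), and destroys it before returning. The names `struct tally`, `add_hundred`, `tally_on_stack`, and `CHECK` are hypothetical. The join calls matter as much as the destroy call: no thread may still be using the object when the stack frame goes away.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical helper: stop on any unexpected status from a pthread routine. */
#define CHECK(call) \
    do { int status_ = (call); assert(status_ == 0); (void)status_; } while (0)

/* Hypothetical shared object that will live on one routine's stack. */
struct tally {
    pthread_mutex_t m;
    long            total;
};

static void *add_hundred(void *arg)
{
    struct tally *t = arg;
    int i;

    for (i = 0; i < 100; i++) {
        CHECK(pthread_mutex_lock(&t->m));
        t->total += 1;
        CHECK(pthread_mutex_unlock(&t->m));
    }
    return NULL;
}

long tally_on_stack(void)
{
    struct tally t;                      /* automatic (stack) storage */
    pthread_t worker[2];
    long result;
    int i;

    t.total = 0;
    /* Dynamic initialization, not PTHREAD_MUTEX_INITIALIZER. */
    CHECK(pthread_mutex_init(&t.m, NULL));

    for (i = 0; i < 2; i++)
        CHECK(pthread_create(&worker[i], NULL, add_hundred, &t));
    for (i = 0; i < 2; i++)
        CHECK(pthread_join(worker[i], NULL));   /* no thread outlives t */

    result = t.total;
    CHECK(pthread_mutex_destroy(&t.m));  /* before t goes out of scope */
    return result;
}
```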

3.7 Granularity Considerations

Granularity refers to the smallest unit of storage (that is, bytes, words, longwords, or quadwords) that a host computer can load or store in one machine instruction. Granularity considerations can affect the correctness of a program in which concurrent or asynchronous access can occur to data objects stored in the same memory granule. This can occur in a multithreaded program, where different threads access data objects in the same memory granule, or in a single-threaded program that has any of the following characteristics:

Accesses data objects in memory that is shared with other processes
Accesses data objects that can be accessed by asynchronous device drivers, signal handlers (on DIGITAL UNIX or Windows NT for Alpha), or ASTs (on OpenVMS)
Accesses data objects that can be accessed by a continuable exception handler

The subsections that follow explain the granularity concept, why it can affect the correctness of a multithreaded program, and techniques the programmer can use to prevent the granularity-related race condition known as word tearing.

3.7.1 Determinants of a Program's Granularity

A computer's processor typically makes available some set of granularities to programs, based on the processor's architecture, cache architecture, and instruction set. However, the computer's natural granularity also depends on the organization of the computer's memory and its bus architecture. For example, even if the processor "naturally" reads and writes 8-bit memory granules, a program's memory transfers may, in fact, occur in 32- or 64-bit memory granules.

On a computer that supports a set of granularities, the compiler determines a given program's actual granularity by the instructions it produces for the program to execute. For example, a given compiler on Alpha AXP systems might generate code that causes every memory access to load or store a quadword, regardless of the size of the data object specified in the application's source code. In this case, the application has a quadword actual granularity. For this application, 8-bit, 16-bit, and 32-bit writes are not atomic with respect to other memory operations that overlap the same 64-bit memory granule.

To provide a run-time environment for applications that is consistent and coherent, an operating system's services and libraries should be built so that they provide the same actual granularity. When this is the case, an operating system can be said to provide a system granularity to the applications that it hosts. (A system's system granularity is typically reflected in the default actual granularity that the system's compilers encode when producing an object file.)

When preparing to port a multithreaded application from one system to another, you should determine whether there is a difference in the system granularities between the source and target systems. If the target system has a larger system granularity than the source system, you should become informed about the programming techniques presented in the sections that follow.

3.7.1.1 Alpha AXP Processor Granularity

Systems based on the Alpha AXP processor family have a quadword (64-bit) natural granularity.

Versions EV4 and EV5 of the Alpha AXP processor family provide instructions for only longword- and quadword-length atomic memory accesses. Newer Alpha AXP processors (EV5.6 and later) support byte- and word-length atomic memory accesses as well as longword- and quadword-length atomic memory accesses.

Note

On systems using DIGITAL UNIX Version 4.0 and later:

If you use DEC C or DEC C++ to compile your application's modules on a system that uses the EV4 or EV5 version of the Alpha AXP processor, you can use the -arch56 compiler switch to cause the compiler to produce instructions available in the Alpha AXP processor version EV5.6 or later, including instructions for byte- and word-length atomic memory access, as needed.

When an application compiled with the -arch56 switch runs under DIGITAL UNIX Version 4.0 or later with a newer Alpha AXP processor (that is, EV5.6 or later), it utilizes that processor's full instruction set. When that same application runs under DIGITAL UNIX Version 4.0 or later with an older Alpha AXP processor (that is, EV4 or EV5), the operating system performs a software emulation of each instruction that is not available to the older processor.

See the DEC C and DEC C++ compiler documentation for more information about the -arch56 switch.

On DIGITAL UNIX systems, use the /usr/sbin/psrinfo -v command to determine the version(s) of your system's Alpha AXP processor(s).

3.7.1.2 VAX Processor Granularity

Systems based on the VAX processor family have longword (32-bit) natural granularity.

For more information about the granularity considerations of porting an application from an OpenVMS VAX system to an OpenVMS Alpha system, consult the document Migrating to an OpenVMS AXP System in the OpenVMS documentation set.

3.7.2 Compiler Support for Determining the Program's Actual Granularity

Table 3-1 summarizes the actual granularities that are provided by the respective compilers on the respective Compaq platforms.

**Table 3-1 Default and Optional Granularities**
Platform	Compiler	Default Granularity Setting	Optional Granularity Settings
DIGITAL UNIX Version 4.0D (Alpha only)	C/C++	quadword	None
OpenVMS Alpha Version 7.2	C/C++	quadword	byte, word, longword
OpenVMS VAX Version 7.2	C/C++	longword	byte, word
Windows NT Version 3.51 (Alpha only)	C/C++	longword	byte, word

For compilers that support an optional granularity setting, you can compile different modules in your application with different granularity settings. You might do so to avoid the possibility of a word-tearing race condition, as described below, or to improve the application's performance.

3.7.3 Word Tearing

In a multithreaded application, concurrent access by different threads to data objects that occupy the same memory granule can lead to a race condition known as word tearing. This situation occurs when two or more threads independently read the same granule of memory, update different portions of that granule, then independently (that is, asynchronously) store their respective copies of that granule. Because the order of the store operations is indeterminate, only the last thread to write the granule continues with a correct "view" of the granule's contents.
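The interleaving can be replayed deterministically in a single thread. The following sketch assumes a 64-bit granule and models each thread's private copy as a local variable; `insert_byte` and `torn_update` are hypothetical names, and no real concurrency is involved.

```c
#include <stdint.h>

/* Return a copy of a 64-bit granule with byte `index` replaced
 * (index 0 is the least significant byte, index must be 0..7). */
uint64_t insert_byte(uint64_t granule, unsigned index, uint8_t value)
{
    uint64_t mask = (uint64_t)0xFF << (8 * index);

    return (granule & ~mask) | ((uint64_t)value << (8 * index));
}

/* Replay the word-tearing order of events on one granule:
 * both threads load, each updates a different byte, both store. */
uint64_t torn_update(uint64_t memory)
{
    uint64_t copy_a = memory;                 /* thread A loads the granule */
    uint64_t copy_b = memory;                 /* thread B loads the granule */

    copy_a = insert_byte(copy_a, 0, 0xAA);    /* A updates byte 0 */
    copy_b = insert_byte(copy_b, 1, 0xBB);    /* B updates byte 1 */

    memory = copy_a;                          /* A stores its copy */
    memory = copy_b;                          /* B stores; A's byte 0 is lost */
    return memory;
}
```

Starting from a granule of zero, the result holds thread B's byte but not thread A's, whereas applying the two updates one after the other to the same copy keeps both.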

In a multithreaded program the potential for a word-tearing race condition exists only when both of the following conditions are met:

Two or more threads can concurrently write distinct data objects that occupy the same memory granule G, where G is a byte, word, longword, or quadword.
The application's actual granularity is sizeof(G) or larger.

For instance, given a multithreaded program that has been compiled to have longword actual granularity, if any two of the program's threads can concurrently update different bytes or words in the same longword, then that program is, in theory, at risk for encountering a word-tearing race condition. However, in practice, language-defined restrictions on the alignments of data objects limit the actual number of candidates for a word-tearing scenario, as described in the next section.
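One way to test the first condition for a pair of structure members is to compare the granules that their byte ranges touch. The helper below is a sketch, not a DECthreads facility: `share_granule` and `struct flags` are hypothetical, sizes must be nonzero, and the enclosing object is assumed to start on a granule boundary.

```c
#include <stddef.h>

/* Hypothetical structure whose two members are adjacent bytes. */
struct flags {
    char ready;
    char done;
};

/* Return nonzero if two members, each given by its offset and size in bytes,
 * touch a common naturally aligned granule of `granule` bytes. */
int share_granule(size_t off_a, size_t size_a,
                  size_t off_b, size_t size_b, size_t granule)
{
    size_t first_a = off_a / granule;
    size_t last_a  = (off_a + size_a - 1) / granule;
    size_t first_b = off_b / granule;
    size_t last_b  = (off_b + size_b - 1) / granule;

    /* The two granule index ranges overlap. */
    return first_a <= last_b && first_b <= last_a;
}
```

With `offsetof(struct flags, ready)` and `offsetof(struct flags, done)` as the offsets, the members share a granule at longword (4-byte) granularity but not at byte granularity.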

3.7.4 Alignments of Members of Composite Data Objects

The only data objects that are candidates for participating in a word-tearing race condition are members of composite data objects---that is, C language structures, unions, and arrays. In other words, the application's threads might access data objects that are members of structures or unions, where those members occupy the same byte, word, longword, or quadword. Similarly, the application might access arrays whose elements occupy the same word, longword, or quadword, or whose elements are themselves composite data objects whose members can do so.

On the other hand, the C language specification allows the compiler to allocate scalar data objects so that each is aligned on a boundary for the memory granule that the compiler prefers, as follows:

For DEC C and DEC C++ on DIGITAL UNIX Version 4.0D (Alpha only) and OpenVMS Alpha Version 7.2 systems, alignment of scalars is always on quadword boundaries.
For DEC C and DEC C++ on OpenVMS VAX Version 7.2 and Windows NT Version 3.51 (Alpha only) systems, alignment of scalars is always on longword boundaries.

For the details of the compiler's rules for aligning scalar and composite data objects, see the DEC C and C++ compiler documentation for your application's host platforms.

3.7.5 Avoiding Granularity-Related Errors

Compaq recommends that you inspect your multithreaded application's code to determine whether a word-tearing race condition is possible for any two or more of the application's threads. That is, determine whether any two or more threads can concurrently write contiguously defined members of the same composite data object where those members occupy the same memory granule whose size is greater than or equal to the application's actual granularity.

If you find that you must change your application to avoid a word-tearing scenario, there are several approaches available. The simplest techniques require only that you change the definition of the target composite data object before recompiling the application. The following sections offer some suggestions.

3.7.5.1 Changing the Composite Data Object's Layout

If you can change the organization or layout of the composite data object's definition, you should do both of the following:

Define padding storage after each structure or union member (except the last) or add padding storage to the array's element definition. This forces all members/elements to be explicitly aligned on a unit that conforms to the application's actual granularity when compiled.
And,
(If your system's compiler offers a choice) Compile the application's modules to produce the preferred actual granularity for the application's target system.
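A minimal sketch of the padding technique, assuming a quadword (8-byte) actual granularity. The structure and member names are hypothetical. Because the two written members end up a full granule apart, they cannot fall in the same aligned granule wherever the structure itself is placed.

```c
#include <stddef.h>

#define GRANULE 8   /* assumed quadword actual granularity, in bytes */

/* Original layout: both members fall in the same granule. */
struct counters_packed {
    char hits;
    char misses;
};

/* Padded layout: explicit padding after every member except the last. */
struct counters_padded {
    char hits;
    char pad_hits[GRANULE - 1];   /* fills the rest of the granule after hits */
    char misses;
};

/* Distance in bytes between the two members that different threads write. */
size_t packed_distance(void)
{
    return offsetof(struct counters_packed, misses)
         - offsetof(struct counters_packed, hits);
}

size_t padded_distance(void)
{
    return offsetof(struct counters_padded, misses)
         - offsetof(struct counters_padded, hits);
}
```

The cost is storage: each padded member now consumes a whole granule, which can add up quickly for large arrays.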

3.7.5.2 Maintaining the Composite Data Object's Layout

If you cannot change the organization or layout of the composite data object's definition, you should do one of the following:

(On OpenVMS Alpha, OpenVMS VAX, or Windows NT for Alpha systems) Compile all application modules for byte actual granularity. Doing so automatically prevents word-tearing race conditions for structure or union members and array elements of size byte or larger that are accessed concurrently by different threads. No other program modification is required.
Or,
(On DIGITAL UNIX systems) For arrays, add the C language volatile storage qualifier to the definition of the entire array; for structures, add volatile to the declaration of only those members that share the pertinent memory granule. You must also compile the application's modules using the DEC C or DEC C++ compiler's -strong-volatile switch. Doing so causes the compiler to produce code that forces all accesses to those members to occur as atomic operations. See the description of the -strong-volatile switch in the DEC C or DEC C++ documentation and on the cc reference page.

If you must maintain the composite data object's layout and cannot change the storage qualifiers for the application's composite objects, you can instead use the technique described in the next section.

3.7.5.3 Using One Mutex Per Composite Data Object

If your source code inspection identified an array or a set of contiguously defined structure or union members that is subject to a word-tearing race condition, the program can use a mutex that is dedicated to protect all write accesses by all threads to those data objects, rather than change the definition of the composite data objects.

To use this technique, create a separate mutex for each composite data object where any members share a memory granule that is greater than or equal to the program's actual granularity. For example, given an application with quadword actual granularity, if structure members M1 and M2 occupy the same longword in structure S and those members can be written concurrently by more than one thread, then the application must create and reserve a mutex for use only to protect all write accesses by all threads to those two members.

In general, this is a less desirable technique due to performance considerations. However, if the absolute number of thread accesses to the target data objects over the application's run-time will be small, this technique provides explicit, portable correctness for all thread accesses to the target members.
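The M1 and M2 example might look like the following sketch. The structure, the mutex name `s_write_lock`, and the two writer routines are hypothetical; the point is only that every write to either member goes through the one mutex reserved for that structure.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical helper: stop on any unexpected status from a pthread routine. */
#define CHECK(call) \
    do { int status_ = (call); assert(status_ == 0); (void)status_; } while (0)

/* Hypothetical structure S: members m1 and m2 occupy the same longword. */
static struct {
    char m1;
    char m2;
} s;

/* Reserved for write accesses to s.m1 and s.m2, and for nothing else. */
static pthread_mutex_t s_write_lock = PTHREAD_MUTEX_INITIALIZER;

static void *write_m1(void *arg)
{
    int i;

    (void)arg;
    for (i = 1; i <= 100; i++) {
        CHECK(pthread_mutex_lock(&s_write_lock));
        s.m1 = (char)i;
        CHECK(pthread_mutex_unlock(&s_write_lock));
    }
    return NULL;
}

static void *write_m2(void *arg)
{
    int i;

    (void)arg;
    for (i = 1; i <= 50; i++) {
        CHECK(pthread_mutex_lock(&s_write_lock));
        s.m2 = (char)i;
        CHECK(pthread_mutex_unlock(&s_write_lock));
    }
    return NULL;
}

/* Run both writers concurrently; returns 1 if neither final value was lost. */
int run_writers(void)
{
    pthread_t t1, t2;

    CHECK(pthread_create(&t1, NULL, write_m1, NULL));
    CHECK(pthread_create(&t2, NULL, write_m2, NULL));
    CHECK(pthread_join(t1, NULL));
    CHECK(pthread_join(t2, NULL));
    return s.m1 == 100 && s.m2 == 50;
}
```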

3.7.6 Identifying Possible Word-Tearing Situations Using Visual Threads

For DIGITAL UNIX systems, the Visual Threads tool can warn the developer at application run-time that a possible word-tearing situation has been detected. Enable the UnguardedData rule before running the application. This rule causes Visual Threads to track whether any memory location in the application has been accessed using the Load Locked...Store Conditional pair of Alpha AXP instructions, then later accessed using a normal Load...Store instruction pair. See the Visual Threads product's online help for more information.

Visual Threads is available as part of the Developer's Toolkit for DIGITAL UNIX.

3.8 One-Time Initialization

Your program might have one or more routines that must be executed before any thread executes code in your facility, but that must be executed only once, regardless of the sequence in which threads start executing. For example, your program can initialize mutexes, condition variables, or thread-specific data keys---each of which must be created only once---in a one-time initialization routine.

Use the pthread_once() routine to ensure that your program's initialization routine executes only once---that is, by the first thread that attempts to initialize your program's resources. Multiple threads can call the pthread_once() routine, and DECthreads ensures that the specified routine is called only once.
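A short sketch of this pattern follows. The facility, its mutex, and the counters are hypothetical and exist only to make the single call observable; pthread_once() and PTHREAD_ONCE_INIT are the standard POSIX interface.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical helper: stop on any unexpected status from a pthread routine. */
#define CHECK(call) \
    do { int status_ = (call); assert(status_ == 0); (void)status_; } while (0)

#define NTHREADS 4

static pthread_once_t  facility_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t facility_lock;   /* created by facility_init() */
static int             init_calls;      /* demonstration only */
static int             uses;

/* One-time initialization routine: runs in whichever thread arrives first. */
static void facility_init(void)
{
    CHECK(pthread_mutex_init(&facility_lock, NULL));
    init_calls++;
}

static void *use_facility(void *arg)
{
    (void)arg;
    CHECK(pthread_once(&facility_once, facility_init));

    /* The mutex is guaranteed to exist once pthread_once() has returned. */
    CHECK(pthread_mutex_lock(&facility_lock));
    uses++;
    CHECK(pthread_mutex_unlock(&facility_lock));
    return NULL;
}

/* Start several threads that all request initialization; report how many
 * times the initialization routine actually ran. */
int count_init_calls(void)
{
    pthread_t t[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        CHECK(pthread_create(&t[i], NULL, use_facility, NULL));
    for (i = 0; i < NTHREADS; i++)
        CHECK(pthread_join(t[i], NULL));
    return init_calls;
}
```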

On the other hand, rather than use the pthread_once() routine, your program can statically initialize a mutex and a flag, then simply lock the mutex and test the flag. In many cases, this technique might be more straightforward to implement.
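The mutex-and-flag alternative can be sketched as follows, with hypothetical names throughout. Both objects are statically initialized, and the flag is tested only while the mutex is held.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical helper: stop on any unexpected status from a pthread routine. */
#define CHECK(call) \
    do { int status_ = (call); assert(status_ == 0); (void)status_; } while (0)

#define NTHREADS 4

static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static int initialized;     /* the flag; static storage starts at zero */
static int init_calls;      /* demonstration only */

/* Hypothetical one-time work. */
static void init_resources(void)
{
    init_calls++;
}

/* Every thread calls this before using the facility. */
void ensure_initialized(void)
{
    CHECK(pthread_mutex_lock(&init_lock));
    if (!initialized) {
        init_resources();
        initialized = 1;
    }
    CHECK(pthread_mutex_unlock(&init_lock));
}

static void *worker(void *arg)
{
    (void)arg;
    ensure_initialized();
    return NULL;
}

/* Report how many times the one-time work ran across several threads. */
int count_flag_init_calls(void)
{
    pthread_t t[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        CHECK(pthread_create(&t[i], NULL, worker, NULL));
    for (i = 0; i < NTHREADS; i++)
        CHECK(pthread_join(t[i], NULL));
    return init_calls;
}
```

Unlike pthread_once(), this version takes the mutex on every call, including calls made long after initialization has finished.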

Finally, you can use implicit (and nonportable) initialization mechanisms, such as OpenVMS LIB$INITIALIZE, DIGITAL UNIX dynamic loader __init_ code, or Win32 DLL initialization handlers for Windows NT and Windows 95.

3.9 Managing Dependencies Upon Other Libraries

Because multithreaded programming has become common only recently, many existing code libraries are incompatible with multithreaded routines. For example, many of the traditional C run-time library routines maintain state across multiple calls using static storage. This storage can become corrupted if routines are called from multiple threads at the same time. Even if the calls from multiple threads are serialized, code that depends upon a sequence of return values might not work.

For example, the UNIX getpwent(3) routine returns the entries in the password file in sequence. If multiple threads call getpwent(3) repeatedly, even if the calls are serialized, no thread can obtain all entries in the password file.

Library routines might be compatible with multithreaded programming to different extents. The important distinctions are thread reentrancy and thread safety.

3.9.1 Thread Reentrancy

A routine is thread reentrant if it performs correctly despite being called simultaneously or sequentially by different threads. For example, the standard C run-time library routine strtok() can be made thread reentrant most efficiently by adding an argument that specifies a context for the sequence of tokens. Thus, multiple threads can simultaneously parse different strings without interfering with each other.

The ideal thread-reentrant routine has no dependency on static data. Because static data must be synchronized using mutexes and condition variables, there is always a performance penalty due to the time required to lock and unlock the mutex and also in the loss of potential parallelism throughout the program. A routine that does not use any data that is shared between threads can proceed without locking.

If you are developing new interfaces, make sure that any persistent context information (like the last-token-returned pointer in strtok()) is passed explicitly so that multiple threads can process independent streams of information independently. Return information to the caller through routine values, output parameters (where the caller passes the address and length of a buffer), or by allocating dynamic memory and requiring the caller to free that memory when finished. Try to avoid using errno for returning error or diagnostic information; use routine return values instead.
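As an illustration of passing context explicitly, here is a sketch of a tokenizer whose position is held in a caller-supplied structure rather than in static storage. `struct token_ctx`, `token_begin`, and `token_next` are hypothetical names; like strtok(), the routine writes terminators into the string being parsed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical context: everything the tokenizer remembers between calls. */
struct token_ctx {
    char *next;   /* where the next scan starts, or NULL when finished */
};

void token_begin(struct token_ctx *ctx, char *string)
{
    ctx->next = string;
}

/* Return the next token, or NULL when the string is exhausted. */
char *token_next(struct token_ctx *ctx, const char *delims)
{
    char *start;

    if (ctx->next == NULL)
        return NULL;

    start = ctx->next + strspn(ctx->next, delims);   /* skip delimiters */
    if (*start == '\0') {
        ctx->next = NULL;
        return NULL;
    }

    ctx->next = start + strcspn(start, delims);      /* find end of token */
    if (*ctx->next != '\0')
        *ctx->next++ = '\0';                         /* terminate, step past */
    else
        ctx->next = NULL;                            /* that was the last one */
    return start;
}
```

Because each caller owns its own `struct token_ctx`, two parses can be interleaved, in one thread or across several, without disturbing each other.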

3.9.2 Thread Safety

A routine is thread safe if it can be called simultaneously from multiple threads without risk of corruption. Generally this means that it does some simple level of locking (perhaps using the DECthreads global lock) to prevent simultaneously active calls in different threads. See Section 3.9.3.3 for information about the DECthreads global lock.
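The simplest form of this locking is a wrapper that serializes every call to a routine that is not itself thread safe. In the sketch below an ordinary mutex stands in for whatever lock a library would actually use; `next_id_unsafe`, `next_id`, and the other names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical helper: stop on any unexpected status from a pthread routine. */
#define CHECK(call) \
    do { int status_ = (call); assert(status_ == 0); (void)status_; } while (0)

#define NTHREADS 4
#define CALLS    1000

/* Hypothetical legacy routine: not thread safe, it updates static storage. */
static unsigned long next_id_unsafe(void)
{
    static unsigned long last_id;

    return ++last_id;
}

static pthread_mutex_t id_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread-safe wrapper: only one call is active at a time. */
unsigned long next_id(void)
{
    unsigned long id;

    CHECK(pthread_mutex_lock(&id_lock));
    id = next_id_unsafe();
    CHECK(pthread_mutex_unlock(&id_lock));
    return id;
}

static void *take_ids(void *arg)
{
    int i;

    (void)arg;
    for (i = 0; i < CALLS; i++)
        (void)next_id();
    return NULL;
}

/* Returns the number of identifiers handed out by the worker threads. */
unsigned long ids_issued_by_threads(void)
{
    pthread_t t[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        CHECK(pthread_create(&t[i], NULL, take_ids, NULL));
    for (i = 0; i < NTHREADS; i++)
        CHECK(pthread_join(t[i], NULL));
    return next_id() - 1;   /* identifiers issued before this call */
}
```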

Thread-safe routines might be inefficient. For example, a UNIX stdio package that is thread safe might still block all threads in the process while waiting to read or write data to a file.

Routines such as localtime(3) or strtok(), which traditionally rely on static storage, can be made thread safe by using thread-specific data instead of static variables. This prevents corruption and avoids the overhead of synchronization. However, using thread-specific data is not without its own cost, and it is not always the best solution. Using an alternate, reentrant version of the routine, such as the POSIX strtok_r() interface, is preferable.
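The thread-specific data approach can be sketched with the POSIX key routines. `make_label` is a hypothetical routine of the kind that would traditionally return a pointer to a single static buffer; here each thread gets its own buffer, which the key's destructor frees when that thread terminates.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: stop on any unexpected status from a pthread routine. */
#define CHECK(call) \
    do { int status_ = (call); assert(status_ == 0); (void)status_; } while (0)

#define LABEL_SIZE 32

static pthread_key_t  label_key;
static pthread_once_t label_once = PTHREAD_ONCE_INIT;

static void make_label_key(void)
{
    /* free() is run on each thread's buffer when that thread terminates. */
    CHECK(pthread_key_create(&label_key, free));
}

/* Returns a pointer to the calling thread's private buffer, or NULL. */
char *make_label(int n)
{
    char *buf;

    CHECK(pthread_once(&label_once, make_label_key));
    buf = pthread_getspecific(label_key);
    if (buf == NULL) {                   /* first call in this thread */
        buf = malloc(LABEL_SIZE);
        if (buf == NULL)
            return NULL;
        CHECK(pthread_setspecific(label_key, buf));
    }
    snprintf(buf, LABEL_SIZE, "item-%d", n);
    return buf;
}

static void *label_in_thread(void *arg)
{
    (void)arg;
    (void)make_label(2);                 /* writes this thread's buffer only */
    return NULL;
}

/* Returns 1 if another thread's call left the caller's result untouched. */
int labels_are_independent(void)
{
    pthread_t t;
    char *mine = make_label(1);

    if (mine == NULL)
        return 0;
    CHECK(pthread_create(&t, NULL, label_in_thread, NULL));
    CHECK(pthread_join(t, NULL));
    return strcmp(mine, "item-1") == 0;
}
```

The extra key lookup and allocation on every first call are part of the cost mentioned above; an interface that takes the buffer from the caller avoids both.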
