Figure 3.4 Solaris Locks — The Big Picture




The structure shown above provides for the object type declaration. For each
synchronization object type, a type-specific structure is defined: mutex_sobj_ops
for mutex locks, rw_sobj_ops for reader/writer locks, and sema_sobj_ops for
semaphores.
The structure also provides three functions that may be called on behalf of a
kthread sleeping on a synchronization object:
• An owner function, which returns the ID of the kernel thread that owns the
synchronization object.
• An unsleep function, which transitions a kernel thread from a sleep state.
• A change_pri function, which changes the priority of a kernel thread, used for
priority inheritance. (See “Turnstiles and Priority Inheritance” on page 89.)
We will see how references to the lock’s operations structure are implemented as we
move through the specifics of lock implementations in the following sections.
It is useful to note at this point that our examination of Solaris kernel locks
offers a good example of some of the design trade-offs involved in kernel software
engineering. Building the various software components that make up the Solaris
kernel involves a series of design decisions in which performance needs are weighed
against complexity. In areas of the kernel where optimal performance is a top priority, simplicity might be sacrificed in favor of performance. The locking facilities
in the Solaris kernel are an area where such trade-offs are made—much of the lock
code is written in assembly language, for speed, rather than in the C language; the
latter is easier to code with and maintain but is potentially slower. In some cases,
when the code path is not performance critical, a simpler design will be favored
over cryptic assembly code or complexity in the algorithms. The behavior of a particular design is examined through exhaustive testing, to ensure that the best possible design decisions were made.


Mutex Locks
Mutual exclusion (mutex) locks are the most common type of synchronization
primitive used in the kernel. Mutex locks serialize access to critical data: a
kernel thread must acquire the mutex specific to the data region being protected
before it can read or write the data. The thread is the lock owner while it holds the lock, and it must release the lock when it has finished working in
the protected region so that other threads can acquire the lock for access to the protected data.


Kernel Synchronization Primitives

3.5.1 Overview
If a thread attempts to acquire a mutex lock that is being held, it can basically do
one of two things: it can spin or it can block. Spinning means the thread enters a
tight loop, attempting to acquire the lock in each pass through the loop. The term
spin lock is often used to describe this type of mutex. Blocking means the thread is
placed on a sleep queue while the lock is being held, and the kernel sends a wakeup
to the thread when the lock is released. There are pros and cons to both approaches.
The spin approach has the benefit of not incurring the overhead of context
switching, required when a thread is put to sleep, and also has the advantage of a
relatively fast acquisition when the lock is released, since there is no context-switch operation. It has the downside of consuming CPU cycles while the
thread is in the spin loop—the CPU is executing a kernel thread (the thread in the
spin loop) but not really doing any useful work.
The blocking approach has the advantage of freeing the processor to execute
other threads while the lock is being held; it has the disadvantage of requiring context switching to get the waiting thread off the processor and a new runnable
thread onto the processor. There’s also a little more lock acquisition latency, since a
wakeup and context switch are required before the blocking thread can become the
owner of the lock it was waiting for.
In addition to the issue of what to do if a requested lock is being held, the question of lock granularity needs to be resolved. Let’s take a simple example. The kernel maintains a process table, which is a linked list of process structures, one for
each of the processes running on the system. A simple table-level mutex could be
implemented, such that if a thread needs to manipulate a process structure, it
must first acquire the process table mutex. This level of locking is very coarse. It
has the advantages of simplicity and minimal lock overhead. It has the obvious
disadvantage of potentially poor scalability, since only one thread at a time can
manipulate objects on the process table. Such a lock is likely to have a great deal of
contention (become a hot lock).
The alternative is to implement a finer level of granularity: a lock-per-process
table entry versus one table-level lock. With a lock on each process table entry,
multiple threads can be manipulating different process structures at the same
time, providing concurrency. The disadvantages are that such an implementation
is more complex, increases the chances of deadlock situations, and necessitates
more overhead because there are more locks to manage.
In general, the Solaris kernel implements relatively fine-grained locking wherever possible, in large part because locks can be created dynamically along with the kernel structures they protect, so the number of locks scales with those structures as needed.
The kernel implements two types of mutex locks: spin locks and adaptive locks.
Spin locks, as we discussed, spin in a tight loop if a desired lock is being held when
a thread attempts to acquire the lock. Adaptive locks are the most common type of



lock used and are designed to dynamically either spin or block when a lock is being
held, depending on the state of the holder. We already discussed the trade-offs of
spinning versus blocking. Implementing a locking scheme that only does one or the
other can severely impact scalability and performance. It is much better to use an
adaptive locking scheme, which is precisely what we do.
The mechanics of adaptive locks are straightforward. When a thread attempts
to acquire a lock and the lock is being held, the kernel examines the state of the
thread that is holding the lock. If the lock holder (owner) is running on a processor, the thread attempting to get the lock will spin. If the thread holding the lock is
not running, the thread attempting to get the lock will block. This policy works
quite well because the code is such that mutex hold times are very short (by
design, the goal is to minimize the amount of code to be executed while a lock is
held). So, if a thread is holding a lock and running, the lock will likely be released
very soon, probably in less time than it takes to context-switch off and on again, so
it’s worth spinning.
On the other hand, if a lock holder is not running, then we know that minimally one context switch is involved before the holder will release the lock (getting
the holder back on a processor to run), and it makes sense to simply block and free
up the processor to do something else. The kernel will place the blocking thread on
a turnstile (sleep queue) designed specifically for synchronization primitives and
will wake the thread when the lock is released by the holder. (See “Turnstiles and
Priority Inheritance” on page 89.)
The other distinction between adaptive locks and spin locks has to do with interrupts, the dispatcher, and context switching. The kernel dispatcher is the code that
selects threads for scheduling and does context switches. It runs at an elevated
Priority Interrupt Level (PIL) to block interrupts (the dispatcher runs at priority
level 10 on SPARC systems). High-level interrupts (interrupt levels 11–15 on
SPARC systems) can interrupt the dispatcher. High-level interrupt handlers are
not allowed to do anything that could require a context switch or to enter the dispatcher (we discuss this further in “Dispatcher Locks” on page 97). Adaptive locks
can block, and blocking means context switching, so only spin locks can be used in
high-level interrupt handlers. Also, spin locks can raise the interrupt level of the



processor when the lock is acquired. Interrupts are covered in more detail in “Kernel Traps and Exceptions” later in this chapter.
struct kernel_data {
        kmutex_t        klock;
        char            *forw_ptr;
        char            *back_ptr;
        long            data1;
        int             data2;
} kdata;

void
function()
{
        ...
        mutex_init(&kdata.klock, ...);
        ...
        mutex_enter(&kdata.klock);
        kdata.data1 = 1;
        mutex_exit(&kdata.klock);
}

The preceding block of pseudocode illustrates the general mechanics of mutex
locks. A lock is declared in the code; in this case, it is embedded in the data structure it is designed to protect. Once declared, the lock is initialized with the kernel
mutex_init() function. Any subsequent reference to the kdata structure
requires that the klock mutex be acquired with mutex_enter(). Once the work
is done, the lock is released with mutex_exit(). The lock type, spin or adaptive,
is determined in the mutex_init() code by the kernel. Assuming an adaptive
mutex in this example, any kernel threads that make a mutex_enter() call on
klock will either block or spin, depending on the state of the kernel thread that
owns klock when the mutex_enter() is called.

3.5.2 Solaris 7 Mutex Lock Implementation
The implementation description in this section is based on Solaris 7. Algorithmically, Solaris 2.5.1 and Solaris 2.6 are very similar but have some implementation
differences, which we cover in the sections that follow.
The kernel defines different data structures for the two types of mutex locks,
adaptive and spin, as shown in Figure 3.5.



[Figure 3.5 diagrams the two lock structures; its callout notes that bit 0 of m_owner is the “waiters” bit.]

Figure 3.5 Solaris 7 Adaptive and Spin Mutex



In Figure 3.5, the m_owner field in the adaptive lock, which holds the address of
the kernel thread that owns the lock (the kthread pointer), plays a double role, in
that it also serves as the actual lock; successful lock acquisition for a thread means
it has its kthread pointer set in the m_owner field of the target lock. If threads
attempt to get the lock while it is held, they become waiters, and the low-order bit
(bit 0) of m_owner is set to reflect that case. Kthread pointers are aligned such
that bit 0 is always zero, which is what makes this encoding work.
The spin mutex, as we pointed out earlier, is used at high interrupt levels,
where context switching is not allowed. Spin locks block interrupts while in the
spin loop by raising the processor’s interrupt priority level, so the kernel must save
the priority level the processor was running at before the lock code raised it.
(Elevating the priority level is how interrupts are blocked.) The m_minspl field
stores the priority level of the interrupt handler when the lock is initialized, and
m_oldspl gets set to the priority level the processor was running at when the lock
code is called. The m_spinlock field holds the actual mutex lock bits.
Each kernel module and subsystem implementing one or more mutex locks calls
into a common set of mutex functions. All locks must first be initialized by the
mutex_init() function, where the lock type is determined on the basis of an
argument passed in the mutex_init() call. The most common type passed into
mutex_init() is MUTEX_DEFAULT, which results in the init code determining
what type of lock, adaptive or spin, should be used. It is possible for a caller of
mutex_init() to be specific about a lock type (e.g., MUTEX_SPIN), but that is
rarely done.
If the init code is called from a device driver or any kernel module that registers and generates interrupts, then an interrupt block cookie is added to the argument list. An interrupt block cookie is an abstraction used by device drivers when
they set their interrupt vector and parameters. The mutex_init() code checks
the argument list for an interrupt block cookie. If mutex_init() is being called
from a device driver to initialize a mutex to be used in a high-level interrupt handler, the lock type is set to spin. Otherwise, an adaptive lock is initialized. The test
is the interrupt level in the passed interrupt block; levels above 10 (on SPARC
systems) are considered high-level interrupts and thus require spin locks. The
init code clears most of the fields in the mutex lock structure as appropriate for
the lock type. The m_dummylock field in spin locks is set to all 1’s (0xFF). We’ll see
why in a minute.
The primary mutex functions called, aside from mutex_init() (which is only
called once for each lock at initialization time), are mutex_enter() to get a lock
and mutex_exit() to release it. mutex_enter() assumes an available, adaptive
lock. If the lock is held or is a spin lock, mutex_vector_enter() is entered to
reconcile what should happen. This is a performance optimization.
mutex_enter() is implemented in assembly code, and because the entry point is
designed for the simple case (adaptive lock, not held), the amount of code that gets
executed to acquire a lock when those conditions are true is minimal. Also, there
are significantly more adaptive mutex locks than spin locks in the kernel, making



the quick test case effective most of the time. The test for a lock held or spin lock is
very fast. Here is where the m_dummylock field comes into play. mutex_enter()
executes a compare-and-swap instruction on the first byte of the mutex, testing for
a zero value. On a spin lock, the m_dummylock field is tested because of its positioning in the data structure and the endianness of SPARC processors. Since
m_dummylock is always set (it is set to all 1’s in mutex_init()), the test will fail
for spin locks. The test will also fail for a held adaptive lock since such a lock will
have a nonzero value in the byte field being tested. That is, the m_owner field will
have a kthread pointer value for a held, adaptive lock.
If the lock is an adaptive mutex and is not being held, the caller of
mutex_enter() gets ownership of the lock. If the two conditions are not true, that
is, either the lock is held or the lock is a spin lock, the code enters
mutex_vector_enter(). mutex_vector_enter() first tests the lock type. For spin locks, the
m_oldspl field is set, based on the current Priority Interrupt Level (PIL) of the
processor, and the lock is tested. If it’s not being held, the lock is set (m_spinlock)
and the code returns to the caller. A held lock forces the caller into a spin loop,
where a loop counter is incremented (for statistical purposes; the lockstat(1M)
data), and the code checks whether the lock is still held in each pass through the
loop. Once the lock is released, the code breaks out of the loop, grabs the lock, and
returns to the caller.
Adaptive locks require a little more work. When the code enters the adaptive
code path, it increments the cpu_sysinfo.mutex_adenters (adaptive lock enters) field, as is reflected in the
smtx column in mpstat(1M). mutex_vector_enter() then tests again to determine if the lock is owned (held), since the lock may have been released in the time
interval between the call to mutex_enter() and the current point in the
mutex_vector_enter() code. If the adaptive lock is not being held,
mutex_vector_enter() attempts to acquire the lock. If successful, the code returns to the caller.
If the lock is held, mutex_vector_enter() determines whether or not the lock
owner is running by looping through the CPU structures and testing the lock
m_owner against the cpu_thread field of the CPU structure. (cpu_thread contains the kernel thread address of the thread currently executing on the CPU.) A
match indicates the holder is running, which means the adaptive lock will spin. No
match means the owner is not running, in which case the caller must block. In the
blocking case, the kernel turnstile code is entered to locate or acquire a turnstile,
in preparation for placement of the kernel thread on a sleep queue associated with
the turnstile.
If mutex_vector_enter() determines that the lock holder is not running, it makes
a turnstile call to look up the turnstile, sets the waiters bit in the lock, and
retests to see if the owner is running. If yes, the code releases the turnstile and
enters the adaptive lock spin loop, which attempts to acquire the lock. Otherwise,



the code places the kernel thread on a turnstile (sleep queue) and changes the
thread’s state to sleep. That effectively concludes the sequence of events in
mutex_vector_enter().
Dropping out of mutex_vector_enter(), either the caller ended up with the
lock it was attempting to acquire, or the calling thread is on a turnstile sleep
queue associated with the lock. In either case, the lockstat(1M) data is updated,
reflecting the lock type, spin time, or sleep time as the last bit of work done in
mutex_vector_enter().
lockstat(1M) is a kernel lock statistics command that was introduced in
Solaris 2.6. It provides detailed information on kernel mutex and reader/writer locks.
The algorithm described in the previous paragraphs is summarized in
pseudocode below.
if (lock is a spin lock)
        lock_set_spl() /* enter spin-lock-specific code path */
increment cpu_sysinfo.mutex_adenters
spin_loop:
if (lock is not owned)
        mutex_trylock() /* try to acquire the lock */
        if (lock acquired)
                goto bottom
        else
                continue /* lock grabbed by another thread; try again */
if (lock owner is running on a processor)
        goto spin_loop
lookup turnstile for the lock
set waiters bit
if (lock owner is running on a processor)
        drop turnstile
        goto spin_loop
block /* on the sleep queue associated with the turnstile */
bottom:
update lockstat statistics

When a thread has finished working in a lock-protected data area, it calls the
mutex_exit() code to release the lock. The entry point is implemented in assembly language and handles the simple case of freeing an adaptive lock with no waiters. With no threads waiting for the lock, it’s a simple matter of clearing the lock
fields (m_owner) and returning. The C language function mutex_vector_exit()
is entered from mutex_exit() for anything but the simple case.
In the case of a spin lock, the lock field is cleared and the processor is returned
to the PIL level it was running at prior to entering the lock code. For adaptive
locks, a waiter must be selected from the turnstile (if there is more than one
waiter), have its state changed from sleeping to runnable, and be placed on a dispatch queue so it can execute and get the lock. If the thread releasing the lock was



the beneficiary of priority inheritance, meaning that it had its priority improved
when a calling thread with a better priority was not able to get the lock, then the
thread releasing the lock will have its priority reset to what it was prior to the
inheritance. Priority inheritance is discussed in “Turnstiles and Priority Inheritance” on page 89.
When an adaptive lock is released, the code clears the waiters bit in m_owner
and calls the turnstile function to wake up all the waiters. Readers familiar with
sleep/wakeup mechanisms of operating systems have likely heard of a particular
behavior known as the “thundering herd problem,” a situation where many threads
that have been blocking for the same resource are all woken up at the same time
and make a mad dash for the resource (a mutex in this case)—like a herd of large,
four-legged beasts running toward the same object. System behavior tends to go
from a relatively small run queue to a large run queue (all the threads have been
woken up and made runnable) and high CPU utilization until a thread gets the
resource, at which point a bunch of threads are sleeping again, the run queue normalizes, and CPU utilization flattens out. This is a generic behaviour that can
occur on any operating system.
The wakeup mechanism used when mutex_vector_exit() is called in Solaris
7 may seem like an open invitation to thundering herds, but in practice it turns
out not to be a problem. The main reason is that the blocking case for threads
waiting for a mutex is rare; most of the time the threads will spin. If a blocking situation does arise, it typically does not reach a point where very many threads are
blocked on the mutex—one of the characteristics of the thundering herd problem is
resource contention resulting in a lot of sleeping threads. The kernel code segments that implement mutex locks are, by design, short and fast, so locks are not
held for long. Code that requires longer lock-hold times uses a reader/writer write
lock, which provides mutual exclusion semantics with a selective wakeup algorithm. There are, of course, other reasons for choosing reader/writer locks over
mutex locks, the most obvious being to allow multiple readers to see the protected data concurrently.
In the following sections, we differentiate the implementation of mutexes in
Solaris 2.5.1 and 2.6. As we said, Solaris 2.5.1 and 2.6 are similar to the previously described Solaris 7 behavior, especially in the area of lock acquisition
(mutex_enter()). The more salient differences exist in the lock release algorithm and associated wakeup behavior.

Solaris 2.6 Mutex Implementation Differences
First, an examination of the lock structures, as illustrated in Figure 3.6.







Figure 3.6 Solaris 2.6 Mutex
Solaris 2.6 defines all possible mutex lock types within the same structure. The
spin lock is the same as for Solaris 7, with the addition of a type field. The adaptive mutex has more fields, which are fairly self-descriptive. m_owner_lock is the
same as m_owner in Solaris 7; it is the lock itself, and the value represents the
kthread ID of the holder when the lock is held. m_waiters stores the turnstile ID
of a waiting kernel thread, and m_wlock is a dispatcher lock (see “Dispatcher
Locks” on page 97) that synchronizes access to the m_waiters field in the mutex.
m_type describes the lock type (adaptive or spin).
The structure differences aside, the basic algorithm and implementation of
assembly and C code are essentially the same in Solaris 2.6 for lock acquisition
(mutex_enter()). The differences exist in the sleep mechanism—the turnstile
implementation has been improved in Solaris 7. These differences are in the interfaces and subroutines called to facilitate turnstile allocation and priority inheritance. (See “Turnstiles and Priority Inheritance” on page 89.)
On the lock release side, Solaris 2.6 implements separate routines for releasing
spin and adaptive locks, the adaptive routine being mutex_adaptive_release(). When an adaptive lock with waiting
threads is released, the 2.6 code wakes only the thread with the highest priority. If
multiple threads of the same (highest) priority are waiting for the same lock, the
thread waiting longest is selected.

Solaris 2.5.1 Mutex Implementation Differences
The data structure of Solaris 2.5.1 looks very much like that of Solaris 2.6. It has
another possible instantiation of an adaptive lock, m_adaptive2, which combines
the lock owner and lock bits in the same field of the adaptive mutex data structure, as illustrated in Figure 3.7.









Figure 3.7 Solaris 2.5.1 Adaptive Mutex
The implementation of the mutex code in Solaris 2.5.1 is quite different from the
other releases, although algorithmically the behavior is essentially the same.
Solaris 2.5.1 defines a mutex operations vector array. Each element in the array is
a mutex_ops structure, one structure for each type of mutex defined in Solaris
2.5.1. The array is indexed according to the lock type and the function call. The
design is such that specific functions handle lock operations for each of the different types of mutex locks. Common entry points into the mutex code pass control to
the lock-specific functions by switching through the operations array, as determined by the lock type and function offset. This is illustrated in Figure 3.8.
[Figure 3.8 depicts the vectoring: a call such as mutex_init(..., SPIN, ...) enters the generic mutex_init() entry point, which calculates an array offset based on the lock type and function and transfers control to the type-specific routine that does the actual work.]

Figure 3.8 Solaris 2.5.1 Mutex Operations Vectoring
Solaris 2.5.1 implements a wakeup method similar to that of Solaris 2.6 when an
adaptive mutex is released with waiting threads; the highest priority waiter is



selected for wakeup and should get ownership of the lock. It is, of course, possible
for another thread to come along and acquire the lock before the wakeup is completed, so the newly awakened thread is not guaranteed to get the lock.
Finally, note that the lockstat utility was first implemented in Solaris 2.6 and
thus is not part of the 2.5.1 release.

Why the Mutex Changes in Solaris 7
We wanted to avoid getting mired in subtle differences that would not add real
value to the text, but at the same time we want to point out relevant differences in
implementations. The subsections describing the implementation differences
across releases serve two purposes. The first and most obvious is completeness, to
meet our goal of covering multiple Solaris releases. Second, and even more compelling, we show the evolution and refinements that have gone into the lock code from
release to release. What we see is something that is functionally similar but significantly scaled down in size. The actual lock manipulation and turnstile code follow the same trend shown in the data structures—scaled down, leaner, and more efficient.
The rationale for changing the wakeup behavior in Solaris 7 stems from
exhaustive testing and examination of all possible options; then, the designers
selected what works best most of the time.
To summarize, when a mutex lock is being released with waiters, there are
essentially three options.
• Choose a waiting thread from the turnstile and give it lock ownership before
waking it. This approach is known as direct handoff of ownership.
This approach has a downside when there is even moderate lock contention,
in that it defeats the logic of adaptive locks. If direct handoff is used, there is
a window between the time the thread was given ownership and the time the
new owner is running on a processor (it must go through a state change and
get context-switched on). If, during that window, other threads come along
trying to get the lock, they will enter the blocking code path, since the “is the
lock owner running” test will fail. A lock with some contention will result in
most threads taking the blocking path instead of the spin path.
• Free the lock and wake one waiter, as Solaris 2.5.1 and Solaris 2.6 do.
This option has a much more subtle complication because of the potential for
a lock to have no owner, but multiple waiters. Consider a lock having multiple waiters. If the lock is released and the code issues a wakeup on one
waiter, the lock will have no owner (it was just released). On a busy system, it
could take several seconds for the selected thread to be context-switched and
take ownership of the lock. If another thread comes along and grabs the lock,
the mutex_enter() code must examine the blocking chain (all the threads
sitting on a turnstile waiting for the lock) to determine if it should inherit a
higher priority from a sleeping thread. This complicates the mutex_enter()
code, putting an unnecessary burden on a hot code path.