Solaris Memory Architecture

of the machine-specific page structure are hidden from the generic kernel—only
the HAT machine-specific layer can see or manipulate its contents. Figure 5.19
shows how each page structure is embedded in a machine-dependent structure.

[Figure: struct page embedded at the start of the machine-specific structure,
together with HAT information about this page's translation to physical memory
and software copies of the page reference and modified bits.]

Figure 5.19 Machine-Specific Page Structures: sun4u Example

The machine-specific page contains a pointer to the HAT-specific mapping information, and information about the page's HAT state is stored in the machine-specific
machpage. The stored information includes bits that indicate whether the page has
been referenced or modified, for use by the page scanner (covered later in the chapter). Both the machine-independent and machine-dependent page structures share
the same start address in memory, so a pointer to a page structure can be cast to a
pointer to a machine-specific page structure (see Figure 5.19). Macros for converting between machine-independent pages and machine-dependent page structures
make the cast.
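The shared start address can be illustrated with a small Python sketch using ctypes. The field layouts below are invented for illustration; they are not the real Solaris structure definitions.

```python
import ctypes

class Page(ctypes.Structure):
    """Stand-in for the machine-independent struct page (fields invented)."""
    _fields_ = [("p_vnode", ctypes.c_void_p),
                ("p_offset", ctypes.c_size_t)]

class MachPage(ctypes.Structure):
    """Stand-in for the machine-dependent page (sun4u-style machpage)."""
    _fields_ = [("p", Page),                # generic page placed first
                ("p_ref", ctypes.c_uint8),  # software referenced bit
                ("p_mod", ctypes.c_uint8)]  # software modified bit

mp = MachPage()
# Because the generic page is the first member, both structures share the
# same start address...
assert ctypes.addressof(mp) == ctypes.addressof(mp.p)
# ...so a pointer to one can be reinterpreted as a pointer to the other,
# which is what the conversion macros do.
as_page = ctypes.cast(ctypes.pointer(mp), ctypes.POINTER(Page))
as_mach = ctypes.cast(as_page, ctypes.POINTER(MachPage))
assert ctypes.addressof(as_mach.contents) == ctypes.addressof(mp)
```

The same technique is common in C kernels: embedding the generic structure as the first member makes the cast in either direction a no-op at runtime.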

5.7.4 Physical Page Lists
The Solaris kernel uses a segmented global physical page list, consisting of segments of contiguous physical memory. (Many hardware platforms now present
memory in noncontiguous groups.) Contiguous physical memory segments are
added during system boot. They are also added and deleted dynamically when
physical memory is added and removed while the system is running. Figure 5.20
shows the arrangement of the physical page lists into contiguous segments.

[Figure: the physical page list arranged as struct memseg entries, each
referencing a contiguous segment of physical pages.]

Figure 5.20 Contiguous Physical Memory Segments

Free List and Cache List

The free list and the cache list hold pages that are not mapped into any address
space and that have been freed by page_free(). The sum of these lists is
reported in the free column in vmstat. Even though vmstat reports these pages
as free, they can still contain a valid page from a vnode/offset and hence are still
part of the global page cache. Pages that are caching files in the page cache can
appear on the free list. Memory on the cache list is not really free; it is a valid
cache of a page from a file. The cache list exemplifies how the file systems use
memory as a file system cache.
The free list contains pages that no longer have a vnode and offset associated
with them—which can only occur if the page has been destroyed and removed from
a vnode’s hash list. The free list is generally very small, since most pages that are
no longer used by a process or the kernel still keep their vnode/offset information
intact. Pages are put on the free list when a process exits, at which point all of the
anonymous memory pages (heap, stack, and copy-on-write pages) are freed.
The cache list is a hashed list of pages that still have a valid vnode and
offset. Recall that pages can be obtained from the cache list by the
page_lookup() routine. This function accepts a vnode and offset as arguments and returns a page structure. If the page is found on the cache list, then the
page is removed from the cache list and returned to the caller. When we find and
remove pages from the cache list, we are reclaiming a page. Page reclaims are
reported by vmstat in the “re” column.
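The relationship between the page hash, the cache list, and the free list can be sketched with a toy Python model. The semantics here are deliberately simplified (no locking, a dictionary instead of real hash chains), and the class and method names only echo the kernel routines.

```python
# Toy model of the global page cache: freed pages that keep their
# vnode/offset identity go to the cache list; destroyed pages go to the
# free list. Looking up a cached page removes it from the cache list,
# which is a "reclaim" (vmstat's "re" column).
class PageLists:
    def __init__(self):
        self.hashed = {}        # (vnode, offset) -> page
        self.cache_list = []    # freed pages that kept their vnode/offset
        self.free_list = []     # freed pages with no identity left
        self.reclaims = 0       # what vmstat would report in "re"

    def page_free(self, page):
        if page.get("vnode") is not None:
            self.hashed[(page["vnode"], page["offset"])] = page
            self.cache_list.append(page)
        else:
            self.free_list.append(page)

    def page_lookup(self, vnode, offset):
        page = self.hashed.get((vnode, offset))
        if page is not None and page in self.cache_list:
            self.cache_list.remove(page)   # reclaim: page back in use
            self.reclaims += 1
        return page

pl = PageLists()
pl.page_free({"vnode": "v1", "offset": 0})     # still a valid cached page
pl.page_free({"vnode": None, "offset": None})  # destroyed page
assert len(pl.cache_list) == 1 and len(pl.free_list) == 1
assert pl.page_lookup("v1", 0) is not None     # reclaim from the cache list
assert pl.reclaims == 1 and pl.cache_list == []
```

The model shows why the free list stays small: only pages stripped of their vnode/offset land there, while everything else remains findable in the page cache until reclaimed.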

5.7.5 The Page-Level Interfaces
The Solaris virtual memory system implementation has grouped page management and manipulation into a central group of functions. These functions are used
by the segment drivers and file systems to create, delete, and modify pages. The
major page-level interfaces are shown in Table 5-10.
Table 5-10 Solaris 7 Page Level Interfaces

Method              Description
page_create()       Creates pages. Page coloring is based on a hash of
                    the vnode offset. page_create() is provided for
                    backward compatibility only; don't use it if you
                    don't have to. Instead, use the page_create_va()
                    function so that pages are correctly colored.
page_create_va()    Creates pages, taking into account the virtual
                    address they will be mapped to. The address is used
                    to calculate page coloring.
page_exists()       Tests that a page for vnode/offset exists.
page_find()         Searches the hash list for a page with the specified
                    vnode and offset that is known to exist and is
                    already locked.
page_first()        Finds the first page on the global page hash list.
page_free()         Frees a page. Pages with vnode/offset go onto the
                    cache list; other pages go onto the free list.
page_isfree()       Checks whether a page is on the free list.
page_ismod()        Checks whether a page is modified. This function
                    checks only the software bit in the page structure.
                    To sync the MMU bits with the page structure, you
                    may need to call hat_pagesync() before calling
                    page_ismod().
page_isref()        Checks whether a page has been referenced; checks
                    only the software bit in the page structure. To sync
                    the MMU bits with the page structure, you may need
                    to call hat_pagesync() before calling page_isref().
page_isshared()     Checks whether a page is shared across more than
                    one address space.



Table 5-10 Solaris 7 Page Level Interfaces (Continued)

Method                 Description
page_lookup()          Finds a page representing the specified
                       vnode/offset. If the page is found on a free
                       list, it is removed from the free list.
page_lookup_nowait()   Finds a page representing the specified
                       vnode/offset that is not locked or on the free
                       list.
page_needfree()        Informs the VM system that we need some pages
                       freed up. Calls to page_needfree() must be
                       symmetric; that is, each call must be followed by
                       another page_needfree() with the same amount of
                       memory multiplied by -1, after the task is
                       complete.
page_next()            Finds the next page on the global page hash list.
The page_create_va() function allocates pages. It takes the number of pages to
allocate as an argument and returns a page list linked with the pages that have
been taken from the free list. page_create_va() also takes a virtual address as
an argument so that it can implement page coloring (discussed in Section 5.7.8,
“Page Coloring,” on page 174). The new page_create_va() function subsumes
the older page_create() function and should be used by all newly developed subsystems because page_create() may not correctly color the allocated pages.

5.7.6 The Page Throttle
Solaris implements a page creation throttle so a small core of memory is available
for consumption by critical parts of the kernel. The page throttle, implemented in
the page_create() and page_create_va() functions, causes page creation to
block when the PG_WAIT flag is specified and available memory is less than the
system global, throttlefree. By default, the system global parameter throttlefree is set to the same value as the system global parameter minfree. By
default, memory allocated through the kernel memory allocator specifies PG_WAIT
and is subject to the page-create throttle. (See Section 6.2, "Kernel Memory Allocation," on page 212 for more information on kernel memory allocation.)
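The throttle check itself is simple enough to sketch. This is an illustrative Python model of the decision, not the kernel code; the real check lives inside page_create()/page_create_va() and actually blocks the caller rather than returning a flag.

```python
# Sketch of the page-create throttle decision, assuming simplified names.
PG_WAIT = 0x1

def must_throttle(flags, freemem, throttlefree):
    """Return True if a page create must wait for memory to be freed."""
    return bool(flags & PG_WAIT) and freemem < throttlefree

# Kernel-memory allocations pass PG_WAIT, so they are subject to the throttle:
assert must_throttle(PG_WAIT, freemem=50, throttlefree=100) is True
# Without PG_WAIT the request is not blocked by the throttle:
assert must_throttle(0, freemem=50, throttlefree=100) is False
# With memory above throttlefree, PG_WAIT callers proceed immediately:
assert must_throttle(PG_WAIT, freemem=200, throttlefree=100) is False
```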

5.7.7 Page Sizes
The Solaris kernel uses a fundamental page size that varies according to the
underlying hardware. On UltraSPARC and beyond, the fundamental page size is 8
Kbytes. The hardware on which Solaris runs has several different types of memory management units, which support a variety of page sizes, as listed in Table 5-11.



Table 5-11 Page Sizes on Different Sun Platforms

System Type                MMU Page Sizes       Solaris 2.x Page Size
Early SPARC systems        4K, 4M               4K, 4M
microSPARC-I, -II          4K, 64K, 512K, 4M    8K, 4M
Intel x86 architecture     4K, 4M               4K, 4M

The optimal MMU page size is a trade-off between performance and memory size
efficiency. A larger page size has less memory management overhead and hence
better performance, but a smaller page size wastes less memory (memory is
wasted when a page is not completely filled). (See “Large Pages” on page 200 for
further information on large pages.)

5.7.8 Page Coloring
Some interesting effects result from the organization of pages within the processor caches, and as a result, the page placement policy within these caches can dramatically affect processor performance. When pages overlay other pages in the
cache, they can displace cache data that we might not want overlaid, resulting in
less cache utilization and “hot spots.”
The optimal placement of pages in the cache often depends on the memory
access patterns of the application; that is, is the application accessing memory in a
random order, or is it doing some sort of strided ordered access? Several different
algorithms can be selected in the Solaris kernel to implement page placement; the
default attempts to provide the best overall performance.
To understand how page placement can affect performance, let’s look at the
cache configuration and see when page overlaying and displacement can occur. The
UltraSPARC-I and -II implementations use virtually addressed L1 caches and
physically addressed L2 caches. The L2 cache is arranged in lines of 64 bytes, and
transfers are done to and from physical memory in 64-byte units. Figure 5.27 on
page 194 shows the architecture of the UltraSPARC-I and -II CPU modules with
their caches. The L1 cache is 16 Kbytes, and the L2 (external) cache can vary
between 512 Kbytes and 8 Mbytes. We can query the operating system with adb to
see the size of the caches reported to the operating system. The L1 cache sizes are



recorded in the vac_size parameter, and the L2 cache size is recorded in the
ecache_size parameter.
# adb -k
physmem 7a97



We’ll start by using the L2 cache as an example of how page placement can affect
performance. The physical addressing of the L2 cache means that the cache is
organized in page-sized multiples of the physical address space, which means that
the cache effectively has only a limited number of page-aligned slots. The number
of effective page slots in the cache is the cache size divided by the page size. To
simplify our examples, let’s assume we have a 32-Kbyte L2 cache (much smaller
than reality), which means that if we have a page size of 8 Kbytes, there are four
page-sized slots on the L2 cache. The cache does not necessarily read and write
8-Kbyte units from memory; it does that in 64-byte chunks, so in reality our
32-Kbyte cache has 512 addressable slots. Figure 5.21 shows how our cache
would look if we laid it out linearly.
[Figure: cache offsets mapping one-to-one onto physical address offsets.]

Figure 5.21 Physical Page Mapping into a 32-Kbyte Physical Cache
The L2 cache is direct-mapped from physical memory. If we were to access physical addresses on a 32-Kbyte boundary, for example, offsets 0 and 32768, then both
memory locations would map to the same cache line. If we were now to access
these two addresses, we cause the cache lines for the offset 0 address to be read,
then flushed (cleared), the cache line for the offset 32768 address to be read in, and
then flushed, then the first reloaded, etc. This ping-pong effect in the cache is
known as cache flushing (or cache ping-ponging), and it effectively reduces our performance to that of real-memory speed, rather than cache speed. By accessing



memory on our 32-Kbyte cache-size boundary, we have effectively used only 64
bytes of the cache (a cache line size), rather than the full cache size. Memory is
often up to 10–20 times slower than cache and so can have a dramatic effect on
performance.
Our simple example was based on the assumption that we were accessing physical memory in a regular pattern, but we don't program to physical memory; rather,
we program to virtual memory. Therefore, the operating system must provide a
sensible mapping between virtual memory and physical memory; otherwise, effects
such as our example can occur.
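The direct-mapped index computation from the example is easy to verify numerically. The sketch below uses the chapter's assumed geometry (32-Kbyte cache, 64-byte lines):

```python
# A 32-Kbyte direct-mapped cache with 64-byte lines has 512 addressable
# slots; the slot is purely a function of the physical address.
CACHE_SIZE = 32 * 1024
LINE_SIZE = 64
NUM_LINES = CACHE_SIZE // LINE_SIZE      # 512 line slots

def cache_line(paddr):
    return (paddr // LINE_SIZE) % NUM_LINES

assert NUM_LINES == 512
# Offsets 0 and 32768 are exactly one cache size apart, so they collide on
# the same line, producing the ping-pong effect described above:
assert cache_line(0) == cache_line(32768) == 0
# An address within the cache-size window lands on its own line:
assert cache_line(8192) == 128
```

Any two addresses that differ by a multiple of the cache size collide this way, which is exactly why the virtual-to-physical page assignment matters.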
By default, physical pages are assigned to an address space in the order in
which they appear on the free list. In general, the first time a machine boots, the
free list may have physical memory in a linear order, and we may end up with the
behavior described in our “ping pong” example. Once a machine has been running,
the physical page free list will become randomly ordered, and subsequent reruns of
an identical application could get very different physical page placement and, as a
result, very different performance. On early Solaris implementations, this is
exactly what customers saw—performance that differed between identical runs by
as much as 30 percent.
To provide better and consistent performance, the Solaris kernel uses a page coloring algorithm when pages are allocated to a virtual address space. Rather than
being randomly allocated, the pages are allocated with a specific predetermined
relationship between the virtual address to which they are being mapped and their
underlying physical address. The virtual-to-physical relationship is predetermined as follows: the free list of physical pages is organized into specifically colored bins, one color bin for each slot in the physical cache; the number of color bins
is determined by the ecache size divided by the page size. (In our example, there
would be exactly four colored bins.)
When a page is put on the free list, the page_free() algorithms assign it to a
color bin. When a page is consumed from the free list, the virtual-to-physical algorithm takes the page from a physical color bin, chosen as a function of the virtual
address to which the page will be mapped. The algorithm requires that
when allocating pages from the free list, the page create function must know the
virtual address to which a page will be mapped.
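The bin arithmetic for the running example (32-Kbyte ecache, 8-Kbyte pages, hence four bins) can be sketched as follows. The bin-selection function shown is the simple direct style, not necessarily the kernel's default hashed algorithm.

```python
# Color-bin bookkeeping for the four-bin example: the number of bins is the
# ecache size divided by the page size, and (in the direct style) the bin
# follows the page-aligned virtual address.
PAGE_SIZE = 8 * 1024
ECACHE_SIZE = 32 * 1024
NUM_BINS = ECACHE_SIZE // PAGE_SIZE          # 4 color bins

def color_bin(vaddr):
    return (vaddr // PAGE_SIZE) % NUM_BINS

assert NUM_BINS == 4
# Virtual pages one cache size apart reuse the same bin...
assert color_bin(0x0) == color_bin(0x8000) == 0
# ...while consecutive virtual pages cycle through all four bins, so a
# sequential scan of virtual memory touches every cache slot evenly:
assert [color_bin(i * PAGE_SIZE) for i in range(4)] == [0, 1, 2, 3]
```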
New pages are allocated by calling the page_create_va() function. The
page_create_va() function accepts as an argument the virtual address of the
location to which the page is going to be mapped; the virtual-to-physical color-bin
algorithm can then decide which color bin to take physical pages from. The
page_create_va() function is described with the page management functions in
Table 5-10 on page 204.
Note: The page_create_va() function deprecates the older page_create()
function. We chose to add a new function rather than add an additional
argument to the existing page_create() function so that existing third-party loadable kernel modules that call page_create() remain functional. However,
because page_create() does not know about virtual addresses, it has to pick
a color at random, which can cause significant performance degradation. The
page_create_va() function should always be used for new code.
No one algorithm suits all applications because different applications have different memory access patterns. Over time, the page coloring algorithms used in the
Solaris kernel have been refined as a result of extensive simulation, benchmarks,
and customer feedback. The kernel supports a default algorithm and two optional
algorithms. The default algorithm was chosen according to the following criteria:
• Fairly consistent, repeatable results
• Good overall performance for the majority of applications
• Acceptable performance across a wide range of applications
The default algorithm uses a hashed algorithm to distribute pages as evenly as
possible throughout the cache. The default and the other available page coloring
algorithms are shown in Table 5-12.
Table 5-12 Solaris Page Coloring Algorithms

Algorithm           Description
Hashed VA           The physical page color bin is chosen by a hashed
                    algorithm to ensure even distribution of virtual
                    addresses across the cache.
P. Addr = V. Addr   The physical page color is chosen so that physical
                    addresses map directly to the virtual addresses (as
                    in our example).
Bin Hopping         Physical pages are allocated with a round-robin
                    method.
Best Bin            Kessler best bin algorithm. Keeps a per-process
                    history of used colors and chooses the least used
                    color; if there are several, uses the largest bin.
The Ultra Enterprise 10000 has a different default algorithm, which tries to
distribute colors evenly across each process's address space so that no one color
is used more than another. This algorithm does the right thing most of the time,
but in some cases the hashed or direct (0 or 1) algorithms can perform better.
You can change the default algorithm by setting the system parameter
consistent_coloring, either on-the-fly with adb or permanently in /etc/system.
# adb -kw
physmem 7a97
consistent_coloring/W 1




So, which algorithm is best? Well, your mileage will vary, depending on your application. Page coloring usually only makes a difference on memory-intensive scientific applications, and the defaults are usually fine for commercial or database
systems. If you have a time-critical scientific application, then we recommend that
you experiment with the different algorithms and see which is best. Remember
that some algorithms will produce different results for each run, so aggregate as
many runs as possible.


5.8 The Page Scanner
The page scanner is the memory management daemon that manages systemwide
physical memory. The page scanner and the virtual memory page fault mechanism are the core of the demand-paged memory allocation system used to manage
Solaris memory. When there is a memory shortage, the page scanner runs to steal
memory from address spaces by taking pages that haven't been used recently,
syncing them up with their backing store (swap space if they are anonymous
pages), and freeing them. If paged-out virtual memory is required again by an



address space, then a memory page fault occurs when the virtual address is referenced and the pages are recreated and copied back from their backing store.
The balancing of page stealing and page faults determines which parts of virtual memory will be backed by real physical memory and which will be moved out
to swap. The page scanner does not understand the memory usage patterns or
working sets of processes; it only knows reference information on a physical
page-by-page basis. This policy is often referred to as global page replacement; the
alternative, process-based page management, is known as local page replacement.
The subtleties of which pages are stolen govern the memory allocation policies
and can affect different workloads in different ways. During the life of the Solaris
kernel, only two significant changes in memory replacement policies have been made:
• Enhancements to minimize page stealing from extensively shared libraries
and executables
• Priority paging to prevent application, shared library, and executable paging
on systems with ample memory
We discuss these changes in more detail when we describe page scanner implementation.

5.8.1 Page Scanner Operation
The page scanner tracks page usage by reading a per-page hardware bit from the
hardware MMU for each page. Two bits are kept for each page; they indicate
whether the page has been modified or referenced since the bits were last cleared.
The page scanner uses the bits as the fundamental data to decide which pages of
memory have been used recently and which have not.
The page scanner is a kernel thread, which is awakened when the amount of
memory on the free-page list falls below a system threshold, typically 1/64th of
total physical memory. The page scanner scans through pages in physical page
order, looking for pages that haven’t been used recently to page out to the swap
device and free. The algorithm that determines whether pages have been used
resembles a clock face and is known as the two-handed clock algorithm. This algorithm views the entire physical page list as a circular list, where the last physical



page wraps around to the first. Two hands sweep through the physical page list, as
shown in Figure 5.22.

[Figure: two hands sweeping a circular list of physical pages; the front hand
clears reference bits and the back hand writes unreferenced pages to swap.]

Figure 5.22 Two-Handed Clock Algorithm
The two hands, the front hand and back hand, rotate clockwise in page order
around the list. The front hand rotates ahead of the back hand, clearing the referenced and modified bits for each page. The trailing back hand then inspects the
referenced and modified bits some time later. Pages that have not been referenced
or modified are swapped out and freed. The rate at which the hands rotate around
the page list is controlled by the amount of free memory on the system, and the
gap between the front hand and back hand is fixed by a boot-time parameter,
handspreadpages.
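The two-handed sweep can be sketched as a toy Python simulation. The page list, hand spread, and re-reference behavior here are all made up for illustration; the real scanner operates on hardware reference bits and pages out via the swap device.

```python
# Toy two-handed clock: the front hand clears reference bits, the back hand
# trails by `handspread` pages and frees anything still unreferenced.
def clock_scan(npages, handspread, nticks, rereferenced=()):
    """Return indices of pages the back hand would free."""
    ref = [1] * npages            # all pages referenced initially
    freed = []
    for tick in range(nticks):
        front = tick % npages
        ref[front] = 0            # front hand clears the reference bit
        if front in rereferenced: # page touched again before the back hand
            ref[front] = 1
        if tick >= handspread:    # back hand trails the front hand
            back = (front - handspread) % npages
            if ref[back] == 0:    # untouched since cleared: steal and free
                freed.append(back)
    return freed

# With a spread of 2 over 8 pages, page 1 (re-referenced between the hands)
# survives the sweep while the other inspected pages are freed:
assert clock_scan(8, handspread=2, nticks=8, rereferenced={1}) == [0, 2, 3, 4, 5]
```

The spread between the hands is effectively the grace period a page gets to prove it is still in use; a larger handspreadpages gives busy pages more time to be re-referenced.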

5.8.2 Page-out Algorithm and Parameters
The page-out algorithm is controlled by several parameters, some of which are calculated at system startup by the amount of memory in the system, and some of
which are calculated dynamically based on memory allocation and paging activity.
The parameters that control the clock hands do two things: they control the rate
at which the scanner scans through pages, and they control the time (or distance)
between the front hand and the back hand. The distance between the back hand
and the front hand is handspreadpages and is expressed in units of pages. The
maximum distance between the front hand and back hand defaults to half of memory and is capped at 8,192 pages, or 64 Mbytes. Systems with 128 Mbytes or more
of memory always default this distance to 8,192 pages, or 64 Mbytes.

Scan Rate Parameters (Assuming No Priority Paging)
The scanner starts scanning when free memory falls below lotsfree pages plus
a small buffer factor, deficit. The scanner starts scanning at a
rate of slowscan pages per second at this point and gets faster as the amount of
free memory approaches zero. The system parameter lotsfree is calculated at
startup as 1/64th of memory, and the parameter deficit is either zero or a small
number of pages—set by the page allocator at times of large memory allocation to