11.3 Data Locality, Bandwidth, and Latency

Example 11.11 Program to Sum a Series of Numbers
#include "timing.h"
#define SIZE 2*1024*1024
#define RPT 100
double array[SIZE];
int main()
{
  int index, count;
  double totalf;
  for (index=0; index<SIZE; index++) {array[index]=0;}
  totalf=0;
  starttime();
  for (count=0; count<RPT; count++)
  {
    for (index=0; index<SIZE; index++) {totalf+=array[index];}
    totalf=totalf*5.7;
  }
  endtime(SIZE*RPT);
  return (int)totalf;
}

As Example 11.12 shows, the application takes about 24ns per iteration without prefetch, and about 5ns per iteration with prefetch.
Example 11.12 Code Run with and without Prefetch
% cc -xO3 -xtarget=ultra3 -xprefetch=no ex11.11.c
% time a.out
Time per iteration 23.86 ns
% cc -xO3 -xtarget=ultra3 -xprefetch ex11.11.c
% time a.out
Time per iteration 5.14 ns

11.3.2 Integer Data
Prefetch also makes a significant difference for integer data. However, for the
UltraSPARC III family of processors, integer data can be fetched only into the
second-level cache (not the on-chip cache). Example 11.13 shows example code
that streams integer data.
Example 11.13 Streaming Integer Data
#include "timing.h"
#define SIZE 4*1024*1024
#define RPT 100
int array[SIZE];
int main()
{
  int index, count;
  int totalf;
  for (index=0; index<SIZE; index++) {array[index]=0;}
  totalf=0;
  starttime();
  for (count=0; count<RPT; count++)
  {
    for (index=0; index<SIZE; index++) {totalf+=array[index];}
    totalf=totalf*5.7;
  }
  endtime(SIZE*RPT);
  return totalf;
}

Example 11.14 shows the results of compiling this code with and without
prefetch. The code takes 13ns per iteration without prefetch compared to 4ns with
prefetch. Rather surprisingly, the integer code runs faster than the floating-point
code because the latency of the floating-point add and load instructions is four
cycles. In comparison, an integer load takes two cycles, and an integer add only
one cycle. It is the latency of these floating-point instructions that significantly
determines the performance of the loop.
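As a rough sanity check (assuming the 1056MHz UltraSPARC IIICu clock of the system used later in this section, or about 0.95ns per cycle), a loop whose critical path is a chain of dependent four-cycle floating-point adds can complete at best one iteration every 4 * 0.95ns, or about 3.8ns. The measured 5.14ns per iteration for the floating-point loop is therefore dominated by instruction latency rather than by memory bandwidth once prefetch has hidden the cache misses.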
Example 11.14 Streaming Integer Data with and without Prefetch
% cc -xO3 -xtarget=ultra3 -xprefetch=no ex11.13.c
% a.out
Time per iteration 12.90 ns
% cc -xO3 -xtarget=ultra3 -xprefetch ex11.13.c
% a.out
Time per iteration 3.81 ns


11.3.3 Storing Streams
Performance can also be improved with prefetch when storing streams of data, as
shown in Example 11.15.
Example 11.15 Streams of Stored Data
#include "timing.h"
#define SIZE 2*1024*1024
#define RPT 100
double array[SIZE];
int main()
{
  int index, count;
  double totalf;
  for (index=0; index<SIZE; index++) {array[index]=0;}
  totalf=0;
  starttime();
  for (count=0; count<RPT; count++)
  {
    for (index=0; index<SIZE; index++)
    {
      array[index]=totalf;
      totalf+=index;
    }
    totalf=totalf*5.7;
  }
  endtime(SIZE*RPT);
  return (int)totalf;
}

Example 11.16 shows the results of compiling the test code for storing streams
of data. Without prefetch the code takes about 30ns per iteration; with prefetch,
about 10ns per iteration.
Example 11.16 Results of Storing a Stream of Data with and without Prefetch
% cc -xO3 -xtarget=ultra3 -xprefetch=no ex11.15.c
% a.out
Time per iteration 29.30 ns
% cc -xO3 -xtarget=ultra3 -xprefetch ex11.15.c
% a.out
Time per iteration 9.76 ns


11.3.4 Manual Prefetch
Sometimes the compiler is unable to insert prefetches into the code. This may happen when a loop contains if statements or other control flow. It may also happen when the access pattern is apparent to the developer, because of some characteristic of the application, but is not apparent to the compiler; perhaps a linked list has a generally predictable access pattern. In these situations, the developer can insert prefetches manually.
In Fortran, this is done using pragmas inserted into the source code. For C/C++,
it is necessary to include the sun_prefetch.h header file. In Sun Studio 12, manual
prefetch support for x86 was also included in this header. The platform-agnostic
versions of the functions start with sun_ rather than sparc_. Table 11.1 summarizes
the actions of the various prefetch types.

Table 11.1 Manual Prefetch

Construct                                       Description

sparc_prefetch_read_once(address)               Fetch data to be read once. Try to
sun_prefetch_read_once(address)                 avoid polluting the caches with
$PRAGMA SPARC_PREFETCH_READ_ONCE (address)      the data.

sparc_prefetch_read_many(address)               Fetch data to be read multiple
sun_prefetch_read_many(address)                 times. Install the data in the
$PRAGMA SPARC_PREFETCH_READ_MANY (address)      caches.

sparc_prefetch_write_once(address)              Fetch a cache line that will be
sun_prefetch_write_once(address)                written to once.
$PRAGMA SPARC_PREFETCH_WRITE_ONCE (address)

sparc_prefetch_write_many(address)              Fetch a cache line that will be
sun_prefetch_write_many(address)                written to multiple times.
$PRAGMA SPARC_PREFETCH_WRITE_MANY (address)
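For example, here is a minimal sketch of calling the platform-agnostic functions from C. The struct node type and the sum_list() function are hypothetical, introduced purely for illustration; only the header file and the sun_prefetch_read_many() call come from the interface summarized in Table 11.1.

#include <sun_prefetch.h>

struct node {              /* hypothetical linked-list node */
    struct node *next;
    double value;
};

double sum_list(struct node *n)
{
    double total = 0.0;
    while (n != 0)
    {
        if (n->next != 0)
        {
            /* Request the node after the next one ahead of time. SPARC
               prefetch instructions are nonfaulting, so prefetching a
               null or stale address is safe, merely useless. */
            sun_prefetch_read_many(n->next->next);
        }
        total += n->value;
        n = n->next;
    }
    return total;
}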

You need to be careful to ensure that manually inserted prefetch instructions
are actually prefetching the correct address. Consequently, it is often worth checking the disassembly for the section of code containing the manual prefetch call.
For the code shown in Example 11.17, Sun Studio 12 does not insert prefetch
instructions at -O unless the flag -xprefetch_level=2 is also specified.


Example 11.17 Code Where the Compiler Does Not Insert Prefetches
#include "timing.h"
#define SIZE 2*1024*1024
#define RPT 100
double array[SIZE];
int main()
{
  int index, count;
  double totalf;
  for (index=0; index<SIZE; index++) {array[index]=0;}
  totalf=0;
  starttime();
  for (count=0; count<RPT; count++)
  {
    for (index=0; index<SIZE; index++)
    {
      if (array[index]>0) {totalf+=array[index];}
    }
    totalf=totalf*5.7;
  }
  endtime(SIZE*RPT);
  return (int)totalf;
}

Example 11.18 shows analyzer output for the hot part of the code. The time is
being attributed to the floating-point branch instruction, although it is caused by
the load instruction at 0x10ee4.
Example 11.18 Analyzer Output for the Hot Part of the Code
   0.       10ec8:  fcmped    %fcc0, %f2, %f6
## 6.124    10ecc:  fbule,pn  %fcc0, 0x10ed8
   0.040    10ed0:  inc       8, %l7
   0.       10ed4:  faddd     %f8, %f2, %f8
   0.040    10ed8:  inc       %l6
   0.060    10edc:  cmp       %l6, %l0
   0.       10ee0:  ble,a,pt  %icc, 0x10ec8
   0.030    10ee4:  ldd       [%l7], %f2

Having identified which load instruction is missing cache, and the source line
that generates this load instruction, it is relatively easy to add a manual prefetch
to get the data ready for the load. The source code in Example 11.19 has been modified to contain the appropriate manual prefetch code.
Example 11.19 Modified Source Showing Manual Prefetch Insertion
#include "timing.h"
#include <sun_prefetch.h>
#define SIZE 2*1024*1024
#define RPT 100
double array[SIZE];
int main()
{
  int index, count;
  double totalf;
  for (index=0; index<SIZE; index++) {array[index]=0;}
  totalf=0;
  starttime();
  for (count=0; count<RPT; count++)
  {
    for (index=0; index<SIZE; index++)
    {
      sparc_prefetch_read_many(&array[index+16]);
      if (array[index]>0) {totalf+=array[index];}
    }
    totalf=totalf*5.7;
  }
  endtime(SIZE*RPT);
  return (int)totalf;
}

Example 11.20 shows the analyzer output for the modified source code with the
manually inserted prefetch.
Example 11.20 Analyzer Output for the Modified Code
   0.060    10ed4:  ldd       [%l6], %f2
   0.       10ed8:  fcmped    %fcc0, %f2, %f6
## 2.542    10edc:  fbule,pn  %fcc0, 0x10ee8
   0.020    10ee0:  inc       8, %l6
   0.       10ee4:  faddd     %f8, %f2, %f8
   0.060    10ee8:  inc       %l5
   0.       10eec:  inc       8, %l7
   0.100    10ef0:  cmp       %l5, %l0
   0.       10ef4:  ble,pt    %icc, 0x10ed4
   0.       10ef8:  prefetch  [%l7], #n_reads

As mentioned previously, the compiler is able to insert prefetches for this loop
when it is compiled with the options -O and -xprefetch_level=2.
Example 11.21 shows the performance from the two variants of the code under
different levels of prefetch insertion. The original code with no prefetch takes
30ns per iteration; when manual prefetch is added this time is reduced to 13ns
per iteration. However, when the compiler adds prefetch instructions for this
loop, it is able to achieve 10.5ns per iteration. So, given the right options, the compiler is able to do better than a manually inserted prefetch instruction.
Example 11.21 Comparison of Compiler and Manual Prefetch Insertion
% cc -xO3 -xtarget=ultra3 -xprefetch ex11.17.c
% a.out
Time per iteration 29.84 ns
% cc -xO3 -xtarget=ultra3 -xprefetch ex11.19.c
% a.out
Time per iteration 13.11 ns
% cc -xO3 -xtarget=ultra3 -xprefetch -xprefetch_level=2 ex11.17.c
% a.out
Time per iteration 10.50 ns

11.3.5 Latency
Latency is a measure of how long it takes to fetch a single item of data from memory. Code to measure latency can be hard to write, because it has to ensure the following:

- The data is entirely resident in a particular level of cache. The simplest way to achieve this is to make the data set slightly smaller than the level of cache being measured, and to iterate through every element in the data set before repeating the access to any element.

- Each access brings in a different cache line, so that the code is always fetching new data from memory; otherwise, the latency calculated will be some average of the cache miss and cache hit costs.

- The access pattern is written such that the compiler cannot detect a regular stride and add prefetch instructions to reduce the latency. The easiest way to achieve this is to make the loop a pointer-chasing loop; that way, the compiler can gain no insight into the next address except by loading the current one.

- In the presence of hardware stride predictors, the memory access pattern is sufficiently complex that the hardware prefetch unit cannot predict the next location. One way to achieve this is to randomize the addresses used, as sketched below; the degree of complexity needed in the software depends on the sophistication of the hardware prefetch unit.
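As a sketch of that randomization (none of this code is from the original latency test; the shuffle-based construction and the names build_random_chain, LINE, and order are assumptions for illustration), the pointer chain can be built to visit one element per cache line in a random order:

#include <stdlib.h>

#define SIZE (1024*1024*2*2)
#define LINE 16                 /* 4-byte elements per 64-byte cache line */
#define NODES (SIZE/LINE)

static int** array[SIZE];
static int order[NODES];

void build_random_chain(void)
{
  int i;
  for (i=0; i<NODES; i++) {order[i]=i*LINE;}
  /* Fisher-Yates shuffle of the cache-line-spaced indices */
  for (i=NODES-1; i>0; i--)
  {
    int j = rand() % (i+1);
    int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
  }
  /* Link each chosen element to the next one in shuffled order */
  for (i=0; i<NODES-1; i++)
  {
    array[order[i]] = (int**)&array[order[i+1]];
  }
  array[order[NODES-1]] = (int**)&array[order[0]];  /* close the cycle */
}

A chain built this way defeats both compile-time analysis and simple stride predictors, while still touching a new cache line on every step.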


The code shown in Example 11.22 sets up a simple test of memory latency. A
linked list is traversed; each step is a new fetch from memory, and the latency
is measured as the time per iteration of the inner loop. Because the UltraSPARC
IIICu processor on which this test was run does not have a hardware prefetch
unit that fetches data from memory, there is no need to complicate the linked list.
Notice that at the end of the code there is a redundant operation on the pointer,
to ensure that the compiler is unable to optimize the loop away.
Example 11.22 Test of Memory Latency
#include <stdio.h>
#include "timing.h"
#define SIZE 1024*1024*2*2
#define ITERATIONS 1024*1024*2*2
static int** array[SIZE];
void main()
{
  int i;
  int **j;
  for (i=0; i<SIZE-16; i++) {array[i]=(int**)&array[i+16];}
  for (i=0; i<16; i++) {array[SIZE-1-i]=(int**)&array[i];}
  starttime();
  j=array[0];
  for (i=0; i<4*ITERATIONS; i++)
  {
    j=(int **)*j;
  }
  endtime(4*ITERATIONS);
  if (j==0) {printf("Null element\n");}
}

Example 11.23 shows the results of running the program in Example 11.22 on a
two-processor 1056MHz UltraSPARC IIICu system. This indicates that the memory latency is about 150ns.
Example 11.23 Latency Results
% cc -xO3 -xprefetch -xtarget=ultra3 -xmemalign=8s ex11.22.c
% a.out
Time per iteration 155.15 ns

Looking at the program, you can see that the stride pattern is completely
predictable, and having determined this, you can add prefetches the appropriate
distance ahead. In the first instance, the next access is at an offset of 16
elements (i.e., 16*4 = 64 bytes, exactly one cache line) from the current one.

You can modify the code as shown in Example 11.24 to take advantage of this
information.
Example 11.24 Adding Prefetch to the Latency Test
#include <stdio.h>
#include "timing.h"
#include <sun_prefetch.h>
#define SIZE 1024*1024*2*2
#define ITERATIONS 1024*1024*2*2
static int** array[SIZE];
void main()
{
  int i;
  int **j;
  for (i=0; i<SIZE-16; i++) {array[i]=(int**)&array[i+16];}
  for (i=0; i<16; i++) {array[SIZE-1-i]=(int**)&array[i];}
  starttime();
  j=array[0];
  for (i=0; i<4*ITERATIONS; i++)
  {
    j=(int **)*j;
    sparc_prefetch_read_many(j+16);
  }
  endtime(4*ITERATIONS);
  if (j==0) {printf("Null element\n");}
}

Now when the code is compiled, the latency is noticeably reduced, as shown in
Example 11.25.
Example 11.25 Running the Latency Code with Prefetch Inserted
% cc -xO3 -xprefetch -xtarget=ultra3 -xmemalign=8s ex11.24.c
% a.out
Time per iteration 98.08 ns

Inserting prefetch for the next cache line does improve performance, but it does
not achieve the best possible performance, because at any one time at most two
cache line accesses will be in flight from memory. Fetching a cache line that is
farther ahead should improve performance, as more prefetches will be active at
the same time, and each prefetch will have more time to retrieve the data before
it is needed. The data presented in Table 11.2 shows how performance changes as
the prefetch-ahead distance is varied.


Table 11.2 Memory Latency as a Function of the Number of Cache Lines
Prefetched Ahead

Cache Lines Prefetched Ahead    Measured Memory Latency
None                            155ns
1                                98ns
2                                69ns
3                                55ns
4                                47ns
5                                43ns
6                                39ns
7                                38ns
8                                44ns
9                                52ns
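The rows of Table 11.2 can be reproduced by parameterizing the prefetch distance in Example 11.24. In the following sketch, the AHEAD macro is an addition for illustration and is not part of the original code; setting it to 2, for instance, corresponds to the 69ns row.

#include <stdio.h>
#include "timing.h"
#include <sun_prefetch.h>
#define SIZE 1024*1024*2*2
#define ITERATIONS 1024*1024*2*2
#define AHEAD 2   /* number of cache lines to prefetch ahead */
static int** array[SIZE];
void main()
{
  int i;
  int **j;
  for (i=0; i<SIZE-16; i++) {array[i]=(int**)&array[i+16];}
  for (i=0; i<16; i++) {array[SIZE-1-i]=(int**)&array[i];}
  starttime();
  j=array[0];
  for (i=0; i<4*ITERATIONS; i++)
  {
    j=(int **)*j;
    /* 16 elements span one 64-byte cache line, so AHEAD*16 elements
       reaches AHEAD cache lines beyond the current position. */
    sparc_prefetch_read_many(j + AHEAD*16);
  }
  endtime(4*ITERATIONS);
  if (j==0) {printf("Null element\n");}
}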

As you might expect, the big gains in performance come from emitting
prefetches for one or two cache lines ahead; further prefetches give diminishing
returns. One way to look at these results is that the degree of memory-level
parallelism (the amount of data being fetched from memory simultaneously)
increases as more cache lines are prefetched in parallel. In effect, latency-bound
code is being converted into bandwidth-bound code.
You can confirm this by calculating the bandwidth both initially and when
prefetches are emitted for two cache lines ahead. Initially, the code fetches one
64-byte cache line every 155ns, a bandwidth of 64 bytes * (1,000,000,000ns /
155ns) = 394MB per second. When the code prefetches two cache lines ahead, the
bandwidth is 64 bytes * (1,000,000,000ns / 69ns) = 885MB per second, more than
double the utilized bandwidth.
The other situation to consider is when it is possible to determine that some
degree of pointer chasing is going on, and the access pattern cannot be determined
at compile time but is predictable at runtime. The code shown in Example 11.26
demonstrates one way to issue speculative prefetches at runtime. In this
simplified case, the code is the latency test program, and consequently the
memory access pattern is very predictable. However, rather than making a static
assumption about where the next memory access will be, the code uses a simple
stride predictor, which assumes that the accesses are regularly spaced. The
predictor attempts to prefetch two accesses in advance.


Example 11.26 Speculative Stride Prediction
#include <stdio.h>
#include "timing.h"
#include <sun_prefetch.h>
#define SIZE 1024*1024*2*2
#define ITERATIONS 1024*1024*2*2
static int** array[SIZE];
void main()
{
  int i;
  int **j;
  int **old;
  for (i=0; i<SIZE-16; i++) {array[i]=(int**)&array[i+16];}
  for (i=0; i<16; i++) {array[SIZE-1-i]=(int**)&array[i];}
  starttime();
  j=array[0];
  old=j;
  for (i=0; i<4*ITERATIONS; i++)
  {
    j=(int **)*j;
    sparc_prefetch_read_many(j + 2*(j-old));
    old=j;
  }
  endtime(4*ITERATIONS);
  if (j==0) {printf("Null element\n");}
}

It is worth observing that one successful prefetch saves significant memory
access time, so if the code that emits the prefetch adds only a few cycles, even a
moderately successful prediction scheme can improve performance.
Consider a system where a successful prefetch saves 150 cycles and the code
to emit a prefetch costs ten cycles. After N prefetches, the scheme has cost 10*N
cycles, but it has successfully prefetched R cache lines, saving 150*R cycles.
When 150*R > 10*N, the prefetch scheme has improved performance; that is, the
success rate (R/N) only has to exceed 1/15 (about 7%).
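To put numbers on this, using the same assumed costs: a predictor that is correct even one time in ten spends 10*N cycles to save 150*(N/10) = 15*N cycles, for a net gain of 5*N cycles over emitting no prefetches at all.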
Of course, there are downsides to speculatively emitting prefetches:

- Emitting prefetches takes up instruction slots that could be usefully used by the rest of the code. As pointed out earlier, this cost can be factored into the decision of whether to emit the prefetches.

- The speculative prefetches may knock useful data out of the caches.

- The speculative prefetches may use system bandwidth that could be better used by another thread running on the processor.