This blog is based on a paper from Georgia Tech called SLAP.
The code for this blog can be found here.
Contents
- Pipelining Background
- Load Address Prediction
- Observing Load Address Prediction on M2 Processors
- Conclusion
- References
Pipelining Background
CPU pipelining is a technique that modern CPUs employ to achieve instruction-level parallelism. Pipelining means that each instruction is broken down into a series of smaller steps. Many different instructions can be in the pipeline at the same time, each at a different stage. In this way, CPUs overlap the execution of many instructions at once. Modern CPU pipelines have many stages, including instruction fetch, instruction decode, register allocation and renaming, uop dispatch, execution, and retirement.
Pipelining introduces some problems. The first problem is branch instructions in the code. A conditional branch causes the program to go one of two ways: either the branch is taken or it is not. Depending on the code, it might take hundreds of cycles to resolve which direction the branch goes (the branch may depend on a load from memory that misses the CPU cache). During this time, the CPU doesn't know which code path to feed into the pipeline.
CPUs try to get around this by executing instructions speculatively. Speculative execution means that the CPU will try to predict which path the branch instruction will take. Instructions that are speculatively executed are fetched, decoded, and executed, but are only committed (retired) when the CPU confirms that it predicted correctly. If it mispredicts, the CPU flushes the pipeline and discards the work that it did. In this way, CPUs try to overcome branches: they guess which path will be taken and speculatively execute that path.
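For example, in a loop like the following the branch direction depends on data loaded from memory, so the CPU cannot know it ahead of time (the names here are purely illustrative):
// Illustrative only: the branch direction depends on data[i]. If data[i]
// misses the cache, the CPU predicts taken/not-taken and keeps executing
// speculatively instead of stalling.
long sum_above_threshold(const int* data, int n, int threshold) {
    long sum = 0;
    for (int i = 0; i < n; ++i) {
        if (data[i] > threshold) {   // direction unknown until data[i] arrives
            sum += data[i];
        }
    }
    return sum;
}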
Another challenge that CPU pipelining causes comes from data dependencies in code. Consider the following x86 code:
mov rax, [rdi]
mov rdx, [rax]
The second mov instruction depends on the first: it cannot issue until rax has been produced by the preceding mov. This is called a Read After Write (RAW) dependency. RAW dependencies are also called "true dependencies" because data actually flows from the write to the read. This kind of dependency typically forces the dependent read to stall until the write completes (the first mov in this case), which limits the throughput of the CPU.
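For reference, the fragment above roughly corresponds to a double dereference in C++ (the function name is made up, and a compiler may pick slightly different registers):
#include <cstdint>

// On x86-64 (SysV ABI) 'pp' arrives in rdi, so a compiler may emit loads
// much like the two mov instructions above for this function.
uint64_t load_chain(uint64_t** pp) {
    uint64_t* p = *pp;   // first load: mov rax, [rdi]
    return *p;           // second load: cannot start until p is available
}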
Load Address Prediction
A recent Apple patent [1] suggests a new technique for overcoming these dependencies. The patent states that "architectural registers repeatedly used as a destination register and subsequently as a source register cause serialization of instruction execution for associated source code segments."
This is much like the x86 code fragment above. In that x86 code, rax is used as a destination register and then
as a source register. As a result, the second load cannot begin until the first load has completed.
Apple's patent proposes a CPU with a prediction table which executes load instructions "with a retrieved predicted address from the prediction table." In other words, the CPU attempts to predict where a load will read from and speculates based on that prediction.
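The patent does not spell out how the table is organized. One common way to build such a predictor is a per-load stride table indexed by the load's program counter. The following is a minimal software model of that idea, purely as a sketch; the structure, field names, and confidence scheme are my assumptions, not details taken from the patent:
#include <cstdint>
#include <unordered_map>

// Conceptual model of a load address predictor: for each load instruction
// (keyed by its PC), remember the last address it accessed and the stride
// between consecutive accesses.
struct PredictorEntry {
    uint64_t last_addr = 0;
    int64_t stride = 0;
    int confidence = 0;   // raised when the stride repeats, reset otherwise
};

struct LoadAddressPredictor {
    std::unordered_map<uint64_t, PredictorEntry> table;

    // Training: called when a load's real address becomes known.
    void train(uint64_t pc, uint64_t addr) {
        PredictorEntry& e = table[pc];
        int64_t observed = (int64_t)(addr - e.last_addr);
        if (observed == e.stride) {
            if (e.confidence < 3) ++e.confidence;
        } else {
            e.stride = observed;
            e.confidence = 0;
        }
        e.last_addr = addr;
    }

    // Prediction: if the entry is confident, guess the next address so the
    // load could be issued before its source register is actually ready.
    bool predict(uint64_t pc, uint64_t& predicted_addr) const {
        auto it = table.find(pc);
        if (it == table.end() || it->second.confidence < 2) return false;
        predicted_addr = it->second.last_addr + it->second.stride;
        return true;
    }
};
With a predictor like this, a strided access pattern quickly trains to a stable stride, while a random pattern never builds confidence.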
Observing Load Address Prediction on M2 Processors
Consider the following C++ code:
uint64_t time_chase(volatile int* arr, int iters) {
    volatile int dep = 0;
    // Dry run: bring the array into the L1 cache before timing.
    for (int i = 0; i < iters; ++i) {
        dep = arr[dep];
    }
    dep = 0;
    uint64_t start = get_time();
    // Timed run: each load depends on the value produced by the previous load.
    for (int i = 0; i < iters; ++i) {
        dep = arr[dep];
    }
    uint64_t end = get_time();
    return end - start;
}
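The get_time helper is not shown; any monotonic, high-resolution clock will do. A possible implementation using std::chrono (an assumption about the helper, not necessarily what the original code uses):
#include <chrono>
#include <cstdint>

// Monotonic timestamp in nanoseconds.
uint64_t get_time() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               std::chrono::steady_clock::now().time_since_epoch())
        .count();
}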
The time_chase code runs a loop of RAW dependencies. Each iteration performs a load, dep = arr[dep], that depends on the value written by the previous iteration. The first loop is a warm-up to ensure that the array is resident in the L1 cache of the CPU core, and the array is kept small enough to fit entirely in L1. Without the CPU predicting addresses or values, you would expect each of these L1 loads to execute serially, one after the other. Each load would wait for the previous load to produce dep, so the loop's runtime should scale as iterations × load latency.
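For a sense of scale: assuming a hypothetical 4-cycle L1 load-to-use latency, 100 million fully serialized iterations would cost about 400 million cycles, or roughly 115 ms at a nominal 3.5 GHz.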
On M2 processors, we actually observe something different. The runtime of this program depends on
the contents of the array arr and on which CPU core the program runs on. I filled the array in
two different ways:
- strided — each element points s positions ahead, e.g. arr[i] = i + s, so dep walks the array as 0, s, 2s, 3s, …
- random — arr is a random permutation of 0..N-1
Both arrays fit in L1 and are warmed before measurement, so every load is an L1 hit. The only thing that changes between the two runs is the pattern of addresses visited. If the CPU is doing nothing special, the two should perform identically.
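As a sketch, the two fills might look like this (I'm assuming the strided fill wraps around the end of the array so the chase stays in bounds; s and the RNG seed are placeholders):
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Strided fill: element i points s positions ahead, wrapping at the end,
// so dep visits 0, s, 2s, ... and never leaves the array.
void fill_strided(std::vector<int>& arr, int s) {
    const int n = (int)arr.size();
    for (int i = 0; i < n; ++i) {
        arr[i] = (i + s) % n;
    }
}

// Random fill: a random permutation of 0..N-1, so each element points to an
// unpredictable next index.
void fill_random(std::vector<int>& arr) {
    std::iota(arr.begin(), arr.end(), 0);
    std::mt19937 rng(42);
    std::shuffle(arr.begin(), arr.end(), rng);
}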
They do not perform identically. On the P-core, strided runs roughly twice as fast as
random.
On the E-core, strided and random perform almost identically.
The P-core is doing something that depends on the address stream. When the addresses form a predictable pattern, successive loads can issue in parallel instead of serializing on the prior load's result. This is the behavior the Apple patent describes. When the addresses are unpredictable, the predictor has nothing to latch onto and the loop falls back to serialized execution. The E-core, by contrast, appears to have no such predictor: its runtime is governed by load-use latency regardless of how predictable the address pattern is.
This is already suggestive, but it does not yet distinguish address prediction from value prediction. In the strided case the loaded values are highly regular as well as the addresses, so either mechanism could explain the speedup.
To distinguish the two, we design a different experiment. In this experiment, we populate
the array with random values in the range [stride, array_size). We then execute the following code:
uint64_t time_sa_rv(volatile int* arr, int iters) {
    // 'stride' is an int constant defined elsewhere; the cast to int drops
    // volatile from arr[dep] so std::min's template deduction succeeds.
    volatile int dep = 0;
    // Dry run: bring the array into the L1 cache before timing.
    for (int i = 0; i < iters; ++i) {
        dep += std::min(stride, (int)arr[dep]);
    }
    dep = 0;
    uint64_t start = get_time();
    // Timed run: every element is at least stride, so the min always picks
    // stride and the addresses advance by exactly stride each iteration.
    for (int i = 0; i < iters; ++i) {
        dep += std::min(stride, (int)arr[dep]);
    }
    uint64_t end = get_time();
    return end - start;
}
In this code we again do a dry run first to make sure the array is in the L1 cache. This experiment generates exactly the same address stream as the strided experiment: because no value in the array is smaller than stride, std::min(stride, arr[dep]) always evaluates to stride, so dep advances by exactly stride each iteration. And still, in this experiment, we see a significant speedup over the random case!
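For reference, the fill for this experiment might look like the following sketch (the exact bounds handling is an assumption):
#include <random>
#include <vector>

// Every element is at least stride, so std::min(stride, arr[dep]) always
// evaluates to stride: the address sequence is exactly the strided one, but
// the loaded values themselves are unpredictable.
void fill_random_values(std::vector<int>& arr, int stride) {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(stride, (int)arr.size() - 1);
    for (int& x : arr) {
        x = dist(rng);
    }
}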
On the P-core, sa_rv is much faster than random even though the values loaded at each step
are random. The only thing sa_rv and strided have in common is the sequence of
addresses visited. Therefore, that is what the predictor is latching onto.
On the E-core the two patterns still look pretty much the same, consistent with there being no
address predictor on the E-core at all.
Conclusion
In conclusion, we have observed that when chasing a predictable stream of addresses, the Apple M2's P-cores execute dependent loads faster than would be expected if every load were serialized on the previous one. This suggests that these cores are performing load address prediction. The E-cores don't seem to be doing the same prediction.