There are two primary ways to search a sorted array for a given key: binary search and linear search.
You probably know that binary search is an O(logn) algorithm and linear search is an O(n) algorithm,
and therefore for large enough values of n, binary search will be always faster.
However, for small enough values of n, linear search can be faster than binary search.
The goal of this post is to answer when and why linear search can be faster than binary search.
The full code for this post can be found here.
Contents
- The Code
- Measurements
- Is This CPU specific?
- Is this Workload Dependent?
- What About Branch-Free Binary Search?
- Conclusion
The Code
In this post we analyze the following implementations of binary search and linear search (compiled with O3 and march=native on each platform). For completeness,
we will also compare these implementations with std::lower_bound.
size_t binary_search(const std::vector<long>& vec, long target) {
size_t low = 0, high = vec.size();
while (low < high) {
// Interestingly (low+high)/2 is measurably faster
// for small arrays than low + (high - low)/2
size_t mid = (low+high)/2;
if (vec[mid] < target) {
low = mid + 1;
} else {
high = mid;
}
}
return low;
}
size_t linear_search(const std::vector<long>& vec, long target) {
size_t idx = 0;
for (; idx < vec.size(); ++idx) {
if (vec[idx] >= target) {
break;
}
}
return idx;
}
Measurements
Using Google Benchmark, I measured the performance of the linear search and binary search on different array sizes. The following graph plots cycle counts on an AMD Ryzen 9 9900X (a Zen 5 CPU):
As expected, the runtime of binary search grows logarithmically and the runtime of linear search grows
linearly. However, for small arrays (fewer than 136 elements), linear search is actually faster than binary search! std::lower_bound
performs better than our binary search implementation but still performs worse than linear search for small enough n.
Despite this, binary search always executes fewer CPU instructions, as the following graph
demonstrates.
So what is going on here? If linear search is faster than binary search for small n, but
always executes more instructions, then linear search must execute more Instructions Per Cycle (IPC),
which the following chart plots:
So the question is, why does linear search achieve so much higher IPC than binary search? The answer is branch prediction. In order to give themselves work to do (rather than stalling the pipeline on every branch), modern CPUs execute instructions speculatively. In particular, they attempt to predict whether or not a particular branch will be taken. In a linear search, the branch is always false until the very last iteration, when it becomes true. Therefore, the CPU can easily predict the branch in linear search. In binary search, assuming search keys are randomly selected, the branch is always taken with probability 1/2 and the CPU therefore cannot predict at all whether the branch will be taken.
We can measure this behavior very precisely. Using PMU events we can count branch mispredictions. In linear search there is always exactly 1 misprediction. In binary search, there are about mispredictions. That's because binary search takes iterations to find the search key, and each branch is mispredicted with probability 1/2. The chart below illustrates this.
Is This CPU specific?
A natural question now is, does this hold for other CPU models? The following chart plots cycles vs array size for a Zen 2 CPU. For this CPU we observe a very similar behavior.
But how about on a MacBook Pro? On my M2 MacBook Pro, binary search was always faster than linear search! The following chart illustrates this (here we plot real-time instead of cycles because the perf events API is Linux-specific).
Likewise, on my Raspberry Pi 4, binary search is always faster. So the takeaway, is that you need to measure it on whatever platform you care about!
Is this Workload Dependent?
In our previous measurements, I measured the runtime of binary search for randomly selected search keys. However, it's very possible that in a real life workload the search keys would not be randomly selected, and instead we would be repeatedly searching the same array for the values in the same range.
In this case, it's very possible that branch predictor could do a much better job predicting the branch in the binary search.
The plot below compares cycles for random binary search, predictable binary search, and linear search. The predictable binary search case is taken to the extreme, where we repeatedly search for the same value (also note that for array sizes so small, the entire array fits in the L1 cache, so in each case there are no cache misses).
The predictable case is by far the fastest! That is because now, the CPU branch predictor is able to predict whether or not the branch will be taken.
What About Branch-free Binary Search?
If branch mispredictions are the bottleneck, what if we could eliminate branches entirely?
CPUs support conditional move instructions (cmov on x86) that avoid
branching altogether. This allows us to implement a branch-free binary search:
size_t branchless_binary_search(const std::vector<long>& vec, long target) {
if (vec.empty()) {
return 0;
}
const long* base = vec.data();
size_t length = vec.size();
while (length > 1) {
size_t half = length >> 1;
base = base[half] >= target ? base : base + half;
length -= half;
}
return (base - vec.data()) + (*base < target);
}
In this version of binary search, the compiler generates a cmov instruction instead of a jump instruction (the branch instruction).
As a result, this code results in no branches at all. We see below, that this version is the fastest of all of them with random search keys!
But if we have removed branches, does this workload still benefit from a predictable access pattern? We see below, that it does not. The branchless binary search performs the exact same regardless of the access pattern (it's a little hard to tell because the lines are on top of each other). In this case, a predictable access pattern on binary search with branches is still faster than the branch-free version.
But why then is the branch-free version slower than predictable binary search with branches? That is because the branch-free version has a data dependency from each loop iteration to the next. A given loop iteration of the branch-free version, cannot continue until the previous computations of base and length are fully available.
This is in contrast with the version with branches. That version has a control dependency, which if guessed correctly, does not stall the CPU at all. For that reason, a binary search with branches with a predictable workload outperforms a branch-free version.
This is an interesting difference between data hazards and control hazards, which enables making a tradeoff. A data hazard does not allow for speculation. The CPU will always stall on data hazards. A control hazard allows a CPU to speculate across the branch, at the risk of mispeculating. Therefore, if you have a transform a data dependency into a predictable branch, that is an opportunity for performance improvement. Likewise if you have an unpredictable branch, removing that branch is also an opportunity for performance improvement.
Conclusion
The picture we have is complicated. We've reached a few conclusions:
- For small arrays, linear search can be faster than a branch-y binary search.
- A branch-free binary search will always outperform a linear search.
- A branch-y binary search with a predictable access pattern may outperform a branch-free binary search.
- These findings vary depending on the CPU model.
So the takeaway is to benchmark the workload you're interested in on the CPUs you care about!