There are two primary ways to search a sorted array for a given key: binary search and linear search.
You probably know that binary search is an O(log n) algorithm and linear search is an O(n) algorithm,
and are therefore aware that for large enough values of n, binary search will always be faster.
However, for surprisingly large values of n, linear search can be faster than binary search.
The goal of this post is to answer when and why linear search can be faster than binary search.
The full code for this post can be found here.
The Code
In this post we analyze the following implementations of binary search and linear search (compiled with -O3 and -march=native):
#include <cstddef>
#include <vector>

// Returns the index of the first element >= target (the same contract as
// std::lower_bound), or vec.size() if no such element exists.
size_t binary_search(const std::vector<long>& vec, long target) {
    size_t low = 0, high = vec.size();
    while (low < high) {
        size_t mid = low + (high - low) / 2;  // written this way to avoid overflow
        if (vec[mid] < target) {
            low = mid + 1;
        } else {
            high = mid;
        }
    }
    return low;
}

// Same contract as binary_search, but scans the array front to back.
size_t linear_search(const std::vector<long>& vec, long target) {
    size_t idx = 0;
    for (; idx < vec.size(); ++idx) {
        if (vec[idx] >= target) {
            break;
        }
    }
    return idx;
}
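As a quick sanity check (my own sketch, not code from the post's repository), both functions should agree with std::lower_bound:

#include <algorithm>
#include <cassert>

int main() {
    std::vector<long> vec = {1, 3, 5, 7, 9, 11};
    for (long target : {0L, 4L, 7L, 12L}) {
        // std::lower_bound returns an iterator to the first element >= target.
        size_t expected = std::lower_bound(vec.begin(), vec.end(), target) - vec.begin();
        assert(binary_search(vec, target) == expected);
        assert(linear_search(vec, target) == expected);
    }
    return 0;
}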
Measurements
Using Google Benchmark, I measured the performance of linear search and binary search across a range of array sizes.
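The harness itself isn't reproduced in the post; a minimal sketch of what it plausibly looks like (the names and parameters here are my assumptions, and the exact code is in the linked repository) is:

#include <benchmark/benchmark.h>
#include <numeric>
#include <random>
#include <vector>

// Assumed benchmark harness: search a sorted array of n longs for random keys.
static void BM_BinarySearch(benchmark::State& state) {
    const size_t n = state.range(0);
    std::vector<long> vec(n);
    std::iota(vec.begin(), vec.end(), 0L);  // sorted values 0..n-1
    std::mt19937_64 rng(42);
    std::uniform_int_distribution<long> dist(0, static_cast<long>(n) - 1);
    for (auto _ : state) {
        // DoNotOptimize keeps the compiler from discarding the result.
        benchmark::DoNotOptimize(binary_search(vec, dist(rng)));
    }
}
BENCHMARK(BM_BinarySearch)->RangeMultiplier(2)->Range(8, 8 << 10);

BENCHMARK_MAIN();

Per-iteration hardware counters such as cycles can be collected with Google Benchmark's perf-counter support (e.g. --benchmark_perf_counters=CYCLES, which requires building against libpfm). The following graph plots cycle counts on an AMD Ryzen 9 9900X (a Zen 5 CPU):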
As expected, the runtime of binary search grows logarithmically and the runtime of linear search grows linearly. However, for small arrays (fewer than 136 elements), linear search is actually faster than binary search! This is despite the fact that binary search always executes fewer CPU instructions, as the following graph demonstrates.
So what is going on here? If linear search is faster than binary search for small n, but always executes more instructions, then linear search must be achieving a higher IPC (Instructions Per Cycle).
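To spell out that step: cycles, instructions, and IPC are tied together by the identity

cycles = instructions / IPC

so a large enough IPC advantage more than compensates for the extra instructions executed. The following chart plots the measured IPC: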
So the question is: why does linear search achieve so much higher IPC than binary search? The answer is branch prediction. To keep the pipeline full (rather than stalling on every branch), modern CPUs execute instructions speculatively. In particular, they attempt to predict whether or not a particular branch will be taken (on x86 CPUs, branch instructions are the various jump instructions). In linear search, the branch condition is false on every iteration until the very last one, when it becomes true, so the CPU can predict it almost perfectly. In binary search, assuming the search keys are randomly selected, each branch is taken with probability 1/2, and the CPU therefore cannot predict at all whether the branch will be taken.
We can measure this behavior very precisely: using PMU events, we can count branch mispredictions directly. In linear search there is always exactly 1 misprediction. In binary search, there are about log2(n)/2 mispredictions. That's because binary search takes about log2(n) iterations to find the search key, and each branch is mispredicted with probability 1/2.
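If you want to reproduce these counts yourself, the sketch below shows one way to read the hardware branch-miss counter on Linux via the perf_event_open system call. This is my own minimal illustration, not necessarily how the post's measurements were taken:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <vector>

// Counts hardware branch mispredictions around a single search call.
// Linux-only; returns -1 if the counter cannot be opened or read.
int64_t branch_misses(const std::vector<long>& vec, long target,
                      size_t (*search)(const std::vector<long>&, long)) {
    perf_event_attr attr{};  // zero-initialize all fields
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_BRANCH_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    const int fd = static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
    if (fd < 0) return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    volatile size_t result = search(vec, target);  // the code under measurement
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    (void)result;

    int64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) != static_cast<ssize_t>(sizeof(misses))) misses = -1;
    close(fd);
    return misses;
}

The chart below illustrates the measured misprediction counts.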
Is This CPU Specific?
A natural question now is: does this hold for other CPU models? The following chart plots cycles vs. array size for a Zen 2 CPU, where we observe very similar behavior.
But what about on a MacBook Pro? On my M2 MacBook Pro, binary search was always faster than linear search! The following chart illustrates this (here we plot wall-clock time instead of cycles, because the perf events API is Linux-specific).
Likewise, on my Raspberry Pi 4, binary search is always faster. So the takeaway is that you need to measure it on whatever platform you care about!
Is This Workload Dependent?
In the previous measurements, I used randomly selected search keys. However, it's very possible that in a real-life workload the search keys would not be random; instead, we might repeatedly search the same array for values in the same range.
In that case, the branch predictor could do a much better job of predicting the branch in the binary search.
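To make the distinction concrete, here is a sketch of the two workloads, reusing the hypothetical harness from the Measurements section; only the choice of key per iteration differs:

// Random keys: the comparison inside binary_search is unpredictable.
static void BM_BinarySearchRandomKeys(benchmark::State& state) {
    std::vector<long> vec(state.range(0));
    std::iota(vec.begin(), vec.end(), 0L);
    std::mt19937_64 rng(42);
    std::uniform_int_distribution<long> dist(0, static_cast<long>(vec.size()) - 1);
    for (auto _ : state)
        benchmark::DoNotOptimize(binary_search(vec, dist(rng)));
}

// Fixed key: after a few iterations the branch predictor has learned the
// outcome of every branch along the (always identical) search path.
static void BM_BinarySearchFixedKey(benchmark::State& state) {
    std::vector<long> vec(state.range(0));
    std::iota(vec.begin(), vec.end(), 0L);
    const long key = static_cast<long>(vec.size()) / 2;
    for (auto _ : state)
        benchmark::DoNotOptimize(binary_search(vec, key));
}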
The plot below compares cycle counts for random binary search, predictable binary search, and linear search. The predictable case is taken to the extreme: we repeatedly search for the same value. (Note also that at these array sizes the entire array fits in the L1 cache, so none of the cases incurs cache misses.)
The predictable case is by far the fastest! That is because the CPU's branch predictor is now able to predict whether or not each branch in the binary search will be taken.
So the picture we have is complicated. Whether or not linear search is faster than binary search on small arrays depends on your CPU and also on your access pattern. The takeaway is that you need to measure the workload you care about on the CPU you care about!