When Is Linear Search Faster Than Binary Search?

There are two primary ways to search a sorted array for a given key: binary search and linear search. You probably know that binary search is an O(logn) algorithm and linear search is an O(n) algorithm, and therefore for large enough values of n, binary search will be always faster. However, for small enough values of n, linear search can be faster than binary search. The goal of this post is to answer when and why linear search can be faster than binary search.

The full code for this post can be found here.

Contents

The Code

In this post we analyze the following implementations of binary search and linear search (compiled with O3 and march=native on each platform). For completeness, we will also compare these implementations with std::lower_bound.


size_t binary_search(const std::vector<long>& vec, long target) {
    size_t low = 0, high = vec.size();

    while (low < high) {
        // Interestingly (low+high)/2 is measurably faster
        // for small arrays than low + (high - low)/2
        size_t mid = (low+high)/2;
        if (vec[mid] < target) {
            low = mid + 1;
        } else {
            high = mid;
        }
    }
    return low;
}

size_t linear_search(const std::vector<long>& vec, long target) {
    size_t idx = 0;
    for (; idx < vec.size(); ++idx) {
        if (vec[idx] >= target) {
            break;
        }
    }
    return idx;
}

Measurements

Using Google Benchmark, I measured the performance of the linear search and binary search on different array sizes. The following graph plots cycle counts on an AMD Ryzen 9 9900X (a Zen 5 CPU):

Binary Search
std::lower_bound
Linear Search
0100200300400500600700800050100150200250300350400450500550Number of ElementsCycles
Binary Search Performance on an AMD Zen 5 CPU.

As expected, the runtime of binary search grows logarithmically and the runtime of linear search grows linearly. However, for small arrays (fewer than 136 elements), linear search is actually faster than binary search! std::lower_bound performs better than our binary search implementation but still performs worse than linear search for small enough n. Despite this, binary search always executes fewer CPU instructions, as the following graph demonstrates.

Binary Search
std::lower_bound
Linear Search
010020030040050060070080002004006008001000120014001600180020002200Number of ElementsInstructions
Instruction counts on different array sizes on an AMD Zen 5 CPU.

So what is going on here? If linear search is faster than binary search for small n, but always executes more instructions, then linear search must execute more Instructions Per Cycle (IPC), which the following chart plots:

Binary Search
std::lower_bound
Linear Search
010020030040050060070080000.511.522.533.544.5Number of ElementsIPC
Instructions per cycle on different array sizes on an AMD Zen 5 CPU.

So the question is, why does linear search achieve so much higher IPC than binary search? The answer is branch prediction. In order to give themselves work to do (rather than stalling the pipeline on every branch), modern CPUs execute instructions speculatively. In particular, they attempt to predict whether or not a particular branch will be taken. In a linear search, the branch is always false until the very last iteration, when it becomes true. Therefore, the CPU can easily predict the branch in linear search. In binary search, assuming search keys are randomly selected, the branch is always taken with probability 1/2 and the CPU therefore cannot predict at all whether the branch will be taken.

We can measure this behavior very precisely. Using PMU events we can count branch mispredictions. In linear search there is always exactly 1 misprediction. In binary search, there are about 12log2n\frac{1}{2}\log_2n mispredictions. That's because binary search takes log2(n)\log_2(n) iterations to find the search key, and each branch is mispredicted with probability 1/2. The chart below illustrates this.

Binary Search
std::lower_bound
Linear Search
010020030040050060070080000.511.522.533.544.55Number of ElementsBranch Prediction Misses
Branch prediction misses on different array sizes on an AMD Zen 5 CPU.

Is This CPU specific?

A natural question now is, does this hold for other CPU models? The following chart plots cycles vs array size for a Zen 2 CPU. For this CPU we observe a very similar behavior.

Binary Search
std::lower_bound
Linear Search
0100200300400500600700800050100150200250300350400450500Number of ElementsCycles
Cycle counts on different array sizes on an AMD Zen 2 CPU.

But how about on a MacBook Pro? On my M2 MacBook Pro, binary search was always faster than linear search! The following chart illustrates this (here we plot real-time instead of cycles because the perf events API is Linux-specific).

Binary Search
std::lower_bound
Linear Search
0100200300400500600700800020406080100120140160180Number of ElementsReal Time (ns)
Real time different array sizes on a MacBook Pro.

Likewise, on my Raspberry Pi 4, binary search is always faster. So the takeaway, is that you need to measure it on whatever platform you care about!

Binary Search
std::lower_bound
Linear Search
01002003004005006007008000200400600800100012001400Number of ElementsCycles
Cycle counts on different array sizes on a Raspberry Pi 4.

Is this Workload Dependent?

In our previous measurements, I measured the runtime of binary search for randomly selected search keys. However, it's very possible that in a real life workload the search keys would not be randomly selected, and instead we would be repeatedly searching the same array for the values in the same range.

In this case, it's very possible that branch predictor could do a much better job predicting the branch in the binary search.

The plot below compares cycles for random binary search, predictable binary search, and linear search. The predictable binary search case is taken to the extreme, where we repeatedly search for the same value (also note that for array sizes so small, the entire array fits in the L1 cache, so in each case there are no cache misses).

Random Binary Search
std::lower_bound
Predictable Binary Search
Linear Search
0100200300400500600700800050100150200250300350400450500550Number of ElementsCycles
Cycles vs. input size on an AMD Zen 5 CPU.

The predictable case is by far the fastest! That is because now, the CPU branch predictor is able to predict whether or not the branch will be taken.

If branch mispredictions are the bottleneck, what if we could eliminate branches entirely? CPUs support conditional move instructions (cmov on x86) that avoid branching altogether. This allows us to implement a branch-free binary search:


size_t branchless_binary_search(const std::vector<long>& vec, long target) {
    if (vec.empty()) {
        return 0;
    }

    const long* base = vec.data();
    size_t length = vec.size();

    while (length > 1) {
        size_t half = length >> 1;
        base = base[half] >= target ? base : base + half;
        length -= half;
    }
    return (base - vec.data()) + (*base < target);
}

In this version of binary search, the compiler generates a cmov instruction instead of a jump instruction (the branch instruction). As a result, this code results in no branches at all. We see below, that this version is the fastest of all of them with random search keys!

Binary Search
Branchless Binary Search
std::lower_bound
Linear Search
0100200300400500600700800050100150200250300350400450500550Number of ElementsCycles
Binary Search performance on an AMD Zen 5 CPU.

But if we have removed branches, does this workload still benefit from a predictable access pattern? We see below, that it does not. The branchless binary search performs the exact same regardless of the access pattern (it's a little hard to tell because the lines are on top of each other). In this case, a predictable access pattern on binary search with branches is still faster than the branch-free version.

Binary Search Predictable
Branchless Random Binary Search
Branchless Predictable Binary Search
010020030040050060070080005101520253035Number of ElementsCycles
Binary Search performance on an AMD Zen 5 CPU.

But why then is the branch-free version slower than predictable binary search with branches? That is because the branch-free version has a data dependency from each loop iteration to the next. A given loop iteration of the branch-free version, cannot continue until the previous computations of base and length are fully available.

This is in contrast with the version with branches. That version has a control dependency, which if guessed correctly, does not stall the CPU at all. For that reason, a binary search with branches with a predictable workload outperforms a branch-free version.

This is an interesting difference between data hazards and control hazards, which enables making a tradeoff. A data hazard does not allow for speculation. The CPU will always stall on data hazards. A control hazard allows a CPU to speculate across the branch, at the risk of mispeculating. Therefore, if you have a transform a data dependency into a predictable branch, that is an opportunity for performance improvement. Likewise if you have an unpredictable branch, removing that branch is also an opportunity for performance improvement.

Conclusion

The picture we have is complicated. We've reached a few conclusions:

  • For small arrays, linear search can be faster than a branch-y binary search.
  • A branch-free binary search will always outperform a linear search.
  • A branch-y binary search with a predictable access pattern may outperform a branch-free binary search.
  • These findings vary depending on the CPU model.

So the takeaway is to benchmark the workload you're interested in on the CPUs you care about!