 Register  # 22.3. Search in Sorted Arrays¶

## 22.3.1. Analysis¶

For large collections of records that are searched repeatedly, sequential search is unacceptably slow. One way to reduce search time is to preprocess the records by sorting them. Given a sorted array, an obvious improvement over simple linear search is to test if the current element in L is greater than $K$. If it is, then we know that $K$ cannot appear later in the array, and we can quit the search early. But this still does not improve the worst-case cost of the algorithm.

### 22.3.1.3. Lower Bounds Proof¶

So, $O(\log n)$ time for Binary Search seems pretty good. Can we do better than this? We can prove that this is the best possible algorithm in the worst case for searching in a sorted list by using a proof similar to that used to show the lower bound on sorting.

We use the decision tree to model our algorithm. Unlike when searching an unsorted list, comparisons between elements of L tell us nothing new about their relative order (since L is already sorted), so we consider only comparisons between $K$ and an element in L. At the root of the decision tree, our knowledge rules out no positions in L, so all are potential candidates. As we take branches in the decision tree based on the result of comparing $K$ to an element in L, we gradually rule out potential candidates. Eventually we reach a leaf node in the tree representing the single position in L that can contain $K$. There must be at least $n+1$ nodes in the tree because we have $n+1$ distinct positions that $K$ can be in (any position in L, plus not in L at all). Some path in the tree must be at least $\log n$ levels deep, and the deepest node in the tree represents the worst case for that algorithm. Thus, any algorithm on a sorted array requires at least $\Omega(\log n)$ comparisons in the worst case.

We can modify this proof to find the average cost lower bound. Again, we model algorithms using decision trees. Except now we are interested not in the depth of the deepest node (the worst case) and therefore the tree with the least-deepest node. Instead, we are interested in knowing what the minimum possible is for the “average depth” of the leaf nodes. Define the total path length as the sum of the levels for each node. The cost of an outcome is the level of the corresponding node plus 1. The average cost of the algorithm is the average cost of the outcomes (total path length / $n$). What is the tree with the least average depth? This is equivalent to the tree that corresponds to binary search. Thus, binary search is optimal in the average case.

While binary search is indeed an optimal algorithm for a sorted list in the worst and average cases when searching a sorted array, there are a number of circumstances that might lead us to select another algorithm instead. One possibility is that we know something about the distribution of the data in the array. If each position in L is equally likely to hold $K$ (equivalently, the data are well distributed along the full key range), then an interpolation search is $\Theta(\log \log n)$ in the average case. If the data are not sorted, then using binary search requires us to pay the cost of sorting the list in advance, which is only worthwhile if many (at least $O(\log n)$ searches will be performed on the list. Binary search also requires that the list (even if sorted) be implemented using an array or some other structure that supports random access to all elements with equal cost. Finally, if we know all search requests in advance, we might prefer to sort the list by frequency and do linear search in extreme search distributions, or use a self-organizing list.  