30.5. Boyer-Moore String Search Algorithm¶
Like the KMP algorithm, a string search algorithm developed by Boyer and Moore in 1977 initially examines the structure of the string \(sub\) to see if it can be realigned a considerable distance to the right, when a mismatch occurs. Unlike the KMP algorithm, the Boyer‑Moore algorithm compares the characters of the string \(sub\) to that of the \(master\) string in a right‑to‑left fashion. The hope is that this will allow realignments of considerable magnitude when a mismatch occurs early in the comparison of \(sub\) against a portion of \(master\). For instance, suppose that, at the beginning of a right‑to‑left scan of \(sub\) aligned against a portion of \(master\), we find the character "L" in \(master\) and some other non-matching character in the rightmost index of \(sub\). Then, if "L" does not occur anywhere else in \(sub\), \(sub\) may be realigned so that the character in index 0 of \(sub\) aligns with the character immediately to the right of “L” in \(master\). (Why?) Analogously, if the first "L" to the left of the final position in \(sub\) occurs at index \(i\) of \(sub\), then \(sub\) may be realigned so that index \(i\) in \(sub\) is aligned with the “L” in \(master\). (Why?) Thus, when a mismatch occurs at the rightmost (that is, the first examined) position of \(sub\), the character in \(master\) that caused the mismatch can be used to tell us how much \(sub\) can be realigned to the right. A pre-processing pass through \(sub\) could be used to determine the amount of realignment for any possible character that could occur in \(master\). This information is called the "mismatched character heuristic" and is stored in an array that we will identify by the name \(MMC\).
To supplement this mismatched character heuristic, the Boyer-Moore algorithm uses another \(align\) array that contains re-alignment information defined as follows.
where \(suffix\_length\) is the length of the suffix of the string beginning at position \(p + 1\) and \(offset\) is the least amount this suffix must be moved to the left to match another occurrence of itself in \(sub\) without matching the character in position in \(p\). This motion to the left may involve the leftmost characters "sliding off the end of the master string". When this occurs, those characters that have slid off the end of the master string are viewed as matching the non-existent characters to which they would be compared. The computation of the \(align\) array can be tricky. It somewhat resembles the computation of the KMP \(align\) array but is done "in reverse" because of the right-to-left scan done by Boyer-Moore. We will further study the computation of the \(align\) array later in this module. However, let us first watch a slideshow of the entire Boyer-Moore algorithm, under the assumption that both the mismatched character heuristic and this "reverse KMP" \(align\) array have been computed.
Now that you've seen how the Boyer-Moore algorithm works once the mismatched character and reverse KMP alignments have been pre-computed, use the next two slideshows to study in more detail how the pre-computation of these two alignment tables would be done.
Slideshow for Boyer-Moore Mismatched Character Table Construction
Slideshow for Boyer-Moore "Reverse KMP" Alignment Table Construction
We've seen from the above slideshows that there are really three algorithms at play in Boyer-Moore:
The main algorithm using two pre-computed re-alignment tables::
m = Sub.length - 1
while m < Master.length:
s = Sub.length - 1
while s >= 0 and Master[m] = Sub[s]: m = m-1, s = s-1
if s < 0: return m+1
else: m = m + larger_of(MMC[master[m]], Align[s])
return -1
The algorithm to compute the mismatched character table::
p = current character in alphabet
if alphabet[p] doesn't exist in string then: MMC[p] = string.length
else: MMC[p] = distance from right end of string to furthest right occurrence of alphabet[p] in string.
And the algorithm to compute the reverse-KMP alignment table::
p = current_index
suffix_length = length of the suffix of string beginning at p+1
offset = least amount that the suffix must be moved left to match another occurrence of itself
that isn't preceded by the same character that is at string[p]
if p = string.length()-1 then:
align[p] = 1
else
align[p] = suffix_length + offset
Keeping in mind the pseudocode for these algorithms, test yourself on Boyer-Moore by completing the following four exercises.
- Exercise in tracing one step of the Boyer-Moore algorithm
- Exercise in tracing one step of the Boyer-Moore Mismatched Character Table Construction
- Exercise in tracing one step of the Boyer-Moore Alignment Table Construction
- Proficiency Exercise in tracing entire Boyer-Moore algorithm