09 · String Matching
KMP vs naive · same haystack · same needle · who finishes with fewer comparisons?
What you're watching
Both algorithms search for the needle inside the haystack and report every match position. They produce identical results; they differ only in how many character comparisons they do along the way. The "comparisons" stat above each panel is the fair efficiency metric.
AAAAAAAAAAAAAAAAB, needle AAAAB):
the naive algorithm re-scans almost every character on every
failed alignment — its comparison count blows up. KMP's failure
function records that "after matching AAAA, you've
already established the next 4 characters in the needle should
also start with AAAA" and skips ahead efficiently.
The comparison-count gap is the entire point of KMP.
The two algorithms
Naive substring search O(n·m) worst case
For every position i in 0..n-m, try
matching the needle character by character against
haystack[i..i+m-1]. On any mismatch, slide the
needle one position right and start over. Each haystack character
can be re-examined up to m times — hence the O(nm) worst case.
Easy to implement, never fast.
Knuth-Morris-Pratt (KMP) O(n + m)
Pre-compute a failure function for the needle in O(m):
failure[q] is the length of the longest proper prefix
of needle[0..q] that is also a suffix of it. This
encodes "if I've matched the first q+1 characters of the needle
and then mismatch at q+1, how many of those q characters can I
keep using as a free head-start on the next attempt?"
During matching, on a mismatch at needle position q,
instead of restarting we set q = failure[q-1] and
retry from there. No haystack character is ever re-examined,
so total time is O(n + m).
Reading the panels
- Top row: haystack characters. The currently-compared character is highlighted blue (match) or red (mismatch). Past matches' span is filled solid blue.
- Middle row: needle, slid into its current anchor position relative to the haystack. Already-matched prefix is filled blue.
- Bottom row (KMP only): the failure function array, one value per needle character. Filled live during the build phase, then read-only during matching.
- Connecting line: drawn between the two characters currently being compared. Blue on a match, red on a mismatch.
What I learned writing this
-
The failure function is the entire algorithm.
Once you have
failure[], the match phase is six lines of code: while-loop with a single conditional jump. The trick KMP gets the credit for is a one-time O(m) pre-computation that encodes everything you need to know about the needle's self-similarity structure. - Building the failure function is itself a KMP-on-the-needle. The build loop has the same structure as the match loop: a "candidate length" k that gets reduced via fail[k-1] on mismatch. The needle scans against a shifted copy of itself, discovering its own internal repetitions. Self-referential and a little magical.
- Naive's pathology is on inputs that look like patterns KMP shines on. Random text rarely triggers naive's O(nm) — most mismatches happen at j=0 or j=1, so the slide-by-1 isn't expensive. KMP's real win is on highly self-similar needles (lots of "AAAA"-like prefixes). The "KMP advantage" preset above is the canonical example.
- Visualising the failure function is harder than visualising its use. The build phase has nested logic: outer q advances, inner k bounces backward. Drawing arrows for that gets messy; I settled for showing failure[] as a strip of numbers that fill in left-to-right, with the current q highlighted. The reader can mentally connect the dots without seeing every internal jump.