09 · String Matching

KMP vs naive · same haystack · same needle · who finishes with fewer comparisons?

Haystack Needle Speed 8×

try:

What you're watching

Both algorithms search for the needle inside the haystack and report every match position. They produce identical results; they differ only in how many character comparisons they do along the way. The "comparisons" stat above each panel is the fair efficiency metric.

RECOMMENDED PRESET

"KMP advantage · sparse needle" (haystack AAAAAAAAAAAAAAAAB, needle AAAAB): the naive algorithm re-scans almost every character on every failed alignment — its comparison count blows up. KMP's failure function records that "after matching AAAA, you've already established the next 4 characters in the needle should also start with AAAA" and skips ahead efficiently. The comparison-count gap is the entire point of KMP.

The two algorithms

Naive substring search O(n·m) worst case

For every position i in 0..n-m, try matching the needle character by character against haystack[i..i+m-1]. On any mismatch, slide the needle one position right and start over. Each haystack character can be re-examined up to m times — hence the O(nm) worst case. Easy to implement, never fast.

Knuth-Morris-Pratt (KMP) O(n + m)

Pre-compute a failure function for the needle in O(m): failure[q] is the length of the longest proper prefix of needle[0..q] that is also a suffix of it. This encodes "if I've matched the first q+1 characters of the needle and then mismatch at q+1, how many of those q characters can I keep using as a free head-start on the next attempt?"

During matching, on a mismatch at needle position q, instead of restarting we set q = failure[q-1] and retry from there. No haystack character is ever re-examined, so total time is O(n + m).

Reading the panels

Top row: haystack characters. The currently-compared character is highlighted blue (match) or red (mismatch). Past matches' span is filled solid blue.
Middle row: needle, slid into its current anchor position relative to the haystack. Already-matched prefix is filled blue.
Bottom row (KMP only): the failure function array, one value per needle character. Filled live during the build phase, then read-only during matching.
Connecting line: drawn between the two characters currently being compared. Blue on a match, red on a mismatch.

What I learned writing this

The failure function is the entire algorithm. Once you have failure[], the match phase is six lines of code: while-loop with a single conditional jump. The trick KMP gets the credit for is a one-time O(m) pre-computation that encodes everything you need to know about the needle's self-similarity structure.
Building the failure function is itself a KMP-on-the-needle. The build loop has the same structure as the match loop: a "candidate length" k that gets reduced via fail[k-1] on mismatch. The needle scans against a shifted copy of itself, discovering its own internal repetitions. Self-referential and a little magical.
Naive's pathology is on inputs that look like patterns KMP shines on. Random text rarely triggers naive's O(nm) — most mismatches happen at j=0 or j=1, so the slide-by-1 isn't expensive. KMP's real win is on highly self-similar needles (lots of "AAAA"-like prefixes). The "KMP advantage" preset above is the canonical example.
Visualising the failure function is harder than visualising its use. The build phase has nested logic: outer q advances, inner k bounces backward. Drawing arrows for that gets messy; I settled for showing failure[] as a strip of numbers that fill in left-to-right, with the current q highlighted. The reader can mentally connect the dots without seeing every internal jump.

← Back to all algorithms