KMP (Knuth-Morris-Pratt) Algorithm

Intro

The Knuth-Morris-Pratt (KMP) algorithm searches for a pattern of length m in a text of length n in guaranteed O(n + m) time by never rescanning text characters after a mismatch. Its key mechanism is the prefix function (also called the LPS array, for Longest Proper Prefix which is also a Suffix), which precomputes, for each position in the pattern, how far to fall back when a mismatch occurs, reusing work already done instead of starting over.

Use KMP when worst-case linear time matters — for example, scanning large log streams or network traffic where adversarial inputs (like "aaaaab" in "aaaa...a") would degrade naive search to O(nm). For simple one-off searches on typical text, language built-in functions (which often use optimized heuristics) are fine, but KMP gives a strict guarantee that naive search cannot.

How It Works

Step 1 — Build the prefix function (LPS array) in O(m):
For each position j in the pattern, lps[j] stores the length of the longest proper prefix of pattern[0..j] that is also a suffix. This tells the algorithm: on mismatch at position j, the pattern can be shifted so that lps[j-1] characters of the existing match are preserved.
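The construction can be sketched as follows; this is a minimal Python version with illustrative names (the document does not prescribe an implementation):

```python
def compute_lps(pattern: str) -> list[int]:
    """Build the LPS (prefix function) array in O(m)."""
    m = len(pattern)
    lps = [0] * m          # lps[0] is always 0: a proper prefix is strictly shorter
    length = 0             # length of the current longest proper prefix-suffix
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length > 0:
            length = lps[length - 1]   # fall back; note that i does NOT advance here
        else:
            lps[i] = 0
            i += 1
    return lps

print(compute_lps("ABABC"))  # [0, 0, 1, 2, 0]
```

The fallback line is the heart of the construction: it reuses previously computed lps entries rather than re-comparing from scratch, which is what keeps the total work linear.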

Step 2 — Scan the text in O(n):
Maintain two pointers: i (text position) and j (pattern position). On match, advance both. On mismatch, if j > 0, set j = lps[j-1] (fall back without moving i); if j == 0, advance i. When j reaches m, a full match is found at position i - m.

Total: O(n + m) time, O(m) space for the LPS array. The text pointer i never moves backward — this is what gives KMP its linear guarantee.
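Putting the two steps together, a compact sketch in Python (the function name and the overlapping-match behavior are choices of this sketch, not mandated by the algorithm):

```python
def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the start indices of all (possibly overlapping) matches."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    # Step 1: build the LPS array in O(m).
    lps = [0] * m
    length, k = 0, 1
    while k < m:
        if pattern[k] == pattern[length]:
            length += 1
            lps[k] = length
            k += 1
        elif length > 0:
            length = lps[length - 1]
        else:
            k += 1
    # Step 2: scan the text in O(n); i never moves backward.
    matches = []
    i = j = 0
    while i < n:
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:
                matches.append(i - m)
                j = lps[j - 1]   # keep going to find overlapping matches
        elif j > 0:
            j = lps[j - 1]       # fall back in the pattern only
        else:
            i += 1
    return matches

print(kmp_search("ABABABC", "ABABC"))  # [2]
```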

graph TD
  S[Input pattern P and text T] --> P0[Precompute LPS array for P]
  P0 --> A[Set i to 0 and j to 0]
  A --> B{i less than len T}
  B -->|No| Z[Done]
  B -->|Yes| C{T at i equals P at j}
  C -->|Yes| D[Increment i and j]
  D --> E{j equals len P}
  E -->|Yes| F[Match found at i minus m]
  F --> G[Set j to LPS at j minus 1]
  G --> B
  E -->|No| B
  C -->|No| H{j greater than 0}
  H -->|Yes| I[Set j to LPS at j minus 1]
  I --> C
  H -->|No| J[Increment i]
  J --> B

Example

Pattern: ABABC    (m=5)
LPS:     [0, 0, 1, 2, 0]

  lps[0]=0: "A" has no proper prefix that is also a suffix
  lps[1]=0: "AB" — no match
  lps[2]=1: "ABA" — prefix "A" matches suffix "A"
  lps[3]=2: "ABAB" — prefix "AB" matches suffix "AB"
  lps[4]=0: "ABABC" — no match

Text: ABABABC
  i=0..3: match ABAB (j=4)
  i=4: T[4]='A' ≠ P[4]='C' → mismatch. Set j=lps[3]=2 (keep "AB" match).
  i=4..6: match from j=2. T[4..6]="ABC" matches P[2..4]="ABC".
  j=5=m → match found at position 2.
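The trace above can be reproduced mechanically. A small instrumented sketch (assuming the LPS array [0, 0, 1, 2, 0] derived above) records each fallback and match:

```python
text, pattern = "ABABABC", "ABABC"
lps = [0, 0, 1, 2, 0]   # LPS array for "ABABC", as derived above
events = []
i = j = 0
while i < len(text):
    if text[i] == pattern[j]:
        i += 1
        j += 1
        if j == len(pattern):
            events.append(("match", i - len(pattern)))
            j = lps[j - 1]
    elif j > 0:
        events.append(("fallback", i, j, lps[j - 1]))  # (label, text pos, old j, new j)
        j = lps[j - 1]
    else:
        i += 1
print(events)  # [('fallback', 4, 4, 2), ('match', 2)]
```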

Pitfalls

Off-by-One in LPS Construction

lps[0] is always 0, because a proper prefix must be strictly shorter than the string. In the fallback branch, set length = lps[length - 1] without advancing i; decrementing length by one, or falling back only a single step instead of following lps repeatedly, produces wrong values for patterns with nested repeats such as "AAACAAAA".

Single-Pattern Limitation

KMP preprocesses exactly one pattern, so searching for k patterns means k separate passes over the text. When many patterns must be matched in a single pass, Aho-Corasick generalizes the failure-link idea to a trie of all patterns (see Tradeoffs below).

Tradeoffs

Choice | Option A | Option B | Decision criteria
Worst-case guarantee needed | KMP, O(n+m) | Naive, O(nm) | KMP is strictly better on adversarial inputs; naive is simpler for small inputs where the worst case is unlikely.
Multiple patterns | Aho-Corasick | KMP per pattern | Aho-Corasick scans the text once for all patterns; switch when the pattern count exceeds 2-3.
Implementation simplicity | Rabin-Karp | KMP | Rabin-Karp is simpler to code (rolling hash) but has a probabilistic worst case; KMP is deterministic but requires understanding the prefix function.
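The worst-case row can be made concrete by counting character comparisons on the adversarial input mentioned in the intro. This is an illustrative sketch (comparison counts, not wall-clock timings; function names are assumptions):

```python
def naive_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by naive search."""
    n, m = len(text), len(pattern)
    count = 0
    for start in range(n - m + 1):
        for j in range(m):
            count += 1
            if text[start + j] != pattern[j]:
                break
    return count

def kmp_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by the KMP scan."""
    m = len(pattern)
    lps = [0] * m
    length, k = 0, 1
    while k < m:
        if pattern[k] == pattern[length]:
            length += 1
            lps[k] = length
            k += 1
        elif length > 0:
            length = lps[length - 1]
        else:
            k += 1
    count = i = j = 0
    while i < len(text):
        count += 1
        if text[i] == pattern[j]:
            i += 1
            j += 1
            if j == m:
                j = lps[j - 1]
        elif j > 0:
            j = lps[j - 1]
        else:
            i += 1
    return count

text = "a" * 2000
pattern = "a" * 50 + "b"
print(naive_comparisons(text, pattern))  # 99450: roughly n*m
print(kmp_comparisons(text, pattern))    # 3950: at most 2n
```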

Questions

References


What's next