Rabin Karp Search

Intro

The Rabin-Karp algorithm searches for a pattern in text by comparing hash values instead of characters directly. It computes a hash for the pattern and a rolling hash for each text window of the same length, so moving to the next window costs O(1) instead of recomputing from scratch. When hashes match, a character-by-character verification confirms the result. Expected time is O(n + m), but pathological hash collisions can degrade to O(nm).

Use Rabin-Karp when you need a simple, effective string search that extends naturally to multi-pattern matching — store all pattern hashes in a set and check each window hash against the set in O(1). This makes it practical for plagiarism detection, DNA sequence matching, and log scanning with multiple signatures.

How It Works

Hash function: treat each character as a digit in a base-b number system, modulo a large prime p. For a window T[i..i+m-1]:

hash = (T[i] · b^(m-1) + T[i+1] · b^(m-2) + ... + T[i+m-1]) mod p

Rolling update: to slide the window one position right, remove the leftmost character's contribution, shift, and add the new rightmost character:

hash_new = (hash_old · b - T[i] · b^m + T[i+m]) mod p

This update is O(1), making the full scan O(n) expected. On hash match, verify the actual characters in O(m) to rule out collisions.

graph TD
  S[Input pattern P of length m and text T] --> A[Choose base b and prime modulus p]
  A --> B[Compute hash of P]
  B --> C[Compute hash of first window T from 0 to m minus 1]
  C --> D[Set i to 0]
  D --> E{i at most len T minus m}
  E -->|No| Z[Done no more windows]
  E -->|Yes| F{window hash equals pattern hash}
  F -->|No| G[Roll hash to next window]
  F -->|Yes| H{character by character match}
  H -->|Yes| I[Report match at i]
  H -->|No| J[Hash collision skip]
  I --> G
  J --> G
  G --> K[Increment i]
  K --> E

Example

Pattern: "aba" (m=3), base=10, a=1 b=2 c=3
Pattern hash: 1·100 + 2·10 + 1 = 121

Text: "abacaba"
Window "aba": hash=121 → equals pattern hash → verify → match at 0
Roll to "bac": (121 - 1·100)·10 + 3 = 213 → skip
Roll to "aca": (213 - 2·100)·10 + 1 = 131 → skip
Roll to "cab": (131 - 1·100)·10 + 2 = 312 → skip
Roll to "aba": (312 - 3·100)·10 + 1 = 121 → verify → match at 4

Pitfalls

Poor Modulus Choice Causes Excessive Collisions

Integer Overflow in Hash Arithmetic

Skipping Character Verification

Tradeoffs

Choice Option A Option B Decision criteria
Deterministic guarantee KMP O(n+m) worst case Rabin-Karp O(n+m) expected KMP guarantees linear time regardless of input. Choose Rabin-Karp when expected-case simplicity matters and collisions are manageable.
Multi-pattern search Rabin-Karp with hash set KMP per pattern Rabin-Karp extends naturally to multiple patterns by checking window hash against a set. Switch to Aho-Corasick for very large pattern counts with deterministic guarantees.
Implementation effort Rabin-Karp rolling hash KMP prefix function Rolling hash is simpler to implement correctly. Choose Rabin-Karp for quick prototyping; switch to KMP if collision risk is unacceptable.

Questions

References


Whats next