This post is my note I kept when I was reading the Introduction to Algorithms book.
String matching problem is formalized as follows.
- $T[1\cdots n]$ : An array of length $n$.
- $P[1\cdots m]$ : An array of the pattern that we are looking for.
- Assumption: Both elements of $P$ and $T$ are characters drawn from a finite alphabet $\Sigma$.
According to problems, $\Sigma$ could be $\{0, 1\}$ or $\{a, b, \ldots, z \}$.
And, to make describing the problem easier, we define the following concepts:
- $P$ occurs with shift s (starting from 0): Or pattern $P$ occurs beginning at position $s+1$(starting from 1) if $0 \leq s \leq n-m $ and $T[s+1\ldots s+m] = P[1\ldots m]$. And such a shift position is also called a valid shift. Otherwise, a shift is an invalid shift.
Thus, the string matching is the problem of finding all valid shifts with which a given pattern $P$ occurs in a given text $T$.
Overview of String Matching Algorithms
Algorithm | Preprocessing Time | Matching Time |
---|---|---|
Naive | 0 | $O((n-m+1)m)$ |
Rabin-Karp | $\Theta(m)$ | $O((n-m+1)m)$ |
Finite Automaton | $O(m\vert \Sigma \vert)$ | $\Theta(n)$ |
Knuth-Morris-Pratt | $\Theta(m)$ | $\Theta(n)$ |
The first time I saw this table, I was confused completely. The Rabin-Karp method is definitely off somewhere. Its total running time is worse than the Naive method. Yes, that’s true for the worst cases. But the Rabin-Karp algorithm works much better on average and in practice. And it generalizes nicely to other pattern-matching problems.
Notation and Terminology
- $\Sigma ^ \ast$ : the set of all finite-length strings formed using characters from the alphabet $\Sigma$
- Concatenation: the concatenation of two strings $x$ and $y$ is denoted $xy$.
- Prefix: String $w$ is a prefix of a string $x$, denoted $w \sqsubset x$, if $x = wy$ for some string $y \in \Sigma^\ast$.
- Suffix: denoted $w \sqsupset x$, if $x = yw$ for some $y \in \Sigma^\ast$.
- Empty String: denoted $\varepsilon$
By definition, empty string is both a suffix and a prefix of all strings.
From the notations we already have, we can move forward to have the following lemma.
- Overlapping-suffix Lemma : Suppose that $x$, $y$, and $z$ are strings such that $x\sqsupset z$ and $y \sqsupset z$. If $|x| \leq |z|$, then $x \sqsupset y$. If $|x|\geq |y|$, then $y\sqsupset x$. If $|x| = |y|$, then x=y.
Leave a Comment