Knuth-Morris-Pratt(KMP) Algorithm: String Matching
A string matching/searching algorithm that is used to search for a pattern in a given string. Conceptually, there are two steps to implement KMP algorithm, build a LPS(Longest Proper Prefix which is also a Suffix) array and then perform string matching. KMP algorithm utilizes the pre-computed LPS array to avoid redundant old comparisons when get a mismatch. KMP algorithm runs in linear time.
Table of Contents· Naive Approach for String Matching
· LPS Array
∘ LPS
∘ How is LPS related to String Matching?
∘ LPS array
· How Does Knuth-Morris-Pratt(KMP) Algorithm Work?
∘ Build LPS array
∘ String matching
· Graphical Explanation
∘ Build LPS array
∘ String matching
· Code Implementation
∘ Complexity
∘ Golang
∘ Python
Naive Approach for String Matching
string: a larger string
pattern: a smaller string
looking for the pattern in the string
Iterate over every character in the string to check whether if the pattern is occurred in the string. This approach always starts matching the whole pattern from the beginning if it gets a mismatch.
LPS Array
proper prefix: all prefixes except the entire string itself
LPS
LPS stands for Longest Proper Prefix which is also a Suffix. In brief, split a string into 3 substrings, the first substring called prefix and the third substring called suffix. Most importantly, the prefix is equal to suffix.
prefix == suffix
Therefore, it means the same substring occurs more than once in a string.
Example: index: 012345678
string: aabbcaabbprefix: aabb
suffix: aabb
prefix == suffix
→ LPS: aabb
How is LPS related to String Matching?
While implementing string matching, having LPS information is quite handy. After a mismatch, we go back to the letter right before the current mismatch letter and check if LPS is present in the substring ending at this letter. The prefix and suffix are the same substring that appears at different positions in a string. If the suffix has been matched correctly, the prefix must be matched too. Therefore, we just jump back to the next letter of the early occurring prefix to resume the process of string matching.
LPS array
In order to jump back to the next letter of the prefix at a given index after a mismatch, LPS array is used to store the ending index or position of the early occurring prefix for every substring starting at index 0.
You can store -1/0 to represent no LPS is recognized at the given index and ending index/position of the early occurring prefix in LPS array. Both way follow the same rules.
Knuth-Morris-Pratt(KMP) Algorithm utilizes LPS array to determine how many letters can be skipped for comparison after a mismatch.
How Does Knuth-Morris-Pratt(KMP) Algorithm Work?
string: a larger string
pattern: a smaller string
looking for the the pattern in the string
Knuth-Morris-Pratt(KMP) algorithm are used to search for a pattern in a given string. Firstly, it finds repeated substrings called LPS in the pattern and store LPS information in an array. Secondly, perform string matching. While a mismatch takes place, it utilizes LPS array to decide where to begin the next match to avoid redundant old comparisons.
Build LPS array
- step1: Initialize a LPS array of the same size as the pattern and set all array elements to -1.
-1 denotes no LPS is recognized at the given index
- step2: Define variables i and j and set i = 1, j = 0.
- step3: Compare the letters at pattern[i] and pattern[j]
— match:
store the ending index of the prefix by and increment i and j value by one.
LPS[i] = j
i += 1
j += 1
— mismatch:
j == 0
→ i += 1j > 0:
Go back to the previous index(j - 1), find LPS at this index(LPS[j-1]) and then resume the process of matching at the next index of the early occurring prefix
→ j = LPS[j-1] + 1
Repeat step3 until all the elements in the pattern are iterated.
String matching
The implementation of string matching is quite similar to the process of building the LPS array except compare letters in the main string and the pattern.
- step1: Define variables i and j to denote the current index at the main string and the pattern respectively. Set i = 0, j = 0.
- step2: Compare the letters at string[i] and pattern[j]
— match: increment i and j value by one respectively.
i += 1
j += 1
— mismatch:
j == 0
→ i += 1j > 0:
Go back to the previous index(j - 1), find LPS at this index(LPS[j-1]) and then resume the process of matching at the next index of the early occurring prefix
→ j = LPS[j-1] + 1
Repeat step2 until the whole pattern is matched or the length of unchecked in the main string is less than the length of the pattern string.
Graphical Explanation
Build LPS array
String matching
Code Implementation
Complexity
- Naive approach
Time: O(n*m)
Space: O(1) - KMP algorithm
Time: O(n+m)
Space: O(m) LPS array
n: the length of the main string
m: the length of the pattern
Golang
- Naive approach
- KMP algorithm
Python
- Naive approach
- KMP algorithm
You can access the source code here.