Knuth-Morris-Pratt(KMP) Algorithm: String Matching

Claire Lee
5 min readOct 15, 2022

--

A string matching/searching algorithm that is used to search for a pattern in a given string. Conceptually, there are two steps to implement KMP algorithm, build a LPS(Longest Proper Prefix which is also a Suffix) array and then perform string matching. KMP algorithm utilizes the pre-computed LPS array to avoid redundant old comparisons when get a mismatch. KMP algorithm runs in linear time.

KMP algorithm summary card

Naive Approach for String Matching

string: a larger string
pattern: a smaller string
looking for the pattern in the string

Iterate over every character in the string to check whether if the pattern is occurred in the string. This approach always starts matching the whole pattern from the beginning if it gets a mismatch.

naive approach for string matching

LPS Array

proper prefix: all prefixes except the entire string itself

LPS

LPS stands for Longest Proper Prefix which is also a Suffix. In brief, split a string into 3 substrings, the first substring called prefix and the third substring called suffix. Most importantly, the prefix is equal to suffix.

prefix == suffix

Therefore, it means the same substring occurs more than once in a string.

LPS
Example: index:  012345678
string: aabbcaabb
prefix: aabb
suffix: aabb
prefix == suffix
→ LPS: aabb

How is LPS related to String Matching?

While implementing string matching, having LPS information is quite handy. After a mismatch, we go back to the letter right before the current mismatch letter and check if LPS is present in the substring ending at this letter. The prefix and suffix are the same substring that appears at different positions in a string. If the suffix has been matched correctly, the prefix must be matched too. Therefore, we just jump back to the next letter of the early occurring prefix to resume the process of string matching.

how LPS related to string matching

LPS array

In order to jump back to the next letter of the prefix at a given index after a mismatch, LPS array is used to store the ending index or position of the early occurring prefix for every substring starting at index 0.

You can store -1/0 to represent no LPS is recognized at the given index and ending index/position of the early occurring prefix in LPS array. Both way follow the same rules.

LPS array

Knuth-Morris-Pratt(KMP) Algorithm utilizes LPS array to determine how many letters can be skipped for comparison after a mismatch.

How Does Knuth-Morris-Pratt(KMP) Algorithm Work?

string: a larger string
pattern: a smaller string
looking for the the pattern in the string

Knuth-Morris-Pratt(KMP) algorithm are used to search for a pattern in a given string. Firstly, it finds repeated substrings called LPS in the pattern and store LPS information in an array. Secondly, perform string matching. While a mismatch takes place, it utilizes LPS array to decide where to begin the next match to avoid redundant old comparisons.

Build LPS array

  • step1: Initialize a LPS array of the same size as the pattern and set all array elements to -1.

-1 denotes no LPS is recognized at the given index

step 1
  • step2: Define variables i and j and set i = 1, j = 0.
step 2
  • step3: Compare the letters at pattern[i] and pattern[j]

match:
store the ending index of the prefix by and increment i and j value by one.

LPS[i] = j
i += 1
j += 1

mismatch:

j == 0
→ i += 1
j > 0:
Go back to the previous index(j - 1), find LPS at this index(LPS[j-1]) and then resume the process of matching at the next index of the early occurring prefix
→ j = LPS[j-1] + 1
step 3

Repeat step3 until all the elements in the pattern are iterated.

String matching

The implementation of string matching is quite similar to the process of building the LPS array except compare letters in the main string and the pattern.

  • step1: Define variables i and j to denote the current index at the main string and the pattern respectively. Set i = 0, j = 0.
  • step2: Compare the letters at string[i] and pattern[j]

match: increment i and j value by one respectively.

i += 1
j += 1

mismatch:

j == 0
→ i += 1
j > 0:
Go back to the previous index(j - 1), find LPS at this index(LPS[j-1]) and then resume the process of matching at the next index of the early occurring prefix
→ j = LPS[j-1] + 1

Repeat step2 until the whole pattern is matched or the length of unchecked in the main string is less than the length of the pattern string.

Graphical Explanation

pattern and string

Build LPS array

1
2
3
4

String matching

1
2

Code Implementation

Complexity

  • Naive approach
    Time: O(n*m)
    Space: O(1)
  • KMP algorithm
    Time: O(n+m)
    Space: O(m) LPS array

n: the length of the main string
m: the length of the pattern

Golang

  • Naive approach
  • KMP algorithm

Python

  • Naive approach
  • KMP algorithm

You can access the source code here.

--

--

Claire Lee
Claire Lee

Responses (2)