Here are the notes for Chapter 4.


In this chapter we introduce a linear-time algorithm that solves the string matching problem. Unlike the usual way to proceed, where algorithms are presented in the order they were published, here we put forward a more recent linear-time algorithm. Although this algorithm was not the first linear-time one, it is conceptually simple and fundamental, and it will help the reader understand the algorithms presented later on. This algorithm is called the Z algorithm and is attributed to Gusfield [Gus96] and [Gus97].

Many string matching algorithms try to avoid computing some of the n-m+1 shifts of the naive algorithm. To do so, they spend a small amount of time (compared to the overall complexity) learning about the structure of the pattern or the text. That learning step is called pre-processing. Algorithms such as the Knuth-Morris-Pratt algorithm [KMP77], the Boyer-Moore algorithm [BM77] and the Apostolico-Giancarlo algorithm [AG86] are examples where the pattern undergoes pre-processing before the text is actually searched. On the other hand, methods such as suffix trees are based on pre-processing the text. Some of these algorithms are alphabet-dependent, that is, their time complexity depends on the size of Σ.
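To make the baseline concrete, the naive algorithm mentioned above simply tries every one of the n-m+1 possible shifts of the pattern over the text. A minimal sketch (function name and interface are illustrative, not from the text):

```python
def naive_match(pattern, text):
    """Return all shifts s (0-indexed) where pattern occurs in text.

    Tries every one of the n - m + 1 shifts, comparing the pattern
    against the aligned window of the text at each shift.
    """
    n, m = len(text), len(pattern)
    return [s for s in range(n - m + 1) if text[s:s + m] == pattern]
```

For example, naive_match("aba", "ababa") inspects the 3 shifts 0, 1 and 2 and reports the two occurrences at shifts 0 and 2. The pre-processing-based algorithms discussed above aim to skip some of these shifts.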



Since pre-processing is used in contexts other than string pattern matching, we will denote an arbitrary string by S instead of using the usual letter P. Let S be a string and i>1 one of its positions. Starting at i, we consider all the substrings S[i..j], where i<=j<=|S|, such that S[i..j] matches a prefix of S itself. Among all the values j may take, select the greatest one; the length of the corresponding match, j-i+1, is denoted Zi(S). If S[i] differs from S[1], then no such j exists and Zi(S) is set to 0. In other words, Zi(S) is defined as the maximum length such that S[i..i+Zi(S)-1] equals the prefix S[1..Zi(S)]. When S is clear from the context, we simplify the notation and just write Zi.
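The definition above can be computed directly, by extending a match at each position i for as long as it agrees with a prefix of S. The sketch below is a definition-faithful quadratic version (0-indexed, so Z[0] is left at 0 by convention); the linear-time Z algorithm this chapter builds toward avoids re-comparing characters:

```python
def z_values(S):
    """Compute Z[i] = length of the longest substring of S starting
    at position i that matches a prefix of S (naive, O(|S|^2)).

    Positions are 0-indexed; Z[0] is left at 0 by convention, since
    the whole string trivially matches itself.
    """
    n = len(S)
    Z = [0] * n
    for i in range(1, n):
        k = 0
        # Extend the match of S[i..] against the prefix S[0..].
        while i + k < n and S[k] == S[i + k]:
            k += 1
        Z[i] = k
    return Z
```

For instance, z_values("aabcaabxaaz") yields [0, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0]: at position 4 (0-indexed) the substring "aab" matches the prefix of length 3, so Z[4] = 3, while positions whose character differs from S[0] get 0, as in the definition.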
