|
Problem 46: Computing Longest Common Prefixes |
|
The Suffix array of a non-empty word $y$ is a light and efficient solution for text indexing. It consists in using a binary search procedure to locate patterns inside $y$. To do so the suffixes of $y$ are first sorted in lexicographic order, producing a table $\SA$ that lists the starting positions of the sorted suffixes.
But this standard technique is not sufficient to get a powerful search method. This is why the table $\SA$ is adjoined a second table $\LCP$ that gives the length of longest common prefixes between consecutive suffixes in the sorted list (some more values easy to deduce are also needed). Using both tables, searching $y$ for a word $x$ is then achieved in time $O(|x|+\log |y|)$ instead of a straightforward $O(|x|\log |y|)$ time without the table $\LCP$. Here is the Suffix array of $\sa{abaabababbabbb}$:
| $j$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | |
| $y[j]$ | $\sa{a}$ | $\sa{b}$ | $\sa{a}$ | $\sa{a}$ | $\sa{b}$ | $\sa{a}$ | $\sa{b}$ | $\sa{a}$ | $\sa{b}$ | $\sa{b}$ | $\sa{a}$ | $\sa{b}$ | $\sa{b}$ | $\sa{b}$ | |
| Rank $r$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| $\SA[r]$ | 2 | 0 | 3 | 5 | 7 | 10 | 13 | 1 | 4 | 6 | 9 | 12 | 8 | 11 | |
| $\LCP[r]$ | 0 | 1 | 3 | 4 | 2 | 3 | 0 | 1 | 2 | 3 | 4 | 1 | 2 | 2 | 0 |
\begin{algorithmic}
\FOR{$r \leftarrow 0$ \TO $|y|-1$}
\STATE $\mathrm{Rank}[\mathrm{SA}[r]] \leftarrow r$
\ENDFOR
\STATE $\ell \leftarrow 0$
\FOR{$j \leftarrow 0$ \TO $|y|-1$}
\STATE $\ell \leftarrow \max\{0,\ell-1\}$
\IF{$\mathrm{Rank}[j] \gt 0$}
\WHILE{$\max\{j+\ell,\mathrm{SA}[\mathrm{Rank}[j]-1]+\ell\}\lt |y|$ \AND
$y[j+\ell]=y[\mathrm{SA}[\mathrm{Rank}[j]-1]+\ell]$}
\STATE $\ell \leftarrow \ell+1$
\ENDWHILE
\ELSE
\STATE $\ell \leftarrow 0$
\ENDIF
\STATE $\mathrm{LCP}[\mathrm{Rank}[j]] \leftarrow \ell$
\ENDFOR
\STATE $\mathrm{LCP}[|y|] \leftarrow 0$
\RETURN{$\mathrm{LCP}$}
\end{algorithmic}
Note the solution is counterintuitive, since it looks natural to compute the values $\LCP[r]$ sequentially, that is, by processing suffixes in the increasing order of their ranks. But this does not readily produce a linear-time algorithm. Instead, Algorithm Lcp processes the suffixes from the longest to the shortest, which is its key feature and is more efficient.