In mathematics, the Chvátal–Sankoff constants are mathematical constants that describe the lengths of longest common subsequences of random strings. Although the existence of these constants has been proven, their exact values are unknown. They are named after Václav Chvátal and David Sankoff, who began investigating them in the mid-1970s.[1] [2]
There is one Chvátal–Sankoff constant
\gammak
\gamma2
A common subsequence of two strings S and T is a string whose characters appear in the same order (not necessarily consecutively) both in S and in T. The problem of computing a longest common subsequence has been well studied in computer science. It can be solved in polynomial time by dynamic programming;[4] this basic algorithm has additional speedups for small alphabets (the Method of Four Russians),[5] for strings with few differences, for strings with few matching pairs of characters,[6] etc. This problem and its generalizations to more complex forms of edit distance have important applications in areas that include bioinformatics (in the comparison of DNA and protein sequences and the reconstruction of evolutionary trees), geology (in stratigraphy), and computer science (in data comparison and revision control).[7]
One motivation for studying the longest common subsequences of random strings, given already by Chvátal and Sankoff, is to calibrate the computations of longest common subsequences on strings that are not random. If such a computation returns a subsequence that is significantly longer than what would be obtained at random, one might infer from this result that the match is meaningful or significant.[1]
The Chvátal–Sankoff constants describe the behavior of the following random process. Given parameters n and k, choose two length-n strings S and T from the same k-symbol alphabet, with each character of each string chosen uniformly at random, independently of all the other characters. Compute a longest common subsequence of these two strings, and let
λn,k
λn,k
\gammak
More precisely, the expected value
\operatorname{E}[λn,k]
\operatorname{E}[λm+n,k]\ge\operatorname{E}[λm,k]+\operatorname{E}[λn,k]
\gammak=\limn\toinfty
\operatorname{E | |
[λ |
n,k]}{n}
\operatorname{E}[λn,k]/n
\gammak
The exact values of the Chvátal–Sankoff constants remain unknown, but rigorous upper and lower bounds have been proven.
Because
\gammak
\operatorname{E}[λn,k]
\gammak
\operatorname{E}[λn,k]
\gamma2\ge0.788071
k | Lower bound on \gammak | |
---|---|---|
2 | 0.788071 | |
3 | 0.671697 | |
4 | 0.599248 | |
5 | 0.539129 | |
6 | 0.479452 | |
7 | 0.44502 | |
8 | 0.42237 | |
9 | 0.40321 | |
10 | 0.38656 |
also used automata-theoretic methods to prove upper bounds on the Chvátal–Sankoff constants, and again extended these results by computerized calculations. The upper bound he obtained was
\gamma2\le0.826280
\gamma2=2/(1+\sqrt2)
\gamma2
0.811
In the limit as k goes to infinity, the constants
\gammak
\limk\toinfty\gammak\sqrtk=2.
There has also been research into the distribution of values of the longest common subsequence, generalizing the study of the expectation of this value. For instance, the standard deviation of the length of the longest common subsequence of random strings of length n is known to be proportional to the square root of n.[13]
One complication in performing this sort of analysis is that the random variables describing whether the characters at different pairs of positions match each other are not independent of each other. For a more mathematically tractable simplification of the longest common subsequence problem, in which the allowed matches between pairs of symbols are not controlled by whether those symbols are equal to each other but instead by independent random variables with probability 1/k of being 1 and (k - 1)/k of being 0, it has been shown that the distribution of the longest common subsequence length is controlled by the Tracy–Widom distribution.[14]