Optimal matching is a sequence analysis method used in social science, to assess the dissimilarity of ordered arrays of tokens that usually represent a time-ordered sequence of socio-economic states two individuals have experienced. Once such distances have been calculated for a set of observations (e.g. individuals in a cohort) classical tools (such as cluster analysis) can be used. The method was tailored to social sciences[1] from a technique originally introduced to study molecular biology (protein or genetic) sequences (see sequence alignment). Optimal matching uses the Needleman-Wunsch algorithm.
Let
S=(s1,s2,s3,\ldotssT)
si
{S}
Optimal matching algorithms work by defining simple operator algebras that manipulate sequences, i.e. a set of operators
ai:{S} → {S}
s
\rmIns | |
a | |
s' |
(s1,s2,s3,\ldotssT)=(s1,s2,s3,\ldots,s',\ldotssT)
\rmDel | |
a | |
s2 |
(s1,s2,s3,\ldotssT)=(s1,s3,\ldotssT)
s1
s'1
\rmSub | |
a | |
s1,s'1 |
(s1,s2,s3,\ldotssT)=(s'1,s2,s3,\ldotssT)
Imagine now that a cost
c(ai)\in
+ | |
{R} | |
0 |
S1
S2
S2
S1
A={a1,a2,\ldotsan}
A
S1
S2
S2=a1\circa2\circ\ldots\circan(S1)
a1\circa2
c(A)=
n | |
\sum | |
i=1 |
c(ai)
A
S1
S2
d(S1,S2)=minA\left\{c(A)~{\rmsuch~that}~S2=A(S1)\right\}
S1
S2
d(S1,S2)
d(S1,S2)=0
S1=S2
c(a\rm)=c(a\rm)
Considering a set composed of only the three basic operations described above, this proximity measure satisfies the triangular inequality. Transitivity however, depends on the definition of the set of elementary operations.
Although optimal matching techniques are widely used in sociology and demography, such techniques also have their flaws. As was pointed out by several authors (for example L. L. Wu[2]), the main problem in the application of optimal matching is to appropriately define the costs
c(ai)