Optimal matching explained

Optimal matching is a sequence analysis method used in social science, to assess the dissimilarity of ordered arrays of tokens that usually represent a time-ordered sequence of socio-economic states two individuals have experienced. Once such distances have been calculated for a set of observations (e.g. individuals in a cohort) classical tools (such as cluster analysis) can be used. The method was tailored to social sciences[1] from a technique originally introduced to study molecular biology (protein or genetic) sequences (see sequence alignment). Optimal matching uses the Needleman-Wunsch algorithm.

Algorithm

Let

S=(s1,s2,s3,\ldotssT)

be a sequence of states

si

belonging to a finite set of possible states. Let us denote

{S}

the sequence space, i.e. the set of all possible sequences of states.

Optimal matching algorithms work by defining simple operator algebras that manipulate sequences, i.e. a set of operators

ai:{S}{S}

. In the most simple approach, a set composed of only three basic operations to transform sequences is used:

s

is inserted in the sequence
\rmIns
a
s'

(s1,s2,s3,\ldotssT)=(s1,s2,s3,\ldots,s',\ldotssT)

\rmDel
a
s2

(s1,s2,s3,\ldotssT)=(s1,s3,\ldotssT)

and

s1

is replaced (substituted) by state

s'1

,
\rmSub
a
s1,s'1

(s1,s2,s3,\ldotssT)=(s'1,s2,s3,\ldotssT)

.

Imagine now that a cost

c(ai)\in

+
{R}
0
is associatedto each operator. Given two sequences

S1

and

S2

,the idea is to measure the cost of obtaining

S2

from

S1

using operators from the algebra. Let

A={a1,a2,\ldotsan}

be a sequence of operators such that the application of all the operators of this sequence

A

to the first sequence

S1

gives the second sequence

S2

:

S2=a1\circa2\circ\ldots\circan(S1)

where

a1\circa2

denotes the compound operator. To this set we associate the cost

c(A)=

n
\sum
i=1

c(ai)

, thatrepresents the total cost of the transformation. One should consider at this point that there might exist different such sequences

A

that transform

S1

into

S2

; a reasonable choice is to select the cheapest of such sequences. We thuscall distance

d(S1,S2)=minA\left\{c(A)~{\rmsuch~that}~S2=A(S1)\right\}


that is, the cost of the least expensive set of transformations that turn

S1

into

S2

. Notice that

d(S1,S2)

is by definition nonnegative since it is the sum of positive costs, and trivially

d(S1,S2)=0

if and only if

S1=S2

, that is there is no cost. The distance function is symmetric if insertion and deletion costs are equal

c(a\rm)=c(a\rm)

; the term indel cost usually refers to the common cost of insertion and deletion.

Considering a set composed of only the three basic operations described above, this proximity measure satisfies the triangular inequality. Transitivity however, depends on the definition of the set of elementary operations.

Criticism

Although optimal matching techniques are widely used in sociology and demography, such techniques also have their flaws. As was pointed out by several authors (for example L. L. Wu[2]), the main problem in the application of optimal matching is to appropriately define the costs

c(ai)

.

Software

References and notes

  1. A. Abbott and A. Tsay, (2000) Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect Sociological Methods & Research
  2. L. L. Wu. (2000) Some Comments on "Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect" Sociological Methods & Research, 29 41-64.