sequence alignment problem

1. Setup:

Input:
​ Strings $X = x_1 . . . x_m , Y = y_1 . . . y_n$ over some alphabet Σ like {A,C,G,T}.
​ Penalty $α_{gap}$ for inserting a gap, $α_{ab}$ for matching a & b [presumably $α_{ab} = 0$ of a = b]
Feasible solutions:
​ Alignments - i.e., insert gaps to equalize lengths of the string
Goal:
​ Alignment with minimum possible total penalty

2. Analysis

  • Optimal Substructure
    For a optimal alignment of X, Y, there are 3 relevant possibilities for the contents of the final position:
    $$case 1: x_m & y_n $$

    $$ case 2: x_m & gap$$

    $$ case 3: y_n & gap$$
    If case 1 holds, then induced alignment of X’ & Y’ is optimal;
    If case 2 holds, then induced alignment of X’ & Y is optimal;
    If case 3 holds, then induced alignment of X & Y’ is optimal. (contradiction)

  • Relevant Subproblems
    $(X_i , Y_j)$ where:
    $X_i$ = 1st i letters of X
    $Y_j$ = 1st j letters of Y
    (Only peel off letters from the right end of the strings.)

  • Recurrence
    $P_ij$ = penalty of optimal alignment of $X_i$ & $Y_j$ .
    For all $i = 1, . . . , m$ and $j = 1, . . . , n$:

    1
    P_ij = min(a[xi][yi]+P[i-1][j-1], a_gap+P[i][j-1], a_gap+P[i-1][j])

    Base case:
    $$P_{i,0} = P_{0,i} = i * alpha_{gap}$$

3. The Algorithm

1
2
3
4
5
A[][] is a 2-D array
A[i][0] = A[0][i] = i * a_gap
for i = 1:m
for j = 1:n
A[i][j] = min(A[i-1][j-1]+a_xiyj, A[i-1][j]+a_gap, A[i][j-1]+a_gap);
  • Running time: $O(mn)$

4. Reconstructing a Solution

Trace back from A[m][n].

1
2
3
4
5
6
7
8
9
while (m > 0 && n > 0) {
switch matching_case {
case 1: match x_i, y_j, i--, j--
case 2: match x_i, gap, i--
case 3: match gap, y_j, j--
}
}
while (n==0) {match remaining substring X using gaps of Y}
while (m==0) {match remaining substring Y using gaps of X}