Connectionist Temporal Classification

Brief introduction:

Connectionist temporal classification (CTC) is used for sequence learning on top of the last layer of an RNN. For a sequence of length T, the RNN outputs a softmax vector at every time step as the prediction probability. CTC interprets the RNN output as a probability distribution over label sequences and is trained by minimizing the negative log-likelihood of the true label sequences over a training set.

The exponential number of possible alignments between a label sequence and an observation sequence is kept in check by an evaluation strategy inspired by the forward-backward algorithm for HMMs.
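
As a concrete illustration (an addition to these notes, not from the original), here is a minimal sketch of this objective using PyTorch's nn.CTCLoss; every size and tensor below is made up, and the random rnn_out merely stands in for the per-frame outputs of an RNN:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: T time steps, batch of N sequences, n classes (index 0 = blank).
T, N, n_classes = 50, 4, 20

rnn_out = torch.randn(T, N, n_classes)        # stand-in for the RNN's last layer
log_probs = rnn_out.log_softmax(dim=-1)       # per-frame log-probabilities, shape (T, N, n)

# True label sequences (values 1..n-1, no blanks), padded into one tensor.
targets = torch.randint(1, n_classes, (N, 30), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(10, 30, (N,), dtype=torch.long)

# nn.CTCLoss returns the negative log-likelihood of the label sequences,
# summing over all compatible alignments internally.
ctc_loss = nn.CTCLoss(blank=0, reduction="mean")
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss)
```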

Description of CTC in detail:

  • Assume a sequence of length T; at every time step the network predicts one of n labels (one of n − 1 classes, or blank).

    • i.e. Let x ∈ (R^m)^T be an input sequence of length T and y ∈ (R^n)^T the sequence of outputs produced by a network that classifies each frame into one of n − 1 classes or blank.

    Let L^T be the set of sequences of length T over the alphabet {1, . . . , n}. Then a sequence π ∈ L^T of labels of the same length as x has probability:

    p(π | x) = ∏_{t=1}^{T} y^t_{π_t}

  • Conditioned on the observation x, such a sequence π is called a path.
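
For concreteness (an added sketch with a made-up softmax matrix y), the path probability is just the product of the per-frame scores picked out by the path:

```python
import numpy as np

# Made-up softmax outputs y: one row per time step, one column per label.
T, n = 6, 4
rng = np.random.default_rng(0)
y = rng.random((T, n))
y /= y.sum(axis=1, keepdims=True)        # each row now sums to 1, like a softmax output

def path_probability(y, pi):
    """p(pi | x) = product over t of y[t, pi[t]], for a path pi of length T."""
    return float(np.prod(y[np.arange(len(pi)), pi]))

pi = [0, 1, 1, 0, 2, 0]                  # one possible path of length T
print(path_probability(y, pi))
```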

  • Define ß: L^T → L^{≤T} as the function that maps a path to a sequence.

    • with all blanks and repeated labels removed from the path, for example ß(aa−−ab−) = ß(a−−−ab−) = aab.

      Then the probability of any label sequence l of length at most T can be computed as:

      p(l | x) = Σ_{π ∈ ß^{-1}(l)} p(π | x)
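
A small added sketch of ß and of this sum, assuming label 0 is the blank. The brute-force enumeration over all n^T paths is only feasible for tiny T, which is exactly what the forward-backward recursion described next avoids:

```python
import numpy as np
from itertools import product

BLANK = 0   # assumed index of the blank label

def collapse(path):
    """ß: collapse repeated labels, then drop blanks (e.g. (1,1,0,0,1,2,0) -> (1,1,2))."""
    out, prev = [], None
    for label in path:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return tuple(out)

def label_probability(y, l):
    """Brute-force p(l|x): sum p(pi|x) over every path pi with ß(pi) = l."""
    T, n = y.shape
    total = 0.0
    for pi in product(range(n), repeat=T):          # all n**T paths: exponential, tiny T only
        if collapse(pi) == tuple(l):
            total += float(np.prod(y[np.arange(T), pi]))
    return total

# Tiny example with made-up softmax outputs (rows sum to 1).
rng = np.random.default_rng(0)
y = rng.random((5, 3)); y /= y.sum(axis=1, keepdims=True)
print(label_probability(y, [1, 2]))
```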

  • For any t, the probability of a labelling l can be computed efficiently:

    • Training by maximizing the log-likelihood with this formula naively would require
      summing over all paths corresponding to a label, and "in general there are very many
      of these" (Graves). However, the combinatorial explosion can be avoided by dynamic
      programming similar to the forward-backward algorithm for HMMs. For a labelling l,
      define the total probability of its first s labels l_{1:s} at time t as:

      α_t(s) = Σ_{π : ß(π_{1:t}) = l_{1:s}} ∏_{t'=1}^{t} y^{t'}_{π_{t'}}

      where α_t(s) is called the forward variable.

      Define a backward variable β_t(s) analogously as the total probability that the frames after t produce the remaining part of l (everything after l_{1:s}).

      The important insight is that α_t(s) can be computed recursively from α_{t−1}(s) and α_{t−1}(s − 1), and similarly for β.

      Then for any t the probability of a labelling l can be rewritten as:
      p(l | x) = Σ_s α_t(s) · β_t(s)
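
A sketch of this dynamic program (added here for illustration; it follows the standard formulation of Graves et al. 2006, where the recursion runs over the blank-augmented sequence l' = (−, l_1, −, l_2, ..., −) rather than over l itself):

```python
import numpy as np

BLANK = 0   # assumed index of the blank label

def ctc_forward(y, l):
    """CTC forward recursion over the blank-augmented sequence l'.
    Returns p(l|x) and the table of forward variables alpha[t, s]."""
    T = y.shape[0]
    lp = [BLANK]                           # l' = (blank, l_1, blank, l_2, ..., blank)
    for label in l:
        lp += [label, BLANK]
    S = len(lp)

    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, lp[0]]              # a path may start with a blank ...
    if S > 1:
        alpha[0, 1] = y[0, lp[1]]          # ... or with the first real label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                            # stay on the same symbol
            if s >= 1:
                a += alpha[t - 1, s - 1]                   # advance by one symbol
            if s >= 2 and lp[s] != BLANK and lp[s] != lp[s - 2]:
                a += alpha[t - 1, s - 2]                   # skip the blank between different labels
            alpha[t, s] = a * y[t, lp[s]]
    # A complete labelling ends on the last label or on the final blank.
    p = alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
    return p, alpha

rng = np.random.default_rng(0)
y = rng.random((5, 3)); y /= y.sum(axis=1, keepdims=True)
print(ctc_forward(y, [1, 2])[0])
```

The result should agree with the brute-force label_probability sketch above, while needing only O(T·|l'|) work instead of O(n^T).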

  • The loss function and its gradient can be derived from the forward and backward variables above, and the network is optimized with SGD (a training-step sketch follows below).
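
For concreteness (again an added, hypothetical setup, in the spirit of the PyTorch sketch in the introduction), a single SGD training step could look like:

```python
import torch
import torch.nn as nn

n_features, n_hidden, n_classes = 13, 64, 20     # hypothetical sizes; class 0 = blank

rnn = nn.LSTM(n_features, n_hidden)              # expects input of shape (T, N, n_features)
proj = nn.Linear(n_hidden, n_classes)
ctc_loss = nn.CTCLoss(blank=0)
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(proj.parameters()), lr=1e-2)

def train_step(x, targets, input_lengths, target_lengths):
    """One SGD step on a batch: forward pass, CTC loss, backward pass, parameter update."""
    optimizer.zero_grad()
    hidden, _ = rnn(x)
    log_probs = proj(hidden).log_softmax(dim=-1)  # (T, N, n_classes)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()
    optimizer.step()
    return loss.item()

# One step on a made-up batch.
x = torch.randn(50, 4, n_features)
targets = torch.randint(1, n_classes, (4, 30))
print(train_step(x, targets,
                 input_lengths=torch.full((4,), 50, dtype=torch.long),
                 target_lengths=torch.randint(10, 30, (4,), dtype=torch.long)))
```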

  • Decoding after CTC output:

    • Best path decoding

    h(x) ≈ ß(π*),  where π* = argmax_{π ∈ L^T} p(π | x)
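
An added sketch of best path decoding, assuming label 0 is the blank: take the argmax label at every time step and collapse the resulting path with ß.

```python
import numpy as np

BLANK = 0

def best_path_decode(y):
    """Greedy CTC decoding: pick the argmax label per frame, then collapse with ß."""
    pi_star = y.argmax(axis=1)                 # most probable label at each time step
    out, prev = [], None
    for label in pi_star:
        if label != prev and label != BLANK:   # ß: drop repeats, then blanks
            out.append(int(label))
        prev = label
    return out

rng = np.random.default_rng(0)
y = rng.random((5, 3)); y /= y.sum(axis=1, keepdims=True)
print(best_path_decode(y))
```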

    • Prefix-search decoding

      • not recommended due to its worst-case runtime, which can grow exponentially with the input length

    • Viterbi decoding

      • Recommended. It produces the most likely sequence of hidden states (as used for HMMs)