Brief introduction:
Connectionist temporal classification (CTC) is used for sequence labelling on top of the last layer of an RNN. For an input sequence of length T, the network outputs a softmax vector at every time step as the prediction probabilities over the labels. CTC interprets these outputs as a probability distribution over label sequences, and the network is trained by minimizing the negative log-likelihood of the true label sequences over a training set.
The exponential number of possible alignments between a label sequence and an observation sequence is kept in check by an evaluation strategy inspired by the forward-backward algorithm for HMMs.
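To make the objective concrete, here is a minimal sketch (assuming PyTorch, with index 0 used for the blank label; sizes and values are purely illustrative) that evaluates the CTC negative log-likelihood of a label sequence given per-frame log-softmax outputs:

```python
import torch
import torch.nn as nn

# Illustrative sizes: T frames, batch of 1, n = (n - 1) classes + 1 blank.
T, batch, n = 12, 1, 5

# Per-frame log-softmax outputs, shape (T, batch, n), standing in for the RNN's softmax layer.
log_probs = torch.randn(T, batch, n).log_softmax(dim=2)

# A hypothetical true label sequence (no blanks) of length 4 over the classes 1..n-1.
targets = torch.tensor([[1, 3, 2, 4]])
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([4])

# CTC loss = -log p(l | x): the negative log-likelihood of the label sequence
# under the distribution the network defines over all label sequences.
ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```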
Description of CTC in detail:
- Assume a sequence with length T, where every frame is assigned one of n labels (n − 1 classes or blank).
- i.e. let x ∈ (R^m)^T be an input sequence of length T and y ∈ (R^n)^T the sequence of outputs produced by a network that classifies each frame into one of n − 1 classes or blank.
- Let L^T be the set of sequences of length T over the alphabet {1, . . . , n}. Then a sequence π ∈ L^T of labels of the same length as x has, conditioned on the observation x, the probability
  p(π | x) = ∏_{t=1..T} y^t_{πt},
  where y^t_k is the network output for label k at time t; such a sequence π is called a path.
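A small numpy sketch of this path probability, assuming a (T, n) matrix of per-frame softmax outputs and (as an illustrative encoding) index 0 for the blank label:

```python
import numpy as np

def path_probability(y, path):
    """p(pi | x) = prod over t of y[t, pi_t] for a single alignment (path) of length T.

    y    -- (T, n) array of per-frame softmax outputs (hypothetical values below)
    path -- length-T sequence of label indices, with 0 denoting blank here
    """
    T = y.shape[0]
    return float(np.prod([y[t, path[t]] for t in range(T)]))

# Tiny example: T = 3 frames, n = 3 labels (blank, 'a', 'b').
y = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.7, 0.1],
              [0.5, 0.2, 0.3]])
print(path_probability(y, [1, 1, 0]))  # p("a a blank" | x) = 0.3 * 0.7 * 0.5
```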
- Define ß: L^T → L^{≤T} as the function that maps a path to a label sequence by first merging repeated labels and then removing the blanks, for example ß(aa−−ab−) = ß(a−−−ab−) = aab.
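A sketch of ß, assuming labels are encoded as integers with 0 standing for blank (a hypothetical encoding, matching the sketch above):

```python
from itertools import groupby

BLANK = 0  # assumed index of the blank label

def collapse(path):
    """ß: merge repeated labels, then drop blanks, e.g. ß(aa--ab-) = aab."""
    merged = [label for label, _ in groupby(path)]        # merge repeats
    return [label for label in merged if label != BLANK]  # remove blanks

# With 1 = 'a' and 2 = 'b', both paths below collapse to "aab".
print(collapse([1, 1, 0, 0, 1, 2, 0]))  # [1, 1, 2]
print(collapse([1, 0, 0, 0, 1, 2, 0]))  # [1, 1, 2]
```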
- Then the probability of any label sequence l of length at most T can be computed by summing over all paths that collapse to it:
  p(l | x) = Σ_{π ∈ ß^{-1}(l)} p(π | x).
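For tiny examples this sum can be evaluated by brute force over all n^T paths, which also makes the combinatorial explosion mentioned next very visible; a sketch (same illustrative encoding as above):

```python
from itertools import groupby, product
import numpy as np

BLANK = 0

def collapse(path):
    merged = [k for k, _ in groupby(path)]
    return [k for k in merged if k != BLANK]

def label_sequence_probability(y, l):
    """p(l | x) = sum of prod_t y[t, pi_t] over all length-T paths pi with ß(pi) = l.

    Brute-force enumeration of all n**T paths -- only feasible for tiny T,
    which is exactly what the forward-backward recursion below avoids.
    """
    T, n = y.shape
    total = 0.0
    for pi in product(range(n), repeat=T):  # all n**T alignments
        if collapse(pi) == list(l):
            total += np.prod([y[t, pi[t]] for t in range(T)])
    return float(total)

y = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.7, 0.1],
              [0.5, 0.2, 0.3]])
print(label_sequence_probability(y, [1]))     # p("a" | x)
print(label_sequence_probability(y, [1, 2]))  # p("ab" | x) = 0.231
```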
- For any t, the probability of a labelling l can also be expressed in terms of forward and backward variables, which are defined next.
Training by maximizing the log-likelihood with this formula naively would require summing over all paths corresponding to a label, and "in general there are very many of these" (Graves). However, the combinatorial explosion can be avoided by dynamic programming similar to the forward-backward algorithm for HMMs. For a labelling l, define the total probability of its first s labels l1:s at time t as
  αt(s) = Σ_{π: ß(π1:t) = l1:s} ∏_{t'=1..t} y^{t'}_{πt'},
where α is called the forward variable.
Define a backward variable β for suffix probabilities in an analogous way.
The important insight is that αt(s) can be computed recursively from αt−1(s) and αt−1(s − 1) (and, in the usual formulation over the blank-augmented label sequence l', from αt−1(s − 2) as well when l's is neither a blank nor equal to l's−2), and similarly for β.
Then for any t the probability of a labelling l can be rewritten as
  p(l | x) = Σs αt(s) βt(s) / y^t_{l's},
where the sum runs over the positions s of the blank-augmented sequence l' and the division removes the output probability at time t, which is counted in both αt(s) and βt(s).
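A numpy sketch of the forward-backward recursion, written (as in the standard CTC formulation) over the blank-augmented sequence l' obtained by inserting blanks between labels and at both ends; the blank index 0 and the example values are assumptions. Both αt(s) and βt(s) here include the output probability at frame t, which is why the identity at the end divides by y^t_{l's}:

```python
import numpy as np

BLANK = 0

def augment(l):
    """Insert blanks between labels and at both ends: l -> l'."""
    lp = [BLANK]
    for k in l:
        lp += [k, BLANK]
    return lp

def forward_backward(y, l):
    """alpha[t, s]: total probability of all prefixes pi_{1:t} aligned to l'_{1:s};
    beta[t, s]  : total probability of all suffixes pi_{t:T} aligned to l'_{s:}."""
    T = y.shape[0]
    lp = augment(l)
    S = len(lp)

    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, lp[0]]
    if S > 1:
        alpha[0, 1] = y[0, lp[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s >= 1:
                a += alpha[t - 1, s - 1]
            if s >= 2 and lp[s] != BLANK and lp[s] != lp[s - 2]:
                a += alpha[t - 1, s - 2]   # skip the blank between two different labels
            alpha[t, s] = a * y[t, lp[s]]

    beta = np.zeros((T, S))
    beta[T - 1, S - 1] = y[T - 1, lp[S - 1]]
    if S > 1:
        beta[T - 1, S - 2] = y[T - 1, lp[S - 2]]
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1, s]
            if s + 1 < S:
                b += beta[t + 1, s + 1]
            if s + 2 < S and lp[s] != BLANK and lp[s] != lp[s + 2]:
                b += beta[t + 1, s + 2]
            beta[t, s] = b * y[t, lp[s]]

    return alpha, beta, lp

y = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.7, 0.1],
              [0.5, 0.2, 0.3]])
alpha, beta, lp = forward_backward(y, [1, 2])   # labelling "ab"
T, S = alpha.shape

# p(l | x) read off the end of the forward pass ...
print(alpha[T - 1, S - 1] + alpha[T - 1, S - 2])            # 0.231

# ... and the same value recovered at every t from alpha and beta.
for t in range(T):
    print(sum(alpha[t, s] * beta[t, s] / y[t, lp[s]] for s in range(S)))
```

The same example evaluated with the brute-force sum above also gives 0.231, which is a convenient sanity check for the recursion.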
The loss function, the negative log-likelihood −ln p(l | x), and its gradient with respect to the network outputs can be derived from the forward and backward variables above, and the network is optimized with stochastic gradient descent (SGD).
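A sketch of one SGD step under these assumptions, using PyTorch's built-in nn.CTCLoss; the model architecture, sizes, and data below are placeholders rather than anything prescribed by the source:

```python
import torch
import torch.nn as nn

# Illustrative sizes: T frames of m-dimensional features, n = (n - 1) classes + blank.
T, batch, m, hidden, n = 50, 4, 13, 32, 28

class FrameClassifier(nn.Module):
    """Hypothetical bidirectional LSTM with a per-frame softmax output layer."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(m, hidden, bidirectional=True)   # expects (T, batch, m)
        self.proj = nn.Linear(2 * hidden, n)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.proj(h).log_softmax(dim=2)              # (T, batch, n) log-probs

model = FrameClassifier()
criterion = nn.CTCLoss(blank=0)                             # loss = -log p(l | x)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# Dummy batch standing in for real features and transcriptions.
x = torch.randn(T, batch, m)
targets = torch.randint(1, n, (batch, 10))                  # label sequences of length 10
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), 10, dtype=torch.long)

# One SGD step; the gradient of -log p(l | x) with respect to the network
# outputs is computed internally from the forward and backward variables.
optimizer.zero_grad()
loss = criterion(model(x), targets, input_lengths, target_lengths)
loss.backward()
optimizer.step()
print(loss.item())
```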
Decoding after CTC output:
- Best path decoding
  - take the most probable output at every time step and collapse the resulting path with ß; simple and fast, but not guaranteed to find the most probable labelling (see the sketch after this list)
- Prefix-search decoding
  - not recommended due to its worst-case runtime
- Viterbi decoding
  - Recommended. It produces the most likely sequence of hidden states (as used for HMMs).
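A minimal numpy sketch of best path decoding (same illustrative encoding as earlier, blank = 0); as noted above, the greedy per-frame path only approximates the most probable labelling:

```python
import numpy as np
from itertools import groupby

BLANK = 0

def best_path_decode(y):
    """Pick the most probable label at every frame, then collapse with ß."""
    path = np.argmax(y, axis=1)                  # greedy per-frame argmax
    merged = [int(k) for k, _ in groupby(path)]  # merge repeats
    return [k for k in merged if k != BLANK]     # drop blanks

y = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.7, 0.1],
              [0.5, 0.2, 0.3]])
print(best_path_decode(y))   # argmax path [0, 1, 0] collapses to [1], i.e. "a"
```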